bioboxes / rfc

Request for comments on interchangeable bioinformatics containers
http://bioboxes.org
MIT License

Format of FASTQ #10

Closed fungs closed 9 years ago

fungs commented 9 years ago

Which format should FASTQ files be in? I know there are at least two variants: one allows line breaks within the sequence and quality fields, the other does not (the latter is more restricted and easier to parse). There should be a reference for every file format definition. If there is no stable definition on the web, we might have to provide one ourselves...

Should CONT_PAIRED_FASTQ_FILE/CONT_PAIRED_FASTQ_FILE/CONT_PAIRED_FASTQ_FILE_LISTING contain the path to a single interleaved FASTQ file or to a forward and reverse file? Again, I would put a link to a definition.

Johannes

abremges commented 9 years ago

> Which format should FASTQ files be in? I know there are at least some ways to allow for line breaks or not (the latter is more restricted and easier to parse). There should be a reference to every file format definition. If there is no stable definition on the web, we might have to provide it...

The unreferenced community standard is a single-line sequence and a single-line quality string, with quality values encoded as Phred+33 (i.e. either Sanger or Illumina 1.8+). I guess this was largely driven by the native sequencer output, in particular the Illumina file format. I like its Wikipedia entry: http://en.wikipedia.org/wiki/FASTQ_format
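To make the Phred+33 convention concrete, here is a minimal sketch of decoding a quality string. The function names `decode_phred33` and `error_probability` are hypothetical and chosen only for illustration; the offset of 33 and the relation P = 10^(-Q/10) are the standard Phred definitions.

```python
PHRED_OFFSET = 33  # Sanger / Illumina 1.8+ encoding


def decode_phred33(quality: str) -> list[int]:
    """Convert a FASTQ quality string into integer Phred scores."""
    scores = [ord(ch) - PHRED_OFFSET for ch in quality]
    if any(score < 0 for score in scores):
        # Characters below '!' (ASCII 33) cannot occur in Phred+33 data.
        raise ValueError("quality string contains a character below '!'")
    return scores


def error_probability(score: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)
```

For example, the character `I` (ASCII 73) decodes to a score of 40, which corresponds to an error probability of about 1 in 10,000.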

Most tools can handle 4-line FASTQ in some way, but might break with multi-line sequence/quality entries. Valid sequence identifiers and description are another thing where some tools behave ... weirdly. The ID is the first word after '@', anything after the first space character belongs to the optional description. Flawed tools might alter the ID (e.g. cut after xx characters) and/or drop the description. We should encourage neither.
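The ID/description rule above can be sketched in a few lines. `split_header` is a hypothetical helper, not part of any existing tool; it keeps both the full ID and the description intact rather than truncating either.

```python
def split_header(header_line: str) -> tuple[str, str]:
    """Split a FASTQ header line into (identifier, description).

    The identifier is the first word after '@'; everything after the
    first space is the optional description (empty string if absent).
    """
    if not header_line.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    body = header_line[1:].rstrip("\r\n")
    read_id, _, description = body.partition(" ")
    return read_id, description
```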

> Should CONT_PAIRED_FASTQ_FILE/CONT_PAIRED_FASTQ_FILE/CONT_PAIRED_FASTQ_FILE_LISTING contain the path to a single interleaved FASTQ file or to a forward and reverse file? Again, I would put a link to a definition.

If I recall correctly, we agreed on interleaved FASTQ files. Good point, though.

As far as I know (correct me if I'm wrong), Illumina machines produce two files: one forward and one reverse. At the JGI, these files are merged immediately to ease storage and management. Many tools, however, expect those separate files. I'm torn: I like the idea of having only a single file, but it often complicates things (I'm looking at you, bowtie!). We should probably settle for the native sequencer output here, too.

fungs commented 9 years ago

My point was that all of these details should become part of the specification.

I also prefer a single file for paired FASTQ data, and I usually convert single files to forward and reverse files on the fly using UNIX FIFOs. Unfortunately, not all tools are able to work with data streams.
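The on-the-fly conversion described above could look roughly like the sketch below. It assumes strict 4-line records with mates stored back to back (forward, then reverse); the names `records` and `deinterleave` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, TextIO


def records(lines: Iterable[str]) -> Iterator[list[str]]:
    """Yield 4-line FASTQ records from an iterable of lines."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def deinterleave(interleaved: Iterable[str], forward: TextIO, reverse: TextIO) -> int:
    """Write alternating records to forward/reverse; return the pair count."""
    pairs = 0
    recs = records(interleaved)
    for fwd in recs:
        rev = next(recs, None)  # the mate must directly follow
        if rev is None:
            raise ValueError("odd number of records; input is not interleaved pairs")
        forward.writelines(fwd)
        reverse.writelines(rev)
        pairs += 1
    return pairs
```

Because the function only needs writable file objects, `forward` and `reverse` could in principle be named pipes created beforehand (for example with `os.mkfifo`), which would let a downstream tool read the two streams without intermediate files.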

abremges commented 9 years ago

Agreed, we should include such small but important details in the specification. And from my point of view, we shouldn't go by personal taste but by vendor-driven, and thus most widely used, format conventions. The vast majority of all sequence data is produced on Illumina machines, so we should check how these machines natively deliver their read file(s).

fungs commented 9 years ago

For future versions: Why do FASTQ files have to be gzipped? If I have files that are not compressed, or that are compressed more aggressively using, for example, xz, how do I specify them? (I think an ontology would help here.)

michaelbarton commented 9 years ago

I agree that this is an important issue, particularly because both FASTA and FASTQ are underspecified.

> Unreferenced community standard is single-line sequence and quality, quality values encoded Phred+33 (i.e. either Sanger or Illumina 1.8+)

I agree with this as this seems to be the standard produced by current Illumina machines.

> most widely used, format conventions.

From developing containers for nucleotid.es, I have observed that the majority of assemblers accept interleaved FASTQ in gz format.

michaelbarton commented 9 years ago

> For future versions: Why do FASTQ files have to be gzipped? If I have files which are not compressed or compressed more aggressively using for example xz, how do I specify them? (I think an ontology would help here).

I think this is a very good point. I was thinking about this while writing #9. My ideal solution would be to drop the FASTA/FASTQ formats altogether and instead use a structured data format such as YAML.

If we consider a FASTQ trimmer that provides the following morphism:

f : [(Q,Q)] → [(Q,Q)]

This morphism takes a series of paired FASTQ entries and returns the same with some modifications to the read sequence. The problem I see here is that this specification is untrue as the morphism is really:

f : [X] → [X]

Both sets of files are just long arrays of lines without any formal structure, so you need out-of-band information about how FASTQ is structured to fully understand the meaning of each line. This is further compounded because paired and unpaired FASTQ are essentially identical. If we instead used a structured format, we could write morphism definitions such as:

f : [(I,S,Q)] → [(I,S,Q)]

Where I = identifier, S = nucleotide sequence, Q = quality values.

This unambiguously defines the input and output as a list of tuples where the position in the tuple is part of an ontology, similar to what Johannes suggested.

f : [((I,S,Q), (I,S,Q))] → [(I,S)]

This is an example of how a genome assembler might be similarly encoded. Here it is clear that it is paired reads because it is a list of 2-tuples. You can guess the format of the output also since it is tuples of identifiers and sequence.

f : [((I,S,Q), (I,S,Q))] → [(I,T)]

Here's another example of how a read binner might be specified using the same format. The output is a list of identifier and taxonomy 2-tuples, where T = taxonomy.
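As a rough illustration of the signatures above, the tuples can be written as typed records, so that the difference between single reads, read pairs, contigs and taxonomic assignments is carried by the type rather than by convention. All class and alias names here are hypothetical, and `trim_tail` is a toy trimmer included only to show one function matching the first signature.

```python
from typing import Callable, NamedTuple


class Read(NamedTuple):        # (I, S, Q)
    identifier: str
    sequence: str
    quality: str


class Contig(NamedTuple):      # (I, S)
    identifier: str
    sequence: str


class Assignment(NamedTuple):  # (I, T)
    identifier: str
    taxonomy: str


ReadPair = tuple[Read, Read]

# f : [(I,S,Q)] -> [(I,S,Q)]
Trimmer = Callable[[list[Read]], list[Read]]
# f : [((I,S,Q), (I,S,Q))] -> [(I,S)]
Assembler = Callable[[list[ReadPair]], list[Contig]]
# f : [((I,S,Q), (I,S,Q))] -> [(I,T)]
Binner = Callable[[list[ReadPair]], list[Assignment]]


def trim_tail(reads: list[Read], keep: int = 3) -> list[Read]:
    """Toy trimmer: keep only the first `keep` bases and quality values."""
    return [Read(r.identifier, r.sequence[:keep], r.quality[:keep]) for r in reads]
```

The same records could be serialised to YAML or any other structured format; the point of the sketch is only that paired and unpaired input become distinguishable by type.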

As for the choice between gz, bzip2 or xz, you could require that the assembler determine the type of compression from the file type, while the underlying data structure would stay the same.

touch log
xz log
file log.xz
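A container could do something similar to `file` by inspecting the first bytes of its input. The sketch below uses the well-known magic numbers for gzip, bzip2 and xz and falls back to plain text; `open_maybe_compressed` is a hypothetical helper name.

```python
import bz2
import gzip
import lzma

# Leading magic bytes mapped to the opener for that compression type.
MAGIC = {
    b"\x1f\x8b": gzip.open,       # gzip
    b"BZh": bz2.open,             # bzip2
    b"\xfd7zXZ\x00": lzma.open,   # xz
}


def open_maybe_compressed(path: str):
    """Open a text file, transparently decompressing gz, bz2 or xz input."""
    with open(path, "rb") as fh:
        head = fh.read(6)  # the longest magic number above is 6 bytes
    for magic, opener in MAGIC.items():
        if head.startswith(magic):
            return opener(path, "rt")
    return open(path, "rt")  # no known magic: treat as uncompressed
```

Detecting the type from the content rather than the file extension means a misnamed file is still read correctly, at the cost of one extra open per input.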

This is quite a large task though and so should possibly be left until after 1.0.

fungs commented 9 years ago

In this context, it could make sense to formally distinguish between data/information types and format types. The compression type is a form of the latter. This would allow for automatic conversion if containers are not capable of handling a specific format (e.g. using signature matching).

michaelbarton commented 9 years ago

There is no formal definition for FASTA or FASTQ in the same way there are RFC definitions for JSON and CSV. Furthermore I do not think we can enforce a strict version of either of these formats. Instead we can include community definitions for these formats and hope the container returns a sensible error message and non-zero exit code if given incorrect FASTQ or FASTA.

We should include the MAQ grammar as the specification for FASTQ, and the posted BioStar grammar for FASTA. This at least will have a definition we can use and update if necessary. We can also use the definition of FASTQ published in a NAR article.

michaelbarton commented 9 years ago

In addition to this grammar, we should state that the data should use Phred+33 offsets for the quality values.
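A container-side check along these lines might look like the sketch below. It is a simplified stand-in, not the MAQ grammar itself: it assumes strict 4-line records and printable Phred+33 quality characters (ASCII 33 to 126). The names `validate_fastq` and `main` are hypothetical.

```python
import sys
from typing import Iterable


def validate_fastq(lines: Iterable[str]) -> list[str]:
    """Return a list of human-readable problems; empty means valid."""
    rows = [line.rstrip("\r\n") for line in lines]
    errors = []
    if len(rows) % 4 != 0:
        errors.append(f"expected a multiple of 4 lines, got {len(rows)}")
    # Check only the complete records.
    for start in range(0, len(rows) - len(rows) % 4, 4):
        header, sequence, separator, quality = rows[start:start + 4]
        number = start // 4 + 1
        if not header.startswith("@"):
            errors.append(f"record {number}: header does not start with '@'")
        if not separator.startswith("+"):
            errors.append(f"record {number}: separator does not start with '+'")
        if len(sequence) != len(quality):
            errors.append(f"record {number}: sequence and quality lengths differ")
        if any(not 33 <= ord(ch) <= 126 for ch in quality):
            errors.append(f"record {number}: quality is not valid Phred+33")
    return errors


def main(path: str) -> int:
    """Print problems to stderr and return a process exit code."""
    with open(path) as fh:
        errors = validate_fastq(fh)
    for error in errors:
        print(f"{path}: {error}", file=sys.stderr)
    return 1 if errors else 0
```

Wiring this up as `sys.exit(main(sys.argv[1]))` would give the non-zero exit code and error message mentioned above when the input is malformed.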

michaelbarton commented 9 years ago

I have created a pull request to resolve this issue.

michaelbarton commented 9 years ago

Resolved by #24

fungs commented 9 years ago

For completeness, I'm adding this reference, which covers the definition of the FASTQ format: doi://10.1093/nar/gkp1137

michaelbarton commented 9 years ago

That paper is included in the description of the FASTQ format - https://github.com/bioboxes/rfc/blob/master/data-format/sequence.mkd#description