MueFab / genie

Open Source MPEG-G Codec

Fastq-Exporter: Add support for mixed single/paired end data #9

Open MueFab opened 4 years ago

MueFab commented 4 years ago

Describe the bug

If a single mgb file contains both paired and unpaired unaligned data units and is decompressed into fastq, the unpaired records are written into the same output files as the paired records. This shifts the two files against each other and destroys the pairing information of the paired records.

To Reproduce

  1. Create an mgb file containing 2 access units with unaligned data, one with paired reads and one with unpaired reads.
  2. Decompress the file with genie.
  3. The generated fastq files have different sizes, and the pairing information is destroyed if the unpaired (single-end) access unit is decoded before the paired one.
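
The offset can be illustrated with a small stand-alone sketch. It does not use Genie's code or API; `naive_export` and the read names are hypothetical, and each record is modeled as a tuple holding one read name per segment.

```python
def naive_export(access_units):
    """Hypothetical exporter that mimics the reported behavior:
    segment 1 always goes to file 1, segment 2 (if any) to file 2."""
    file_1, file_2 = [], []
    for unit in access_units:
        for record in unit:
            file_1.append(record[0])
            if len(record) > 1:
                file_2.append(record[1])
    return file_1, file_2


# One unpaired access unit decoded before one paired access unit.
units = [
    [("u1",)],
    [("p1/1", "p1/2"), ("p2/1", "p2/2")],
]
file_1, file_2 = naive_export(units)

# Reading both files in lockstep now pairs u1 with p1/2 and p1/1 with p2/2.
print(list(zip(file_1, file_2)))
```

Because file 1 holds three entries and file 2 only two, every paired read after the unpaired one is matched with the wrong mate.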

Expected behavior

A better approach would be to generate three fastq files, two for paired records and one for all single-end reads, which preserves the pairing information. Unused files could be deleted when the application shuts down.
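
A minimal sketch of that three-file idea, written in Python rather than against Genie's actual exporter. The function name, the file names `paired_1.fastq`, `paired_2.fastq` and `single.fastq`, and the record layout (a list of `(name, sequence, quality)` segments) are all assumptions made for illustration. Unused files are removed at the end of the call here, standing in for cleanup at application shutdown.

```python
from pathlib import Path


def format_fastq(name, seq, qual):
    """Render one segment as a four-line FASTQ entry."""
    return f"@{name}\n{seq}\n+\n{qual}\n"


def export_three_files(records, out_dir):
    """Write paired records to two files and single-end records to a third.

    Hypothetical helper: `records` is an iterable of lists holding one or
    two (name, seq, qual) segments. Returns the entry count per file key.
    """
    out_dir = Path(out_dir)
    paths = {
        "paired_1": out_dir / "paired_1.fastq",
        "paired_2": out_dir / "paired_2.fastq",
        "single": out_dir / "single.fastq",
    }
    counts = dict.fromkeys(paths, 0)
    handles = {key: path.open("w") for key, path in paths.items()}
    try:
        for segments in records:
            if len(segments) == 2:
                targets = ("paired_1", "paired_2")
            elif len(segments) == 1:
                targets = ("single",)
            else:
                raise ValueError("expected one or two segments per record")
            for key, segment in zip(targets, segments):
                handles[key].write(format_fastq(*segment))
                counts[key] += 1
    finally:
        for handle in handles.values():
            handle.close()
    # Delete files that received no entries.
    for key, path in paths.items():
        if counts[key] == 0:
            path.unlink()
    return counts
```

With this routing the two paired files always contain the same number of entries, regardless of the order in which access units are decoded.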

shubhamchandak94 commented 4 years ago

This also applies to fastq decompression on the feature/paired-end branch when you don't specify --combine-pairs: some records come out with only one segment and some with two. It might be worthwhile to write the one-segment records into separate files (potentially four files in total: matched_1.fastq, matched_2.fastq, unmatched_1.fastq, unmatched_2.fastq).
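
A rough sketch of the four-file routing, independent of Genie's code. It assumes each decoded segment carries a mate index (1 or 2) so that a lone segment can be sent to the right unmatched file; that field and the function name `route_segments` are hypothetical, and the output is kept in memory as lists of read names for brevity.

```python
def route_segments(records):
    """Assign each segment to one of four output names.

    Hypothetical helper: `records` is an iterable of lists holding one or
    two (mate_index, read_name) tuples, with mate_index equal to 1 or 2.
    """
    buckets = {
        "matched_1.fastq": [],
        "matched_2.fastq": [],
        "unmatched_1.fastq": [],
        "unmatched_2.fastq": [],
    }
    for segments in records:
        if len(segments) not in (1, 2):
            raise ValueError("expected one or two segments per record")
        prefix = "matched" if len(segments) == 2 else "unmatched"
        for mate_index, read_name in segments:
            if mate_index not in (1, 2):
                raise ValueError("mate index must be 1 or 2")
            buckets[f"{prefix}_{mate_index}.fastq"].append(read_name)
    return buckets
```

Only two-segment records reach the matched files, so those two stay aligned entry by entry, while the unmatched files may differ in length.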