chanzuckerberg / idseq-workflows

Portable WDL workflows for IDseq production pipelines
https://idseq.net/
MIT License

idseq_dag/util/fasta.py: don't mangle FASTQ read names with pipe characters #62

Closed mlin closed 3 years ago

mlin commented 3 years ago

Change the intermediate delimiter used by util.fastq.sort_fastx_by_entry_id() from | to `, in the hope that a backtick is even less likely to appear in read names. When a read name contains the delimiter, the output gets mangled. (I'm working with some CAMI challenge datasets that use pipes in their read names.)
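To illustrate the failure mode, here is a minimal sketch, assuming the sort works by flattening each 4-line FASTQ record into one delimiter-joined line and splitting it back afterwards. The helpers `flatten` and `unflatten` and the read name are hypothetical and are not taken from idseq_dag.

```python
# Hypothetical helpers; the real sort_fastx_by_entry_id() may differ.
def flatten(record, delim):
    """Join the four FASTQ lines of one record into a single sortable line."""
    return delim.join(record)


def unflatten(line, delim):
    """Split a flattened line back into its fields."""
    return line.split(delim)


# Made-up record whose read name contains a pipe, as in some CAMI datasets.
record = ("@S0R1|contig_7/1", "ACGT", "+", "IIII")

# With "|" as the delimiter the read name is split in two: 5 fields, not 4.
fields_pipe = unflatten(flatten(record, "|"), "|")

# With "`" as the delimiter this particular record survives the round trip.
fields_tick = unflatten(flatten(record, "`"), "`")
```

The backtick only helps as long as no field of any record contains one.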

mlin commented 3 years ago

Thanks @tfrcarvalho -- ideally yeah, but I think it's a small risk, probably not worth spending the time to re-engineer anytime soon. Pipe isn't too unusual a character to see used as a delimiter "in the wild" (NCBI uses it a lot, for example). Famous last words, but we'll probably be okay with backtick ;)

mlin commented 3 years ago

LOL, and 5 minutes later I remembered that backtick can appear in the FASTQ quality scores for older Illumina data. According to the diagram there, even pipe can appear there for PacBio data!
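A quick sanity check of that claim, as a sketch: FASTQ quality characters encode a Phred score as an ASCII code minus an offset, which is 64 for older Illumina pipelines and 33 for Sanger-style files. The helper name `phred` is made up for illustration.

```python
def phred(ch, offset):
    """Return the Phred score encoded by one quality character."""
    return ord(ch) - offset


# Backtick is ASCII 96, so under the Phred+64 encoding it is Q32,
# a perfectly ordinary score for older Illumina data.
backtick_q_phred64 = phred("`", 64)

# Pipe is ASCII 124, so under Phred+33 it is Q91. That is far above
# typical short-read scores, but it is still a printable quality character.
pipe_q_phred33 = phred("|", 33)
```

So neither character is safe as a delimiter once the quality line is part of the flattened record.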

Maybe we should fix this properly after all, or see if it'll work with a nonprintable character. Reverted for now.
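One possible shape for a "proper" fix, sketched under the assumption that the records fit in memory: parse each record into a tuple and sort the tuples by read name, so no in-band delimiter is needed at all. The names `read_fastq` and `sort_fastq_by_read_name` are hypothetical and this is not the idseq_dag implementation.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (name, sequence, plus, quality) tuples from 4-line FASTQ records."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return  # clean end of input
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def sort_fastq_by_read_name(lines):
    """Sort whole records by read name; fields are never joined or re-split."""
    return sorted(read_fastq(lines), key=lambda rec: rec[0])


# Made-up input with pipes in the names and both "`" and "|" in a quality line.
example = [
    "@b|2\n", "AC\n", "+\n", "`|\n",
    "@a|1\n", "GT\n", "+\n", "II\n",
]
sorted_records = sort_fastq_by_read_name(example)
```

If the sort has to stay line-oriented (for example, to keep using an external sort on large files), a nonprintable delimiter such as the ASCII unit separator `"\x1f"` might work instead, since it lies outside the printable range used for quality scores. Whether every tool in the pipeline passes it through unchanged would need testing.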