bhattlab / bhattlab_workflows

Computational workflows for metagenomics tasks, by the Bhatt lab
http://www.bhattlab.com

syncing script is not flexible with header formats #19

Closed tamburinif closed 5 years ago

tamburinif commented 5 years ago

It appears that the syncing script cannot handle fastq headers that are not in standard Illumina format and puts all reads into the orphans file. This commonly affects data downloaded from SRA, for example. We definitely need to fix this asap.

tamburinif commented 5 years ago

Edit: the script works for some SRA data, but not all.

bsiranosian commented 5 years ago

Can you post an example of headers where this fails? You could also adjust the regex on line 59 to match what your sra reads have.

tamburinif commented 5 years ago

It looks like any header that ends in /1 or /2 fails, which isn't great, e.g.: @HWUSI-EAS740_103031124:1:100:10000:11286/1

I made my own version of the script with an adjusted regex, which I can share if needed.
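
For illustration, here is a minimal sketch of how a syncing script could derive a mate-independent key from either header style. This is not the regex from line 59 of the repository's script, nor the adjusted version mentioned above; the names `read_key` and `_MATE_SUFFIX` are hypothetical. It assumes mates differ only by a trailing `/1` or `/2`, or by a whitespace-separated suffix as in newer Illumina headers.

```python
import re

# Hypothetical pattern (an assumption, not the repository's regex):
# matches a trailing "/1" or "/2" mate marker at the end of the read name.
_MATE_SUFFIX = re.compile(r"/[12]$")


def read_key(header: str) -> str:
    """Return a key that should be identical for both mates of a pair."""
    name = header.strip()
    # Drop the leading "@" of a FASTQ header line, if present.
    if name.startswith("@"):
        name = name[1:]
    # Keep only the first whitespace-delimited token; this discards
    # suffixes such as " 1:N:0:1" that follow the read name.
    tokens = name.split(None, 1)
    name = tokens[0] if tokens else ""
    # Remove a trailing /1 or /2 so both mates map to the same key.
    return _MATE_SUFFIX.sub("", name)


if __name__ == "__main__":
    print(read_key("@HWUSI-EAS740_103031124:1:100:10000:11286/1"))
    print(read_key("@HWUSI-EAS740_103031124:1:100:10000:11286/2"))
```

Reads whose keys match across the two files would be treated as a pair, and the rest would go to the orphans file. Headers that mark mates some other way would need additional handling.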