HRGV / phyloFlash

phyloFlash - A pipeline to rapidly reconstruct the SSU rRNAs and explore the phylogenetic composition of an Illumina (meta)genomic dataset.
GNU General Public License v3.0

Issue comparing databases and locating reads used when -readlimit applied #157

Open mmcardozo opened 2 years ago

mmcardozo commented 2 years ago

Dear Creators,

I ran phyloFlash twice with different databases, using -readlimit to take only the first 3,000,000 reads. The first run used PR2; the command was: phyloFlash.pl -lib run01 -read1 /work/ollie/mcardozo/MetaG_MetaT/Sorted/1_rRNA.fastq -readlength 150 -keeptmp -dbhome /work/ollie/mcardozo/MetaG_MetaT/phyloflash_for_euks/414 -html -log -zip -readlimit 3000000 -taxlevel 8. The second run used the standard SILVA 138 database, also with the first 3,000,000 reads.

Looking into the output folders of both tests, the x.all.vsearch.csv files, which list the reads (phyloFlash headings), accession numbers, and taxonomy, look different, as expected, since different databases were used. But why are the phyloFlash headings different, e.g. "run01c.PFspades_28_57.691645"? If phyloFlash takes the first reads, shouldn't the headings be the same? And where in the output folder can I find the raw reads used, without the phyloFlash heading formatting? I attached both .all.vsearch.csv files of the same sample.

Many thanks in advance!

Magda

Attachments: run01.all.vsearch.csv, run01c.all.vsearch.csv

HRGV commented 2 years ago

Hi Magda, I think you are confusing reads and assembly output. The header you are showing is the header for one of the assembled sequences generated by spades. These assemblies will always be named LIBRARY_tool_SEQ#_coverage. The reads are not altered and are reported with their original headers in files that start with your LIBRARY name and end with SSU.1.fq and SSU.2.fq for paired end reads. Looking at the CSV files, it appears that the generated CSV files are lacking some separators between the generated assembly header, the best blast hit and the other columns. We will look into that. Thanks for reporting!