HRGV / phyloFlash

phyloFlash - A pipeline to rapidly reconstruct the SSU rRNAs and explore the phylogenetic composition of an Illumina (meta)genomic dataset.
GNU General Public License v3.0

raw data or clean data #148

Closed: fujch7 closed this issue 2 years ago

fujch7 commented 3 years ago

Hi, thanks for your amazing tool. I have two questions. First, should the input reads be the raw data or the clean data (quality-controlled with Trimmomatic and FastUniq)? Second, I have 6 metagenome samples: should I run the tool on each sample separately, or merge the 6 samples and run it once?

kbseah commented 3 years ago

Thanks for getting in touch!

fujch7 commented 3 years ago

Hi, I'm very excited to have run this program successfully! But I am confused by the read counts summarized in the log file:

    [22:29:53] Total read segments processed: 326098486
    [22:29:53] insert size median: 241
    [22:29:53] insert size std deviation: 66
    [22:29:53] Summarizing taxonomy from mapping hits to SILVA database
    [22:30:00] done...
    [22:30:01] Forward read segments mapping: 70117
    [22:30:01] Reverse read segments mapping: 70306
    [22:30:01] Reporting mapping statistics for paired end input
    [22:30:01] **Total read pairs with at least one segment mapping: 49149**
    [22:30:01] => **both segments mapping to same reference: 51169**
    [22:30:01] => **both segments mapping to different references: 9552**
    [22:30:01] **Read segments where next segment unmapped: 18981**
    [22:30:01] mapping rate: 0.030%

Why is "Total read pairs with at least one segment mapping" always less than "both segments mapping to same reference"? I don't quite understand how the numbers in bold relate to one another.
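The relationship between these counts can be checked with a little arithmetic. The sketch below uses only the numbers from the log excerpt and rests on one assumption that the thread does not confirm: that each "Read segments where next segment unmapped" entry corresponds to a pair in which exactly one segment mapped. The variable names are illustrative and are not taken from phyloFlash.

```python
# Counts copied from the log excerpt above.
forward_mapped = 70117
reverse_mapped = 70306
pairs_reported = 49149   # "Total read pairs with at least one segment mapping"
both_same_ref = 51169    # "both segments mapping to same reference"
both_diff_ref = 9552     # "both segments mapping to different references"
next_unmapped = 18981    # "Read segments where next segment unmapped"

# ASSUMPTION: every "next segment unmapped" entry is a pair with exactly one
# mapped segment, while each "both segments" pair contributes two segments.
segments_from_categories = 2 * (both_same_ref + both_diff_ref) + next_unmapped
pairs_from_categories = both_same_ref + both_diff_ref + next_unmapped

# Do the per-category counts add up to the forward + reverse segment counts?
print(segments_from_categories == forward_mapped + reverse_mapped)  # True

# Pair count implied by the categories versus the pair count in the log.
print(pairs_from_categories, pairs_reported)   # 79702 49149
print(pairs_from_categories - pairs_reported)  # 30553
```

Under that assumption the three category counts agree with the forward and reverse segment counts, so the figure that stands out is the reported total of 49149 pairs, which is 30553 lower than the categories imply.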

kbseah commented 2 years ago

Yes, this doesn't seem right to me. Which version of phyloFlash are you running?

Could you please run phyloFlash on the test files (test_F.fq.gz and test_R.fq.gz) included with phyloFlash, and attach the log file of the run here? If you installed using Conda, then those two files should be located in the Conda environment folder under lib/phyloFlash/test_files/.

The count of "Total read pairs with at least one segment mapping" is based on the read names in the Fastq file, whereas the other metrics are based on a running count while processing the whole file. In theory they should match up, and it also serves as a sanity check. Is it possible that the same read name may occur more than once in the file? Were the reads renamed in some way during initial QC or trimming, for example?
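One way to check for repeated read names is sketched below. This is not phyloFlash code: the function names `read_names` and `duplicated_names` are hypothetical, and the script assumes plain four-line FASTQ records (no wrapped sequences) and takes the read name to be the header text up to the first whitespace.

```python
import gzip
from collections import Counter


def read_names(path):
    """Yield the read name from each record, assuming 4-line FASTQ records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # Every fourth line (0, 4, 8, ...) is a header; skip blank lines.
            if line_number % 4 == 0 and line.strip():
                # Drop the leading "@" and anything after the first whitespace.
                yield line[1:].split()[0]


def duplicated_names(path):
    """Return {read name: count} for names that occur more than once."""
    counts = Counter(read_names(path))
    return {name: n for name, n in counts.items() if n > 1}
```

Calling `duplicated_names` on the forward and reverse files separately should return an empty dictionary for each if every read name is unique within its file. The counter holds every name in memory, so for an input of this size it may be more practical to run it on a subset of the reads first.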