raw data or clean data - Githubissues

fujch7 commented 3 years ago

Hi, thanks for your amazing tool. I have two questions. First, should the input reads be raw data or clean data (already quality-controlled with Trimmomatic and FastUniq)? Second, I have six metagenome samples: should I run the tool on each sample separately, or merge all six samples and run it once?

kbseah commented 3 years ago

Thanks for getting in touch!

phyloFlash extracts only those reads that can be mapped to a SSU rRNA reference, so I do not think there will be much difference running it on raw vs. trimmed data.
If the six samples represent different biological samples, it can be useful to run them separately to see if they have different microbial composition (e.g. using the phyloFlash_compare.pl script). If they are technical replicates, or simply different sequencing runs from the same library, then it probably should be OK to pool them. Pooling the libraries may also allow detection of lower-abundance taxa.
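
For the "run them separately" case, here is a minimal sketch of how the six per-sample commands could be generated. The sample names, the `reads` directory and the `_R1.fq.gz` / `_R2.fq.gz` file naming are hypothetical placeholders, not something from this thread; only the `-lib`, `-read1` and `-read2` options of `phyloFlash.pl` are used, and it is worth checking them against the help output of your installed version. The script only builds and prints the commands, it does not run them.

```python
from pathlib import Path


def per_sample_commands(samples, reads_dir):
    """Build one phyloFlash command (as an argument list) per sample.

    Assumes a hypothetical naming scheme: <reads_dir>/<sample>_R1.fq.gz
    and <reads_dir>/<sample>_R2.fq.gz. Adjust to your own file names.
    """
    commands = []
    for name in samples:
        forward = Path(reads_dir) / f"{name}_R1.fq.gz"
        reverse = Path(reads_dir) / f"{name}_R2.fq.gz"
        commands.append([
            "phyloFlash.pl",
            "-lib", name,            # library name used to label the output
            "-read1", str(forward),  # forward reads
            "-read2", str(reverse),  # reverse reads
        ])
    return commands


if __name__ == "__main__":
    # Six hypothetical sample names: sample1 ... sample6
    samples = [f"sample{i}" for i in range(1, 7)]
    for command in per_sample_commands(samples, "reads"):
        print(" ".join(command))
```

Keeping one library name per sample means each run produces its own set of outputs, which is what a later comparison across samples would need.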

fujch7 commented 3 years ago

Hi, I'm very excited to have run this program successfully! But I am confused by the read counts summarized in the log file:

    [22:29:53] Total read segments processed: 326098486
    [22:29:53] insert size median: 241
    [22:29:53] insert size std deviation: 66
    [22:29:53] Summarizing taxonomy from mapping hits to SILVA database
    [22:30:00] done...
    [22:30:01] Forward read segments mapping: 70117
    [22:30:01] Reverse read segments mapping: 70306
    [22:30:01] Reporting mapping statistics for paired end input
    [22:30:01] **Total read pairs with at least one segment mapping: 49149**
    [22:30:01] => **both segments mapping to same reference: 51169**
    [22:30:01] => **both segments mapping to different references: 9552**
    [22:30:01] **Read segments where next segment unmapped: 18981**
    [22:30:01] mapping rate: 0.030%

Why is "Total read pairs with at least one segment mapping" always less than "both segments mapping to same reference"? I don't quite understand the quantitative relationship between the lines in bold.
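
For reference, here is a small arithmetic check on the numbers in the log excerpt above. It rests on an assumption about how the lines are counted, which is not confirmed anywhere in this thread: that the two "both segments mapping" lines count read pairs (two mapped segments each) and that "Read segments where next segment unmapped" counts single segments (one mapped segment each). The function names are just illustrative.

```python
# Values copied from the log excerpt above.
forward_mapped = 70117
reverse_mapped = 70306
pairs_by_name = 49149            # "Total read pairs with at least one segment mapping"
pairs_same_ref = 51169           # "both segments mapping to same reference"
pairs_diff_ref = 9552            # "both segments mapping to different references"
segments_mate_unmapped = 18981   # "Read segments where next segment unmapped"


def expected_segments(same_ref, diff_ref, mate_unmapped):
    """Mapped segments implied by the breakdown (assumed: 2 per full pair, 1 per orphan)."""
    return 2 * (same_ref + diff_ref) + mate_unmapped


def expected_pairs(same_ref, diff_ref, mate_unmapped):
    """Pairs with at least one mapped segment implied by the same breakdown."""
    return same_ref + diff_ref + mate_unmapped


total_segments = forward_mapped + reverse_mapped
print(total_segments, expected_segments(pairs_same_ref, pairs_diff_ref, segments_mate_unmapped))
# 140423 140423
print(expected_pairs(pairs_same_ref, pairs_diff_ref, segments_mate_unmapped), pairs_by_name)
# 79702 49149
```

Under that reading, the forward and reverse segment counts agree exactly with the three breakdown lines, and the breakdown implies 79702 pairs with at least one mapped segment. Only the reported total of 49149 is out of step with the rest.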

kbseah commented 2 years ago

Yes, this doesn't seem right to me. Which version of phyloFlash are you running?

Could you please run phyloFlash on the test files (test_F.fq.gz and test_R.fq.gz) included with phyloFlash, and attach the log file of the run here? If you installed using Conda, then those two files should be located in the Conda environment folder under lib/phyloFlash/test_files/.

The count of "Total read pairs with at least one segment mapping" is based on the read names in the FASTQ file, whereas the other metrics are based on a running count kept while processing the whole file. In theory the two should agree, so the comparison also serves as a sanity check. Is it possible that the same read name occurs more than once in the file? Were the reads renamed in some way during initial QC or trimming, for example?
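
One way to check for repeated read names is a short script along these lines. This is only a sketch and not part of phyloFlash: it assumes plain four-line FASTQ records (no wrapped sequence lines), treats the read name as the header text up to the first whitespace, and holds every name in memory, so for a very large file it is better run on a subset. The function names are made up for this example.

```python
import gzip
from collections import Counter


def read_names(fastq_path):
    """Yield the read name of every record in a (optionally gzipped) FASTQ file.

    Assumes strict 4-line records; the name is the header line without the
    leading '@', cut at the first whitespace.
    """
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:  # header line of each record
                fields = line[1:].split()
                yield fields[0] if fields else ""


def duplicated_names(fastq_path):
    """Return {read_name: count} for names that occur more than once."""
    counts = Counter(read_names(fastq_path))
    return {name: n for name, n in counts.items() if n > 1}
```

Calling `duplicated_names("reads_F.fq.gz")` on the forward read file (the file name here is a placeholder) returns an empty dictionary if every read name is unique. Any entries it does return are names that a name-based count would collapse into one.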

HRGV / phyloFlash

raw data or clean data #148