Inconsistent outputs from running proper supersets of reads

Nheyer commented 2 years ago

I have been running a few rounds of kraken2 on some metagenomics data that has a high fraction of mouse reads that are not useful for the project. To improve processing time, I discarded all the reads that mapped to the mouse genome. Then, to be a bit less strict, I tested keeping the poorly mapped reads as well as the reads that did not map at all. I expected that the number of taxa represented in the kraken2 output would eventually level off. However, after a mapping quality of ~18 was used, the number of taxa found by kraken2 started to decrease. In fact, looking at the kraken2 output, the VERY SAME reads that were classified in the stricter set were not being classified in the larger dataset. Does anyone know why this could be happening?

Nheyer commented 2 years ago

@dfornika This was caused by R1 and R2 not being properly paired. Can you add a warning so that this doesn't happen to others?

dfornika commented 2 years ago

@Nheyer I'm not actively involved in developing this tool, so I won't be adding that feature.

jenniferlu717 commented 2 years ago

@Nheyer what do you mean by not being properly paired?

Nheyer commented 2 years ago

@jenniferlu717 I mean the case where the first read in read1.fq is not the mate of the first read in read2.fq. The program appears to assume that the two files are in the same order instead of confirming it by checking the read names.
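A check along these lines could be sketched roughly as follows. This is not kraken2 code, only a standalone illustration of comparing read names across the two files. The function names (`read_names`, `normalize_name`, `first_mismatch`) are made up for this example. It assumes well-formed four-line FASTQ records and that mates share a name once any trailing `/1` or `/2` and any comment after the first whitespace are removed.

```python
import gzip
from itertools import zip_longest


def normalize_name(header):
    """Strip the leading '@', any comment after whitespace, and a trailing /1 or /2."""
    parts = header.split()
    name = parts[0] if parts else ""
    if name.startswith("@"):
        name = name[1:]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def read_names(path):
    """Yield the normalized read name of each 4-line FASTQ record in `path`."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:  # header line of each record
                yield normalize_name(line)


def first_mismatch(r1_path, r2_path):
    """Return (record_index, name1, name2) for the first out-of-sync pair, or None.

    A missing mate (one file shorter than the other) is reported with None
    in place of the missing name.
    """
    pairs = zip_longest(read_names(r1_path), read_names(r2_path))
    for index, (name1, name2) in enumerate(pairs):
        if name1 != name2:
            return index, name1, name2
    return None
```

Calling `first_mismatch("read1.fq", "read2.fq")` before classification would return `None` when the files are in sync, and otherwise the zero-based index of the first record whose names disagree.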

Nheyer commented 2 years ago

@jenniferlu717 If you point me towards where the reads are input to the algorithm, I can open a pull request for this.

DerrickWood / kraken2

Inconsistent outputs from running proper supersets of reads #561