allind / EukDetect

MIT License
40 stars 15 forks source link

"Four reads in two markers" rule for paired end reads? #18

Closed wbazant closed 2 years ago

wbazant commented 2 years ago

Hi, I've discovered the "Four reads in two markers" for paired end reads really means two pairs each aligning to a different marker.

Here is an example with three paired end reads. EukDetect gives me read_counts 4:

==> x.1.fq <==
@metazoa-Conus_consors-1194691at2759-S1_321_795_0:0:0_0:0:0_0/1
GTCATAGAGAGAGAAGAGGTAGCGTTTGATCTTGGGTAGTCCATGCAACACCTCCAGAATTTCTGATCCTTTGGTCACCTTCTCTCTGAGTTTGGGTCTC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@metazoa-Conus_consors-917326at2759-S1_1053_1563_0:0:0_0:0:0_1/1
ACCCTCCATGACGAAGCTGCTGCCTTTGGACAGCAGGTTGGTGAACATGCCTGTCGAACTGACTCCACCACCAGCATACATAGAAGGGTTTGCCGTCATC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@metazoa-Conus_consors-917326at2759-S1_1119_1689_0:0:0_0:0:0_0/1
CGCACGCAGCAGCTTAGGATCAAAATAACGGAAGTCATCCACTTCTTGCAGGGATTTCAGCTCCATGAGAGCGTCAACAATGCGTGTTGTTGGTAAATTG

==> x.2.fq <==
@metazoa-Conus_consors-1194691at2759-S1_321_795_0:0:0_0:0:0_0/2
AGCTGAATTCCTCAGCAAGATTGGTGATAAGGCAGGAGCTTTGACCCAGTTCCGTCAGACCTATGACAAGACGGTTGCCCTTGGTTATCGGCTGGACTTG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@metazoa-Conus_consors-917326at2759-S1_1053_1563_0:0:0_0:0:0_1/2
TCTGGAGGGGGAAGATGAGGCAGCCATCAGTATGCTGTCTGACACTACAGCAAAACTCACTTCAGCTGTTAGCTCCTTACCAGAACTTCTGGAGAAGAAG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@metazoa-Conus_consors-917326at2759-S1_1119_1689_0:0:0_0:0:0_0/2
TGTTAGCTCCTTACCAGAACTTCTGGAGAAGAAGCGTCTGATTGACATGCACACCAATATTGCCACTGCCCTTCTTGAGCATATCAAGCACAGAAAGCTG

This is the output: I believe one gets filtered out for low mapq, and then they're counted double.

Name    Taxid   Observed_markers    Read_counts Percent_observed_markers    Total_marker_coverage   Percent_identity
Conus consors   101297  2   4   4.17%   13.02%  100.0%

So currently, for paired end reads, matches to two markers are enough, but for single end reads there's a higher evidence threshold. Is this intentional?

When looking for eukaryotes in MicrobiomeDB data with our own method of alignments to EukDetect's reference I did at first have a rule to require four unique reads, and then I dropped that part since it seemed to reduce sensitivity. However, I've not really considered this separately for single end data, and I'd be interested to know if there's some nuance to this (why the four reads rule etc.) that I'm not aware of.

wbazant commented 2 years ago

Actually, I'm sorry - the rule is not "two pairs each aligning to a different marker", but really, four reads (but two paired reads are counted as two reads). I've found myself some other three reads where only one mate of the pair aligns - i.e. bowtie2 does this:

3 reads; of these:
  3 (100.00%) were paired; of these:
    3 (100.00%) aligned concordantly 0 times
    0 (0.00%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    ----
    3 pairs aligned 0 times concordantly or discordantly; of these:
      6 mates make up the pairs; of these:
        2 (33.33%) aligned 0 times
        4 (66.67%) aligned exactly 1 time
        0 (0.00%) aligned >1 times
66.67% overall alignment rate
allind commented 2 years ago

Hi, I don't really consider this to be a different level of evidence for paired end reads vs. single end reads. We made the decision that for the cutoff for reporting a species to count forward and reverse reads from a pair separately, as while they are derived from the same piece of DNA, they ultimately contain different information. Two ends of a pair aligning to one marker gene is ultimately more informative about that marker being present in the sample than if one end aligns and the other does not.

However, if you'd additionally like to know whether the aligned reads are derived from the same mate pair or not, your metric works.