bioinfo-ut / GenomeTester4

A toolkit for performing set operations - union, intersection and complement - on k-mer lists.
GNU General Public License v3.0

Issues with kmer frequency thresholds #27

Open RvV1979 opened 1 year ago

RvV1979 commented 1 year ago

I am having some trouble understanding how GenomeTester4 version 4.2.16 (stable) works, notably the kmer frequency thresholds. For example, I have two datasets, A and B, and am interested in the kmers that are unique to each:

$glistcompare A_31.list B_31.list -dd -c 1 -o dd-c1
$glistquery dd-c1_31_0_diff*.list |head
AAAAAAAAAAAAAAAAAAAAAAAAACAGCCG 4       0
AAAAAAAAAAAAAAAAAAAAAAAAACATCGT 1       0
AAAAAAAAAAAAAAAAAAAAAAAAACCCGTC 4       0
AAAAAAAAAAAAAAAAAAAAAAAAACCGATG 0       1
AAAAAAAAAAAAAAAAAAAAAAAAACCGCGC 1       0
AAAAAAAAAAAAAAAAAAAAAAAAACCGCTC 1       0
AAAAAAAAAAAAAAAAAAAAAAAAACCGGAG 4       0
AAAAAAAAAAAAAAAAAAAAAAAAACCGTCA 2       0
AAAAAAAAAAAAAAAAAAAAAAAAACCTCCG 0       3
AAAAAAAAAAAAAAAAAAAAAAAAACCTCGT 0       2

So far, so good: the list has only those kmers that are in either A or B but not both. However, kmers that occur only once could very well represent sequencing errors. Therefore, I want a list of unique kmers that occur at least twice. I assumed this could be done using the cutoff frequency. However, this does not work as expected: the second-last kmer in the output below occurs only once:

$glistquery dd-c2_31_0_diff*.list |head -n 12
AAAAAAAAAAAAAAAAAAAAAAAAAAGCTCG 9       0
AAAAAAAAAAAAAAAAAAAAAAAAAAGTACG 0       4
AAAAAAAAAAAAAAAAAAAAAAAAAAGTCCG 2       0
AAAAAAAAAAAAAAAAAAAAAAAAACAGCCG 4       0
AAAAAAAAAAAAAAAAAAAAAAAAACAGTCG 2       0
AAAAAAAAAAAAAAAAAAAAAAAAACATCCG 7       0
AAAAAAAAAAAAAAAAAAAAAAAAACATCGA 2       0
AAAAAAAAAAAAAAAAAAAAAAAAACATCTC 0       4
AAAAAAAAAAAAAAAAAAAAAAAAACATGCG 0       2
AAAAAAAAAAAAAAAAAAAAAAAAACATTCG 0       2
AAAAAAAAAAAAAAAAAAAAAAAAACCATCG 1       0
AAAAAAAAAAAAAAAAAAAAAAAAACCCGTC 4       0

Likewise, when I specify a minimum frequency when querying the files, this does not have any effect:

$glistquery dd-c2_31_0_diff*.list -minfreq 2 |head -n 12
AAAAAAAAAAAAAAAAAAAAAAAAAAGCTCG 9       0
AAAAAAAAAAAAAAAAAAAAAAAAAAGTACG 0       4
AAAAAAAAAAAAAAAAAAAAAAAAAAGTCCG 2       0
AAAAAAAAAAAAAAAAAAAAAAAAACAGCCG 4       0
AAAAAAAAAAAAAAAAAAAAAAAAACAGTCG 2       0
AAAAAAAAAAAAAAAAAAAAAAAAACATCCG 7       0
AAAAAAAAAAAAAAAAAAAAAAAAACATCGA 2       0
AAAAAAAAAAAAAAAAAAAAAAAAACATCTC 0       4
AAAAAAAAAAAAAAAAAAAAAAAAACATGCG 0       2
AAAAAAAAAAAAAAAAAAAAAAAAACATTCG 0       2
AAAAAAAAAAAAAAAAAAAAAAAAACCATCG 1       0
AAAAAAAAAAAAAAAAAAAAAAAAACCCGTC 4       0

Is this a bug or am I missing something?

Thanks

MaidoRemm commented 1 year ago

Perhaps --minfreq instead of -minfreq would solve the problem?
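
If neither flag gives the expected result, the printed output can also be filtered after the fact. This is only a minimal sketch, not part of GenomeTester4: it assumes the three whitespace-separated columns seen in the listings above (kmer, count in A, count in B), and filter_min_count is a hypothetical helper name.

```shell
# Keep only rows where at least one of the two counts reaches the minimum.
# Assumes input columns: kmer, count in A, count in B (as in the listings above).
# filter_min_count is a hypothetical helper, not a GenomeTester4 command.
filter_min_count() {
    awk -v min="${1:-2}" '$2 >= min || $3 >= min'
}

# Possible usage with the file from the question:
# glistquery dd-c2_31_0_diff*.list | filter_min_count 2 | head -n 12
```

Since each kmer in a difference list has a zero in one of the two columns, testing either column against the minimum amounts to testing the non-zero count.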