alexpreynolds / kmer-counter

Count kmers with a more efficient (faster) hash table
MIT License
24 stars 5 forks source link

Kmer-counter is creating many kmer feature lines from only 2 sequences. #1

Closed arivers closed 7 years ago

arivers commented 7 years ago

It's great to see that Kmer-counter solves the problem of counting 1 kmer frequency per sequence in multisequence fasta.

I encountered a problem running two long sequences (4M bp and 2M bp) through it.

using the following command: mer-counter --fasta --k 4 ~/Downloads/2seqs.fasta > kmercountertest.txt

It generates 41198 kmer profile lines instead of 2.

I also wondered if it takes into account reverse complementarity. For example if k=4 we would have 256 unique kmers but with since we cannot tell if we sequenced the forward or reverse strand we collapse AAAA and TTTT etc. into 137 unique kmers.

alexpreynolds commented 7 years ago

Thanks, looks like a bug.

I added a helper script to test called generate-random-sequences.py that generates a FASTA file with n sequences, each of specified length. Testing files with 1M-long sequences seems to yield IO problems:

$ ./generate-random-sequences.py 5 1000000 > /tmp/test.fa
$ ../kmer-counter --fasta --k=2 /tmp/test.fa
bash: [27709: 2] tcsetattr: Inappropriate ioctl for device
Segmentation fault: 11

So I'll review how I'm reading in lines and see if I need to make some adjustments there.

This counts a kmer instance in forward and reverse orientation. If there's a hit for TCA, for example, there will be a hit for TGA, the reverse-complement of TCA — and vice versa.

Perhaps there is already established convention for defining which is unique, but I think an issue with collapsing is that, while it would shrink the output, I would need to pick one or the other kmer to collapse to.

arivers commented 7 years ago

I know bbtools kemercountexact.sh uses the "rcomp=f" and jellyfish uses the "-C" flag to control the reporting behavior. The default for both is to report the kmer encountered first and add any instances of its reverse complement to the first kmer. The other mode is to count all kmers as encountered.

If you are adding a kmer to its own kmer bin and its reverse complement bin you might want to warn users about that since the total will be twice as high. Also, are palindromic kmers added to themselves if k is even?

alexpreynolds commented 7 years ago

No, a palindrome is not added twice: https://github.com/alexpreynolds/kmer-counter/blob/master/kmer-counter.cpp#L97-L99

alexpreynolds commented 7 years ago

I fixed the segfault issue by upping the per-line sequence buffer to 256MB. This should give plenty of room except for the very largest FASTA files, I think.

I added a --no-rc option to only print unique kmers (including palindromes). This prints the first kmer in the mapping and ignores the second, if there is a second. Palindromes are treated as unique kmers and included in the output.

I improved the generate-random-sequences.py script to take in a seed value. This helps with generating inputs that are consistent from test to test.

Fixes are in commit 258202f. Feel free to reopen if you run into problems. Thanks for the feedback.

arivers commented 7 years ago

I deleted the old directory, re-cloned and recompiled the program but I get the same issue where 41198 profiles are generated from 2 sequences.

Here is the file: https://drive.google.com/file/d/0B4Ufjq-HxQ4XZmhtLXJNSDZsYWM/view?usp=sharing

$ kmer-counter --fasta --k=4 2seqs.fasta > kmer-countertest.txt

alexpreynolds commented 7 years ago

I made some changes to the parser. Hopefully this works for you, too:

$ ./kmer-counter --fasta --k=4 --no-rc /tmp/2seqs.fasta
>NC_003911.12   GACC:57985 GAGC:30726 GTAA:6903 TCTC:22849 CATC:55081 GTTG:27522 ATTC:16007 CACG:31125 ACAT:20916 GAAG:26368 ATGA:27165 AAGC:25073 GGGT:36668 AGTA:4061 TGCC:60675 TCCA:28025 TAAC:4042 TGCG:54613 GCAT:40505 TCAG:37056 ACTA:3317 AGCG:52431 AAGG:30864 TCGC:59187 CGAC:56262 GTCC:26657 AATC:16950 TATT:8952 ACAG:24818 AACT:13945 CGCG:110823 ACCG:59964 TAAA:4548 ATCT:29579 CTCA:14886 TAGG:8599 CAGG:55480 GCCT:49580 AGTG:8827 CTCT:14622 GCAC:38905 ACGA:25339 ACAC:15161 CATA:13831 AAAC:19129 TCTG:33815 GTAC:7039 GCCA:68908 TTCG:38799 AAGT:8014 GGTG:60876 CGGC:116741 TCGG:65704 TTGG:25521 GAAC:37260 GTAG:6318 TGCT:29136 TTAT:5004 CGTA:8365 ATAT:15103 CAGC:80545 CCAT:39247 AGGG:28365 TGAC:34379 ATAG:16534 TTAG:1920 CATT:16917 CATG:42003 ACGC:34780 TCCG:34521 TCTT:21029 CAAA:21338 TTGA:21233 CAGT:16738 TTGT:15887 GCGG:103238 GTGA:27653 AGCT:24277 TCGA:61389 CACA:16986 CTCG:53664 TGTC:31156 AGTC:9453 AAAT:15397 GGTA:12314 ATCC:39646 CCGG:86863 ACGG:30238 CAAT:18200 CTTG:33238 TTGC:37657 GACG:39508 TATC:19133 TTCC:27197 CCTC:37191 CTCC:15811 GCGC:145903 AGAC:20804 AAAA:14221 GAGT:8217 GGGA:20344 TCTA:7305 GTAT:10728 TGTT:22392 CCCC:38625 GGTT:28223 CGAT:62895 AAAG:22354 AACG:23393 TGCA:29323 CCAG:68023 TATA:5453 GGCG:125241 ATCA:39202 TTCA:24640 CTTA:3020 AGAA:19993 ACCA:34998 TGTA:6249 GAAA:30994 CGGG:74818 AGCC:38418 CCAC:36096 TAGC:7919 AGGT:27768 TAAT:3965 GATC:80157 AATT:10749 CTAG:2367 GCCC:83127 CTGC:56681 CCCA:36956 ACGT:12033 AGGA:26421 TTAA:1747 GGCC:94021
>NC_005072.1    GACC:38567 GAGC:24120 GTAA:20352 TCTC:26018 CATC:44859 GTTG:27249 ATTC:31942 CACG:19869 ACAT:23945 GAAG:31457 ATGA:34812 AAGC:27254 GGGT:26148 AGTA:17882 TGCC:42057 TCCA:32546 TAAC:15317 TGCG:35294 GCAT:33301 TCAG:33092 ACTA:15584 AGCG:34051 AAGG:31188 TCGC:38589 CGAC:35602 GTCC:19728 AATC:33175 TATT:51684 ACAG:21933 AACT:28814 CGCG:67202 ACCG:37957 TAAA:55233 ATCT:41500 CTCA:18178 TAGG:13433 CAGG:39880 GCCT:35205 AGTG:11857 CTCT:21776 GCAC:26409 ACGA:19394 ACAC:12812 CATA:20956 AAAC:31032 TCTG:31043 GTAC:8372 GCCA:47410 TTCG:28740 AAGT:25047 GGTG:41348 CGGC:71050 TCGG:41575 TTGG:31269 GAAC:31046 GTAG:10864 TGCT:29974 TTAT:43709 CGTA:7873 ATAT:42046 CAGC:54908 CCAT:35931 AGGG:22305 TGAC:26800 ATAG:25304 TTAG:21261 CATT:30450 CATG:31724 ACGC:22511 TCCG:23028 TCTT:44222 CAAA:41040 TTGA:40494 CAGT:17707 TTGT:26976 GCGG:63289 GTGA:22845 AGCT:27380 TCGA:41890 CACA:15105 CTCG:34241 TGTC:24613 AGTC:12451 AAAT:67956 GGTA:15523 ATCC:35520 CCGG:52756 ACGG:20055 CAAT:34141 CTTG:33069 TTGC:36760 GACG:25750 TATC:29806 TTCC:30709 CCTC:29620 CTCC:18229 GCGC:88668 AGAC:19251 AAAA:84141 GAGT:12666 GGGA:19056 TCTA:23667 GTAT:17285 TGTT:30354 CCCC:26224 GGTT:27374 CGAT:42865 AAAG:44786 AACG:18102 TGCA:30244 CCAG:47414 TATA:25898 GGCG:76510 ATCA:46191 TTCA:40074 CTTA:18399 AGAA:42423 ACCA:31652 TGTA:15595 GAAA:47501 CGGG:45640 AGCC:28288 CCAC:26113 TAGC:13411 AGGT:27055 TAAT:46637 GATC:58236 AATT:65368 CTAG:8666 GCCC:51860 CTGC:41099 CCCA:29119 ACGT:9634 AGGA:29556 TTAA:50264 GGCC:58670
alexpreynolds commented 7 years ago
$ ./kmer-counter --fasta --k=4 /tmp/2seqs.fasta
>NC_003911.12   GACC:57985 GAGC:30726 GTAA:6903 TCTC:22849 CATC:55081 GTTG:27522 ATTC:16007 CACG:31125 ACAT:20916 GAAG:26368 ATGA:27165 AAGC:25073 GGGT:36668 AGTA:4061 TGCC:60675 TCCA:28025 TAAC:4042 TGCG:54613 CAAC:27522 GCAT:40505 TCAG:37056 ACTA:3317 GATG:55081 AGCG:52431 AAGG:30864 TCGC:59187 CGAC:56262 GTCC:26657 AATC:16950 TATT:8952 ACAG:24818 AACT:13945 CGCG:110823 ACCG:59964 TAAA:4548 ATCT:29579 CTCA:14886 CTTC:26368 TAGG:8599 CAGG:55480 GCCT:49580 CCTG:55480 AGTT:13945 AGTG:8827 CTCT:14622 GCAC:38905 ACGA:25339 ACAC:15161 CATA:13831 AAAC:19129 TCTG:33815 GTAC:7039 GCCA:68908 TTCG:38799 AAGT:8014 TCAT:27165 GGTG:60876 CGGC:116741 TCGG:65704 TTGG:25521 GAAC:37260 GCTT:25073 GTAG:6318 TGCT:29136 TTAT:5004 TCGT:25339 CGTA:8365 ATAT:15103 CAGC:80545 CCAT:39247 AGGG:28365 CGCT:52431 TGAC:34379 ACCC:36668 ATAG:16534 TTAG:1920 CATT:16917 CATG:42003 ACGC:34780 TCCG:34521 AATA:8952 TCTT:21029 CAAA:21338 TTGA:21233 CAGT:16738 TTGT:15887 GCGG:103238 GTTC:37260 GTGA:27653 CAGA:33815 AGCT:24277 CTAC:6318 TCGA:61389 ACTG:16738 ACAA:15887 CACA:16986 AGAT:29579 CTCG:53664 TGTC:31156 AGTC:9453 AAAT:15397 CCGC:103238 TATG:13831 GGTA:12314 ACTT:8014 TGAG:14886 ATCC:39646 CCGG:86863 ACGG:30238 CAAT:18200 CTTG:33238 GCGA:59187 AGCA:29136 TTGC:37657 TGGC:68908 GACG:39508 ATTT:15397 TATC:19133 TCAA:21233 TTCC:27197 GTCA:34379 CCTC:37191 GTGC:38905 CAAG:33238 CGCA:54613 GTTT:19129 CTCC:15811 GCGC:145903 AGGC:49580 AGAC:20804 AAAA:14221 CTAA:1920 GAGT:8217 CACT:8827 GCGT:34780 GGGA:20344 CTAT:16534 GTTA:4042 TCTA:7305 GTAT:10728 AGAG:14622 GGTC:57985 AAGA:21029 GATT:16950 TGTT:22392 CACC:60876 CTGA:37056 CCCC:38625 GGTT:28223 ATGT:20916 CGAT:62895 TACT:4061 CGGT:59964 AAAG:22354 GGAT:39646 AACG:23393 TGCA:29323 CCAG:68023 TATA:5453 TCAC:27653 GGCG:125241 CGAG:53664 TGTG:16986 CCTT:30864 ATCG:62895 TTTA:4548 ATCA:39202 TTCA:24640 CTTA:3020 TGAT:39202 TTTT:14221 GGCA:60675 TGAA:24640 AGAA:19993 ACCA:34998 CCGT:30238 TGTA:6249 CGTG:31125 ATAA:5004 AACA:22392 GAAT:16007 GAAA:30994 TAAG:3020 TACC:12314 CGGG:74818 AGCC:38418 CCAC:36096 GGCT:38418 CTGT:24818 TGGA:28025 GGAC:26657 TAGC:7919 GAGA:22849 ATGG:39247 TGGT:34998 AGGT:27768 ACTC:8217 AATG:16917 GATA:19133 TAAT:3965 CGAA:38799 GATC:80157 CCCG:74818 AACC:28223 GCCG:116741 AATT:10749 ACCT:27768 GTGT:15161 CTAG:2367 GCCC:83127 GTGG:36096 CCCT:28365 GCTC:30726 TAGA:7305 TTAC:6903 TTCT:19993 CCGA:65704 GACA:31156 TTTC:30994 CTTT:22354 GCAA:37657 GTCT:20804 CTGC:56681 GCAG:56681 CCCA:36956 CCAA:25521 GGGC:83127 GCTA:7919 ACGT:12033 TAGT:3317 CGGA:34521 GTCG:56262 GCTG:80545 TCCC:20344 GGAG:15811 ATAC:10728 AGGA:26421 CGTT:23393 TACG:8365 TACA:6249 GAGG:37191 ATTA:3965 GACT:9453 TCCT:26421 CGTC:39508 GGAA:27197 CCTA:8599 GGGG:38625 ATGC:40505 TGGG:36956 CTGG:68023 CGCC:125241 TTTG:21338 TTAA:1747 ATTG:18200 GGCC:94021
>NC_005072.1    GACC:38567 GAGC:24120 GTAA:20352 TCTC:26018 CATC:44859 GTTG:27249 ATTC:31942 CACG:19869 ACAT:23945 GAAG:31457 ATGA:34812 AAGC:27254 GGGT:26148 AGTA:17882 TGCC:42057 TCCA:32546 TAAC:15317 TGCG:35294 CAAC:27249 GCAT:33301 TCAG:33092 ACTA:15584 GATG:44859 AGCG:34051 AAGG:31188 TCGC:38589 CGAC:35602 GTCC:19728 AATC:33175 TATT:51684 ACAG:21933 AACT:28814 CGCG:67202 ACCG:37957 TAAA:55233 ATCT:41500 CTCA:18178 CTTC:31457 TAGG:13433 CAGG:39880 GCCT:35205 CCTG:39880 AGTT:28814 AGTG:11857 CTCT:21776 GCAC:26409 ACGA:19394 ACAC:12812 CATA:20956 AAAC:31032 TCTG:31043 GTAC:8372 GCCA:47410 TTCG:28740 AAGT:25047 TCAT:34812 GGTG:41348 CGGC:71050 TCGG:41575 TTGG:31269 GAAC:31046 GCTT:27254 GTAG:10864 TGCT:29974 TTAT:43709 TCGT:19394 CGTA:7873 ATAT:42046 CAGC:54908 CCAT:35931 AGGG:22305 CGCT:34051 TGAC:26800 ACCC:26148 ATAG:25304 TTAG:21261 CATT:30450 CATG:31724 ACGC:22511 TCCG:23028 AATA:51684 TCTT:44222 CAAA:41040 TTGA:40494 CAGT:17707 TTGT:26976 GCGG:63289 GTTC:31046 GTGA:22845 CAGA:31043 AGCT:27380 CTAC:10864 TCGA:41890 ACTG:17707 ACAA:26976 CACA:15105 AGAT:41500 CTCG:34241 TGTC:24613 AGTC:12451 AAAT:67956 CCGC:63289 TATG:20956 GGTA:15523 ACTT:25047 TGAG:18178 ATCC:35520 CCGG:52756 ACGG:20055 CAAT:34141 CTTG:33069 GCGA:38589 AGCA:29974 TTGC:36760 TGGC:47410 GACG:25750 ATTT:67956 TATC:29806 TCAA:40494 TTCC:30709 GTCA:26800 CCTC:29620 GTGC:26409 CAAG:33069 CGCA:35294 GTTT:31032 CTCC:18229 GCGC:88668 AGGC:35205 AGAC:19251 AAAA:84141 CTAA:21261 GAGT:12666 CACT:11857 GCGT:22511 GGGA:19056 CTAT:25304 GTTA:15317 TCTA:23667 GTAT:17285 AGAG:21776 GGTC:38567 AAGA:44222 GATT:33175 TGTT:30354 CACC:41348 CTGA:33092 CCCC:26224 GGTT:27374 ATGT:23945 CGAT:42865 TACT:17882 CGGT:37957 AAAG:44786 GGAT:35520 AACG:18102 TGCA:30244 CCAG:47414 TATA:25898 TCAC:22845 GGCG:76510 CGAG:34241 TGTG:15105 CCTT:31188 ATCG:42865 TTTA:55233 ATCA:46191 TTCA:40074 CTTA:18399 TGAT:46191 TTTT:84141 GGCA:42057 TGAA:40074 AGAA:42423 ACCA:31652 CCGT:20055 TGTA:15595 CGTG:19869 ATAA:43709 AACA:30354 GAAT:31942 GAAA:47501 TAAG:18399 TACC:15523 CGGG:45640 AGCC:28288 CCAC:26113 GGCT:28288 CTGT:21933 TGGA:32546 GGAC:19728 TAGC:13411 GAGA:26018 ATGG:35931 TGGT:31652 AGGT:27055 ACTC:12666 AATG:30450 GATA:29806 TAAT:46637 CGAA:28740 GATC:58236 CCCG:45640 AACC:27374 GCCG:71050 AATT:65368 ACCT:27055 GTGT:12812 CTAG:8666 GCCC:51860 GTGG:26113 CCCT:22305 GCTC:24120 TAGA:23667 TTAC:20352 TTCT:42423 CCGA:41575 GACA:24613 TTTC:47501 CTTT:44786 GCAA:36760 GTCT:19251 CTGC:41099 GCAG:41099 CCCA:29119 CCAA:31269 GGGC:51860 GCTA:13411 ACGT:9634 TAGT:15584 CGGA:23028 GTCG:35602 GCTG:54908 TCCC:19056 GGAG:18229 ATAC:17285 AGGA:29556 CGTT:18102 TACG:7873 TACA:15595 GAGG:29620 ATTA:46637 GACT:12451 TCCT:29556 CGTC:25750 GGAA:30709 CCTA:13433 GGGG:26224 ATGC:33301 TGGG:29119 CTGG:47414 CGCC:76510 TTTG:41040 TTAA:50264 ATTG:34141 GGCC:58670