bigdatagenomics / adam

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

Kmers overcounted in Slice countKmers #2362

Closed: heuermh closed this issue 2 years ago

heuermh commented 2 years ago
$ adam-submit countSliceKmers -single adam-cli/src/test/resources/contigs.fa contigs.adam.count.txt 21
$ sort contigs.adam.count.txt > contigs.adam.count.sorted.txt

$ jellyfish count -m 21 -s 100M -t 8 -o contigs.count.jf adam-cli/src/test/resources/contigs.fa
$ jellyfish dump -c contigs.count.jf > contigs.jellyfish.count.txt
$ sort contigs.jellyfish.count.txt > contigs.jellyfish.count.sorted.txt

$ diff -w -u contigs.adam.count.sorted.txt contigs.jellyfish.count.sorted.txt | head -n 30
--- contigs.adam.count.sorted.txt   2022-05-06 12:15:28.000000000 -0500
+++ contigs.jellyfish.count.sorted.txt  2022-05-06 10:15:20.000000000 -0500
@@ -305,7 +305,7 @@
 AAAAGGGAACTAGAGAGACTG  1
 AAAAGGTAGATGATAGATAAT  1
 AAAAGGTCACTTTTGTTATGC  1
-AAAAGTAAAATTTTAGCAGTA  2
+AAAAGTAAAATTTTAGCAGTA 1
 AAAAGTAAAGAAAAGGAAGGT  1
 AAAAGTAAAGAGGTATTGGCG  1
 AAAAGTAACATCAAGTCAACC  1
@@ -434,7 +434,7 @@
 AAAATTTCCTGAGGTCCTCTC  1
 AAAATTTGAGAGACAAAATAA  1
 AAAATTTGGTAACCTGAGTCC  1
-AAAATTTTAGCAGTAAAAATG  2
+AAAATTTTAGCAGTAAAAATG 1
 AAAATTTTTGCAATCTGTCCA  1
 AAACAAAAAACTCACTGCAGC  1
 AAACAAAAAATAATAATGTAT  1
@@ -891,7 +891,7 @@
 AAAGTAAAAACAGAAAAATGT  1
 AAAGTAAAACTTAACTTTGTG  1
 AAAGTAAAAGACACACTTGCA  1
-AAAGTAAAATTTTAGCAGTAA  2
+AAAGTAAAATTTTAGCAGTAA 1
 AAAGTAAAGAAAAGGAAGGTA  1
 AAAGTAAAGAGGTATTGGCGT  1
 AAAGTAACATCAAGTCAACCA  1
@@ -1029,7 +1029,7 @@
...
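
To list every k-mer whose count differs between the two tools, rather than paging through diff hunks, a small shell helper along the following lines may be useful. This is only a sketch and not part of ADAM or Jellyfish: the function name `kmer_count_mismatches` is hypothetical, and it assumes both files hold one whitespace-separated `kmer count` pair per line, as the output above suggests. The inputs do not need to be sorted.

```shell
# Hypothetical helper (not an ADAM or Jellyfish command).
# Prints "kmer count_in_first count_in_second" for every k-mer present
# in both files whose counts differ.
# Assumes each line is "kmer<whitespace>count".
kmer_count_mismatches() {
  awk '
    NR == FNR { first[$1] = $2; next }   # first file: remember each count
    ($1 in first) && first[$1] != $2 {   # second file: compare with the stored count
      print $1, first[$1], $2
    }
  ' "$1" "$2"
}
```

It could then be called as `kmer_count_mismatches contigs.adam.count.txt contigs.jellyfish.count.txt`; each output line shows the k-mer followed by the first file's count and the second file's count. K-mers that appear in only one of the files are not reported by this sketch.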