GATB / bcalm

compacted de Bruijn graph construction in low memory
MIT License
99 stars 20 forks source link

A few duplicate kmers in the unitigs; is this expected? #41

Closed rsharris closed 5 years ago

rsharris commented 5 years ago

I'm seeing a few duplicate kmers in my output, which would seem to be unintended according to this statement in the github README:

... a k-mer and its reverse complement are considered to be the same object, appearing only once in the output, either in forward or reverse orientation.

My bcalm reports version as "BCALM 2, git commit 8137cc2, gatb-core version 1.4.1". I built from source fetched from --branch v2.2.1 a couple days ago.

The rate of duplicates is so low it makes me wonder if I have done something wrong. Exactly 100 duplicate 21-mers out of 140 million.

I had no luck creating a small example. But this is reproducible starting with SRR957915.sra from the short read archive. Here's what I did:

(1) Get abundance 2 21-mers and strip the non-names from the fasta headers.

fastq-dump --fasta SRR957915.sra --stdout > SRR957915.fa
bcalm -kmer-size 21 -abundance-min 2 \
  -in SRR957915.fa \
  -out scrap.SRR957915.bcalm
cat scrap.SRR957915.bcalm.unitigs.fa \
  | sed "s/ .*//" \
  | gzip \
  > SRR957915.unitigs_21.abun_2.fa.gz
rm scrap.SRR957915.bcalm.*

(2) I was expecting the results of step 1 to contain no duplicated 21-mers. So asking bcalm for abundance 2 21-mers from that result should produce an empty file. But instead I end up with 100 length 21 unitigs.

bcalm -kmer-size 21 -abundance-min 2 \
  -in SRR957915.unitigs_21.abun_2.fa.gz \
  -out scrap.SRR957915.bcalm_twice
cat scrap.SRR957915.bcalm_twice.unitigs.fa \
  | sed "s/ .*//" \
  | gzip \
  > SRR957915.unitigs_twice_21.abun_2.fa.gz
rm scrap.SRR957915.bcalm_twice.*
rchikhi commented 5 years ago

hi Bob, thanks much for spotting this. For perfectly circular unitigs, bcalm outputs them with the first kmer equal to the last kmer. This is indeed a bug. I'll work on fixing it.

rsharris commented 5 years ago

I should also have written that this isn't causing me any grief.

Also, in retrospect, it's unclear from my report whether the problem occurs in step 1 or 2 (or both).

Replacing step 2 with an equivalent using jellyfish reveals that the problem did occur during step 1.

rchikhi commented 5 years ago

the bug should be fixed now