k-merization makes no attempt to avoid Ns

rsharris commented 6 years ago

This is more of an observation than an issue. I'm not sure whether this would be considered a bug or a feature.

The kmerization (at least in MinHash.py lines 758-763) is agnostic to the alphabet. That's a feature if you're trying to compare, e.g., sets of proteins. But if you're trying to compare DNA that has some Ns in it, each N can produce K kmers that are unlikely to exist in the other set(s), reducing the similarity counts.

Might be worth letting the user know this happens. Either in the readme or maybe give a warning if a sequence with non-ACGT is encountered (assuming ACGT is the intended use).

I didn't test the following, but if a reference genome is softmasked, are the kmers containing lowercase going to match the same kmer containing uppercase?

In my case, admittedly atypical, I had several sets of kmers that I wanted to compare. So I converted them to fasta files that had a single fasta header (for each file), and with each line containing an N followed by a distinct kmer. Using MakeDNADatabase and QueryDNADatabase as shown in the README (except with --containment_threshold 0.0), what I got in the end was a csv file with nothing in it but the headers. This confused me, since my query file was one of the 10 files in my database -- I expected to at least see that fact reflected in the output.

The simple solution was to get rid of the Ns and give every kmer its own fasta header. That appears to have worked fine.

The node graph file was much bigger in the attempt that used Ns, consistent with the hypothesis that Ns are included in the kmers. But I still can't explain why it failed to report the perfect match.

dkoslicki commented 6 years ago

Yeah, a note of this issue would be good to put in the Readme. I'm actually making the (arguably poor, or arbitrary) choice to convert non-ACTG characters to G's. See these lines. In the comments, you can see there's an option to just ignore k-mers with non-ACTG characters. This is similar to how it's done in sourmash (the code my MinHash is based off of), but I can't recall why I chose to convert them to G's.

I would like to eventually support protein sequences, but when I wrote this class, khmer didn't support protein sequences (hence why the scripts have DNA in them, to hint at this).

Sequences are converted to all upper-case, so softmasking has no effect.

The odd behavior you observed with the N followed by a distinct k-mer is probably due to the N getting converted to G: khmer.Nodegraph leaves the N in the sequence, while I'm converting it to G. So this is definitely a bug, for both me and khmer. In khmer, apparently they treat N as a valid character (so you need to have the N in the exact same place for it to be detected).

rsharris commented 6 years ago

I was wondering if non-ACGT were treated the same way by all parts of the package.

Your response explains nearly everything I saw. Whether an N stays an N or becomes a G, either way I ended up with K+1 times as many kmers as I intended. And since something downstream treats the Ns differently, none of those wraparound kmers are counted (I presume).

The only thing that I can't explain is why the identical set wouldn't be reported as at least some match to itself. I had k=20, so it seem like the true commonality should have been 1/21.

At this point, for me, this issue is just a curiosity.

dkoslicki commented 4 years ago

Closing as this is now addressed by 18da15444b9c63 (i.e. now any k-mer that has any non-ACTG character is just ignored completely).

dkoslicki / CMash

k-merization makes no attempt to avoid Ns #9