Closed erin-thei closed 4 days ago
Sorry for my delayed reply.
I think the reason for this is that the reference is in lowercase and the k-mers are in uppercase. The FM-index search in v1.3.0 looks for exact, case-sensitive text matches. You could lowercase your k-mers for an easy fix, but I will make a change that automatically uppercases the reference and the queries.
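To illustrate the case mismatch described above, here is a minimal Python sketch. It is not unitig-caller's code: a plain substring search stands in for the FM-index lookup, and the sequences and function names are made up for the example.

```python
# Hypothetical lowercase reference and an uppercase query k-mer.
reference = "acgtacgttagcctagga"
kmer = "TAGCCTAG"


def exact_match(ref: str, query: str) -> bool:
    """Case-sensitive exact search, standing in for an FM-index lookup."""
    return query in ref


def normalized_match(ref: str, query: str) -> bool:
    """Uppercase both sides before searching, so case no longer matters."""
    return query.upper() in ref.upper()


print(exact_match(reference, kmer))          # uppercase query vs lowercase reference
print(exact_match(reference, kmer.lower()))  # workaround: lowercase the k-mer
print(normalized_match(reference, kmer))     # normalize both sides
```

The first call misses the k-mer only because of letter case. The second corresponds to the quick workaround of lowercasing the k-mers, and the third to normalizing both the reference and the queries.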
Hello,
First of all - thank you for this tool! I have been using this tool for some time now, but I recently discovered an unexpected behavior: a k-mer that is being identified in a genome in call mode is not being identified in that same genome in simple mode.
Commands Executed:
I ran the following command on a dataset of approximately 300 genomes:
unitig-caller --call --refs refs.txt --threads 1 --kmer 20 --out call_output
I then parsed the pyseer output and identified 3 k-mers that were highly specific and sensitive to a subset of that dataset (what I refer to as my “cases”).
Then, using another dataset composed of the case genomes from the previous dataset plus additional genomes, I ran the following command:
unitig-caller --simple --refs refs.txt --threads 1 --unitigs kmers.txt --out simple_output
Upon parsing the output, I noticed that the 3 k-mers were not identified in 8 of my case genomes, despite the k-mers being identified in those genomes in the initial call mode output. I know the k-mers are actually present in these 8 genomes because I blasted them to confirm, so I’m wondering what could be causing this.
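For anyone wanting to repeat this kind of presence check without BLAST, a rough Python sketch follows. It assumes standard FASTA input and searches both strands while ignoring letter case. The helper names and the example records are hypothetical, and this is not how unitig-caller itself searches.

```python
def read_fasta_sequences(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string, returned in uppercase."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmer_hits(fasta_lines, kmer):
    """Return headers of records containing the k-mer on either strand, ignoring case."""
    forward = kmer.upper()
    reverse = reverse_complement(kmer)
    hits = []
    for header, seq in read_fasta_sequences(fasta_lines):
        seq = seq.upper()
        if forward in seq or reverse in seq:
            hits.append(header)
    return hits


# Made-up records; with a real genome, pass an open file handle instead.
example = [">contig1", "acgtacgttagc", "ctagga", ">contig2", "GGGGCCCC"]
print(kmer_hits(example, "TAGCCTAG"))
```

Because the comparison uppercases both the sequence and the k-mer, a hit here alongside a miss from a case-sensitive search would point to letter case as the cause.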
I am attaching kmers.txt (the 3 k-mers) and an example genome in which the k-mer was identified using call mode, but not in the subsequent simple mode.
I am using version 1.3.0. Any insight would be greatly appreciated. I can also provide any additional information as well.
Thanks in advance!
unitig_caller.zip