bacpop / unitig-caller

Methods to determine sequence element (unitig) presence/absence
Apache License 2.0

K-mers Not Being Identified #32

Closed erin-thei closed 4 days ago

erin-thei commented 3 weeks ago

Hello,

First of all - thank you for this tool! I have been using it for some time now, but I recently discovered some unexpected behavior: a k-mer that is identified in a genome in call mode is not identified in that same genome in simple mode.

Commands Executed:

I ran the following command on a dataset of approximately 300 genomes:

unitig-caller --call --refs refs.txt --threads 1 --kmer 20 --out call_output

I then parsed the pyseer output and identified 3 k-mers that were highly specific and sensitive for a subset of that dataset (what I refer to as my “cases”).

Then, using another dataset composed of the case genomes from the previous dataset plus additional genomes, I ran the following command:

unitig-caller --simple --refs refs.txt --threads 1 --unitigs kmers.txt --out simple_output

Upon parsing the output, I noticed that the 3 k-mers were not identified in 8 of my case genomes, even though they were identified in those genomes in the initial call mode output. I know the k-mers are actually present in these 8 genomes because I confirmed this with BLAST, so I’m wondering what could be causing this.

I am attaching kmers.txt (the 3 k-mers) and an example genome in which the k-mers were identified in call mode but not in the subsequent simple mode.

I am using version 1.3.0. Any insight would be greatly appreciated. I can also provide any additional information as well.

Thanks in advance!

unitig_caller.zip

johnlees commented 4 days ago

Sorry for my delayed reply.

I think the reason for this is that the reference is in lowercase and the k-mers are in uppercase. The FM-index in v1.3.0 looks for exact text matches, so the search is case-sensitive. You could lowercase your k-mers for an easy fix, but I will make a change that automatically uppercases the reference and queries.
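
For anyone hitting the same thing, here is a minimal Python sketch of the mismatch and of the lowercasing workaround. It is only an illustration: a plain substring test stands in for the exact-match lookup, the sequences are made up, and the helper names (`has_exact_match`, `lowercase_kmers`) are hypothetical and not part of unitig-caller. It also assumes the query file has one k-mer per line with no header line.

```python
def has_exact_match(sequence, kmer, ignore_case=False):
    """Plain substring search, used here as a stand-in for an exact text match."""
    if ignore_case:
        # Put both strings in the same case before comparing.
        sequence, kmer = sequence.upper(), kmer.upper()
    return kmer in sequence


def lowercase_kmers(lines):
    """Lowercase one-k-mer-per-line input, skipping blank lines (assumes no header)."""
    return [line.strip().lower() for line in lines if line.strip()]


# Made-up example data: a lowercase reference and an uppercase query.
genome = "ttgacctagcatcgatcgga"
kmer = "CCTAGCATCG"

print(has_exact_match(genome, kmer))                    # False: case differs
print(has_exact_match(genome, kmer, ignore_case=True))  # True: same bases
print(has_exact_match(genome, lowercase_kmers([kmer + "\n"])[0]))  # True after lowercasing
```

Writing the result of `lowercase_kmers` back to a new file, one k-mer per line, and passing that file to `--unitigs` should correspond to the easy fix described above. BLAST finds the k-mers either way because it ignores letter case when matching.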