bioinfo-ut / GeneToCN

Gene copy number prediction from k-mer frequencies
GNU General Public License v3.0

Creating kmer db #4

Open kendrasc opened 5 months ago

kendrasc commented 5 months ago

Using glistmaker, I created a .list file from a fasta file of the HG38 reference genome. I can build .db files from this .list file, and those .db files report locations on my chromosome of interest (11), but the k-mers themselves all appear to come from chromosome 1. The output gene sequence is also from chromosome 1.

What might be going on here?

I used the following commands:

glistmaker HG_38.fna -w 25 -o HG_38

python GeneToKmer.py Genes.txt HG_38.fna HG_38_25.list -o Genes_database/ -i -gt ${my_directory}/GenomeTester4/src/

And my gene locations were:

PGA3 G 11:61203515-61213098
PGA4 G 11:61222347-61231694
PGA5 G 11:61241175-61251444

fannydhelia commented 5 months ago

Right now it expects that the provided fasta file only includes the chromosome of interest. If your fasta file includes the whole reference, the easiest way is to extract chromosome 11 to a different file and use that instead.
This should fix the issue for you, but I will keep in mind that it might be a good idea to allow files containing the whole reference. Thanks for the feedback, and let me know if there are any other issues!
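
For anyone who wants to do the extraction without extra tools, here is a minimal Python sketch that copies a single record out of a multi-sequence fasta file. It is not part of GeneToCN; the function name `extract_sequence` and the output file name are made up for illustration. It assumes the sequence ID is the first whitespace-delimited token of the header line, and the exact ID of chromosome 11 depends on where the reference came from (for example `chr11` in UCSC-style files or `NC_000011.10` in NCBI RefSeq GRCh38 files), so check your own headers first.

```python
def extract_sequence(fasta_path, seq_id, out_path):
    """Copy the record whose header ID equals seq_id into out_path.

    Returns True if the record was found, False otherwise.
    The ID is taken to be the first whitespace-delimited token after '>'.
    """
    found = False
    writing = False
    with open(fasta_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                if found:
                    break  # the wanted record is complete, stop reading
                fields = line[1:].split()
                # Exact match, so "chr1" does not also select "chr11".
                writing = bool(fields) and fields[0] == seq_id
                found = writing
            if writing:
                dst.write(line)
    return found


# Hypothetical usage (file names and the ID are assumptions):
# extract_sequence("HG_38.fna", "NC_000011.10", "HG_38_chr11.fna")
```

The resulting single-chromosome file would then take the place of the whole-reference fasta file. The ID is compared exactly rather than as a substring because a plain substring search for `chr1` would also match `chr11`.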

kendrasc commented 5 months ago

Thanks for the quick response! How might I change it to allow using files with the whole reference?

fannydhelia commented 5 months ago

I can make some changes in the code and let you know when the code has been updated. Could you give me an example of the format for the sequence id / header row of the reference fasta file you use? However, if you don't wish to wait for the update (I will try to get this added asap), right now the fastest solution for you is still to extract chromosome 11 (either with grep, samtools or something else), write it to a separate file and use that.
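
To find the header format in a large reference file without opening it in an editor, a short Python sketch like the following lists the first few header rows. The function name `list_fasta_headers` and the `limit` parameter are hypothetical and only for illustration; the file name in the usage comment is taken from the commands earlier in this thread.

```python
def list_fasta_headers(fasta_path, limit=30):
    """Return up to `limit` header rows (without the leading '>') from a fasta file."""
    headers = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                headers.append(line[1:].rstrip("\r\n"))
                if len(headers) >= limit:
                    break  # stop once enough headers have been collected
    return headers


# Hypothetical usage:
# for header in list_fasta_headers("HG_38.fna"):
#     print(header)
```

The printed lines show both the sequence ID (the first token) and any description that follows it, which is the information asked for above.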