bioinfo-ut / GeneToCN

Gene copy number prediction from k-mer frequencies
GNU General Public License v3.0
9 stars 1 forks source link

very long GeneToKmer running times #7

Closed mglgc closed 1 month ago

mglgc commented 1 month ago

Hi, In a computing environment with 500 Gb RAM and 32 cores, the Kmer database creation of two flanking regions for one gene takes pretty long time, said more than 3 days. The reference genome is 50 Mbp, the sequence length of the chromosome containing the gen is 948629, the gen length is 2857, and the region file looks like this:

GENMUT F contig18.1:1-746636 GENMUT F contig18.1:749495-936823

Please, could you give me any advice what is going wrong? Thank you very much in advance.

fannydhelia commented 1 month ago

Hi,

Please use unique names in the region file for different regions (for example GENMUT1 and GENMUT2), otherwise it should overwrite files created previously with the same name. Only one k-mer database file should be given to KmerToCN for the flanking region, so if you are planning to use k-mers from both of these regions, give the coordinates for both of these on the same row to create one k-mer database file based on these: GENMUT F contig18.1:1-746636 contig18.1:749495-936823 Sorry it is not better described in the README, I will add more details there.

About the time usage, what is the k-mer length you are using and the number of mismatches that are used for filtering (the default is 1 if is not specified)? The most time consuming part is creating k-mer database for flanking regions. We want those to be long enough to get enough k-mers for an accurate estimation of what the k-mer frequency should be in a single-copy region. Using 1 mismatch to check the uniqueness is encouraged to keep the k-mers that are more reliable (not affected by sequencing errors for example). However, when using even only 1 mismatch, for each original k-mer in the region, all possible k-mers with 1 mismatch would also be looked up from the reference list, significantly increasing the time required. I would suggest using smaller flanking regions (try limiting it to 50-100K for example). If the number on k-mers seems to be too small (I have tested mostly with around 2000 k-mers, but a few hundred might be enough, depending on how uniform is the coverage), you could also try to use longer regions, but with -mm 0 option.

I have also managed to speed up glistquery (which is the GenomeTester4 program used for this part) keeping the reference list file on a faster disk, maybe this helps, in case this is an option for you.

To skip this process entirely for human genome, the Kmer_db/Ref_kmers contains lists of suitable unique k-mers for each chromosome that can be used for reference (I will update soon and include a script to create a k-mer database for each gene region based on those). I haven't tested GeneToCN on other organisms.

mglgc commented 1 month ago

Sorry for delayed reply but I was trying some runnings following your advices. Thank you so much.