bioinfo-ut / GeneToCN

Gene copy number prediction from k-mer frequencies
GNU General Public License v3.0

-db and -kp options meaning #6

Closed: mglgc closed this issue 1 month ago

mglgc commented 1 month ago

First of all, thank you for making the GeneToCN code freely available to the scientific community. The meanings of the -db and -kp options for running the KmerToCN.py script are very confusing, and we cannot work out which specific files or folders must be used. The --help output does not explain them either. Could you please elaborate a bit more on how to use the -db and -kp options? Some examples would be of great help. Thank you very much in advance for your reply.

fannydhelia commented 1 month ago

Thank you for reaching out!

I agree that these options can be confusing. Currently, they are a bit of a workaround that lets gmer_counter count the k-mers while traversing the FASTQ files only once, instead of separately for each gene.

The simpler approach is to use the -g option for the k-mer databases for each gene (one file per gene), along with the -f option for providing the k-mers from the flanking region. For example: -g gene1_kmers.db gene2_kmers.db gene3_kmers.db -f flanking_kmers.db

If you are interested in only one gene, using -g with -f is the easiest method. However, for multiple genes, the FASTQ files will need to be read separately for each gene to count the k-mers, resulting in a longer runtime. Additionally, providing filenames for a large number of genes on the command line can be inconvenient. Also, only one flanking region may be provided this way.

To address this, I have included the -db and -kp options for now. The -db option allows you to provide a single k-mer database file containing all the k-mers from all genes and the flanking region. The -kp option is then used to provide the paths to the original k-mer files, separately for each gene and for the flanking region, so the program can identify which k-mers belong to which region. For example:

-db combined_kmers.db -kp genes_with_paths_to_original_kmers.txt
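To illustrate why a combined database plus a per-region listing allows a single pass over the reads, here is a minimal Python sketch of the idea. It is a toy under stated assumptions and not the actual GeneToCN or gmer_counter implementation: the function names, the in-memory sets, and the tiny reads are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_single_pass(reads, region_kmers, k):
    """Count region k-mers while visiting each read exactly once.

    region_kmers maps a region name to the set of k-mers listed for it
    (conceptually, the role of the per-region files named through -kp).
    """
    # One combined lookup (conceptually, the role of the single -db file).
    combined = set().union(*region_kmers.values())
    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer in combined:
                counts[kmer] += 1
    # Split the combined counts back into one table per region.
    return {
        region: {kmer: counts[kmer] for kmer in sorted(region_set)}
        for region, region_set in region_kmers.items()
    }


# Hypothetical toy data: two short reads, one gene and one flanking region.
reads = ["ACGTAC", "GTACGT"]
region_kmers = {
    "gene1": {"ACG", "CGT"},
    "flanking": {"TAC"},
}
result = count_single_pass(reads, region_kmers, k=3)
print(result)
```

With the -g and -f route, the equivalent of the outer loop over reads would be repeated once per gene, which is where the longer runtime comes from.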

This is described briefly in the "Additional files needed" part of the README file. Essentially, for -db, all the k-mer databases have to be combined into one file. An example of the file format for the -kp option is also provided in the "Additional files needed" section.
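As a rough summary of the two ways of passing the k-mer files, the following Python sketch assembles only the relevant part of the argument list. The helper name `region_args` is hypothetical, the rule that the two styles are not mixed is an assumption made for this sketch, and any other arguments that KmerToCN.py requires are deliberately left out.

```python
def region_args(gene_dbs=None, flanking_db=None, combined_db=None, paths_file=None):
    """Return the k-mer related part of a KmerToCN.py argument list.

    Hypothetical helper: it covers only -g/-f and -db/-kp, and treating
    the two styles as mutually exclusive is an assumption of this sketch.
    """
    if combined_db is not None or paths_file is not None:
        if combined_db is None or paths_file is None:
            raise ValueError("-db and -kp are expected together")
        if gene_dbs or flanking_db is not None:
            raise ValueError("use either -g/-f or -db/-kp, not both")
        return ["-db", combined_db, "-kp", paths_file]
    if not gene_dbs or flanking_db is None:
        raise ValueError("-g needs at least one database and -f a flanking database")
    return ["-g", *gene_dbs, "-f", flanking_db]


# Style 1: one database per gene plus one flanking database.
per_gene = region_args(
    gene_dbs=["gene1_kmers.db", "gene2_kmers.db", "gene3_kmers.db"],
    flanking_db="flanking_kmers.db",
)
print(" ".join(per_gene))

# Style 2: one combined database plus the file listing the original paths.
combined = region_args(
    combined_db="combined_kmers.db",
    paths_file="genes_with_paths_to_original_kmers.txt",
)
print(" ".join(combined))
```

The file names are the ones used in the examples in this thread; the content of the -kp file itself should follow the format shown in the README.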

I am currently in the process of changing this functionality so that the k-mer databases won't have to be combined by the user. I hope to update the version on GitHub in the coming weeks.

mglgc commented 1 month ago

Thanks so much for your well-explained reply. Just to put my two cents in: a few example commands, together with their required input files, always make things much clearer.