-db and -kp options meaning

mglgc commented 1 month ago

First of all, thank you for making the GeneToCN code freely available to the scientific community. The meanings of the -db and -kp options for running the KmerToCN.py script are very confusing, and we cannot work out which specific files or folders must be used. The --help output does not explain them either. Could you please elaborate a bit more on how to use the -db and -kp options? Some examples would be of great help. Thank you very much in advance for your reply.

fannydhelia commented 1 month ago

Thank you for reaching out!

I agree that these options can be confusing. Currently, they are a bit of a workaround that lets gmer_counter count the k-mers while traversing the FASTQ files only once, instead of separately for each gene.

The simpler approach is to use the -g option for the k-mer databases for each gene (one file per gene), along with the -f option for providing the k-mers from the flanking region. For example: -g gene1_kmers.db gene2_kmers.db gene3_kmers.db -f flanking_kmers.db

If you are interested in only one gene, using -g with -f is the easiest method. However, for multiple genes, the FASTQ files will need to be read separately for each gene to count the k-mers, resulting in a longer runtime. Additionally, providing filenames for a large number of genes on the command line can be inconvenient. Also, only one flanking region may be provided this way.

To address this, I have included the -db and -kp options for now. The -db option takes a single k-mer database file containing all the k-mers from all genes and the flanking region. The -kp option then takes a text file that lists the paths to the original k-mer files, separately for each gene and the flanking region, so the program can identify which k-mers belong to which region. For example:

-db combined_kmers.db -kp genes_with_paths_to_original_kmers.txt

This is described briefly in the "Additional files needed" part of the README file. Essentially, for -db, all the k-mer databases have to be combined into one file. An example of the file format for the -kp option is also provided in the "Additional files needed" section.
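
To make the two modes easier to compare side by side, here is a minimal Python sketch that only assembles the option fragments mentioned above. The helper names (per_gene_options, combined_options) are hypothetical and not part of GeneToCN, the file names are the placeholders from this thread, and any other arguments KmerToCN.py needs are left out on purpose, so check the README and --help for the full command line.

```python
import shlex


def per_gene_options(gene_dbs, flanking_db):
    """Option fragment for the simpler mode: one k-mer file per gene (-g)
    plus a single flanking-region k-mer file (-f)."""
    if not gene_dbs:
        raise ValueError("at least one gene k-mer file is required")
    return ["-g", *gene_dbs, "-f", flanking_db]


def combined_options(combined_db, kp_file):
    """Option fragment for the workaround mode: one combined k-mer file (-db)
    plus a text file listing the original per-region k-mer files (-kp).
    The layout of kp_file is NOT produced here; follow the README example."""
    return ["-db", combined_db, "-kp", kp_file]


# Placeholder file names taken from the examples in this thread.
simple = per_gene_options(
    ["gene1_kmers.db", "gene2_kmers.db", "gene3_kmers.db"],
    "flanking_kmers.db",
)
combined = combined_options(
    "combined_kmers.db",
    "genes_with_paths_to_original_kmers.txt",
)

# shlex.join quotes each argument safely for pasting into a shell.
print(shlex.join(simple))
print(shlex.join(combined))
```

The returned lists can be appended to whatever base command you already use to call KmerToCN.py. Keeping them as lists rather than one string avoids quoting problems when file paths contain spaces.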

I am currently in the process of changing this functionality so the k-mer databases won't have to be combined by the user. I hope to update the version on GitHub in the coming weeks.

mglgc commented 1 month ago

Thanks so much for your well-explained reply. Just to put my two cents in, example commands shown together with their required input files are always pretty self-explanatory.

bioinfo-ut / GeneToCN

-db and -kp options meaning #6