bioinfo-ut / GeneToCN

Gene copy number prediction from k-mer frequencies
GNU General Public License v3.0

File format for loc_file in GeneToKmer.py ? #2

Closed MCorentin closed 7 months ago

MCorentin commented 7 months ago

Hi,

I am trying to create a kmer database with GeneToKmer.py. I managed to create the kmer list from my reference genome with glistmaker, but I get an error when trying to build the kmer db itself.

What file format is expected for the "loc_file" argument of GeneToKmer.py?

I tried the following format as a test:

KIV2 G chr6:161032565-161067901
LPA F chr6:160952515-161087407

But I get the following error:

print("Region " + loc_strings[1])
IndexError: list index out of range

Please find below the commands I used:

${scripts_folder}/GenomeTester4/src/glistmaker --num_threads 8 hg19_min.fa -w 25 -o kmer_db/hg19_min

python3 ${scripts_folder}/GeneToCN-main/GeneToKmer.py KIV2_LPA.locations hg19_min.fa kmer_db/hg19_min_25.list -gt ${scripts_folder}/GenomeTester4/ -o LPA_kmers -i

Thanks, Corentin

fannydhelia commented 7 months ago

Hi,

This would happen if the input file used a space as the separator between the first two fields but some other character (a tab, for example) between the second and third fields. It doesn't look like that is the case here (I couldn't replicate the error using these two rows as the region file), but it is worth checking. Could you provide the whole output, including the lines printed before the error occurred?

The expected output would be:

Reading the reference sequence
Reading the location file
Finding unique k-mers for KIV2
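For the separator problem described above, a quick check of the region file can be sketched in Python. This is only an illustration, not code from GeneToCN: it assumes every row holds exactly three fields separated by single spaces, as in the rows posted in this thread, and the function names are made up for the example.

```python
import re


def check_location_line(line, expected_fields=3):
    """Return a list of problems found in one row of a location file.

    Assumption (not taken from GeneToKmer.py itself): a valid row has
    `expected_fields` fields separated by single spaces.
    """
    problems = []
    row = line.rstrip("\n")
    # A tab next to space separators is the situation described in the thread.
    if "\t" in row and " " in row:
        problems.append("mixes tabs and spaces")
    if re.search(r" {2,}", row):
        problems.append("contains consecutive spaces")
    n_fields = len(row.split(" "))
    if n_fields != expected_fields:
        problems.append(
            f"{n_fields} space-separated field(s), expected {expected_fields}"
        )
    return problems


def check_location_text(text):
    """Map 1-based row numbers to their problems; blank rows are skipped."""
    report = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        problems = check_location_line(line)
        if problems:
            report[number] = problems
    return report
```

Calling check_location_text(open("KIV2_LPA.locations").read()) returns an empty dictionary when every row passes these checks, and otherwise lists the offending row numbers.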

Other notes: At the moment, the input fasta file should include only the chromosome of interest (chr 6 in this case). However, the .list file should be made using the whole reference sequence.
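If the reference is a multi-chromosome FASTA, one way to produce a single-chromosome input file is sketched below. It is a plain-Python illustration that assumes the record of interest is named exactly "chr6" in its header line; the output name chr6.fa and the function names are hypothetical.

```python
def extract_record(fasta_lines, name):
    """Yield the lines of the FASTA record whose ID equals `name`.

    The ID is taken to be the first whitespace-delimited token after '>'.
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            # Exact match, so "chr6" does not also select "chr61" or similar.
            keep = bool(header) and header[0] == name
        if keep:
            yield line


def write_chromosome(src_path, dst_path, name):
    """Copy one record from src_path into a new FASTA file at dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(extract_record(src, name))


# Example call (file names are placeholders for illustration):
# write_chromosome("hg19_min.fa", "chr6.fa", "chr6")
```

The k-mer list itself would still be built from the whole reference, as noted above; only the FASTA passed to GeneToKmer.py is reduced.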

Best, Fanny

MCorentin commented 7 months ago

Hi,

It was indeed an error with the separator; it is working now with the corrected region file.

Thanks for the information. I will keep the list and change the input FASTA file.

Regards, Corentin