puppy-align can not found "_cds" substring

Wednesdaysama commented 7 months ago

Hi Tropinis,

I tested the data you provided with the command:

puppy-align -pr INPUT_primerTarget -nt INPUT_nonTarget -o output

It worked well, creating the necessary files and directories (ResultDB.tsv, align_logfile.txt, mmseqs_tmp, and tmp) in the output directory.

However, when I ran the command with different data:

puppy-align -pr sodalinema -nt geitlerinemaceae_rest -o output

I encountered an error where the "_cds" substring could not be found:

Traceback (most recent call last):
  File "/home/lianchun.yi1/software/miniconda3/envs/puppy/bin/puppy-align", line 287, in <module>
    i = name.index("_cds")
ValueError: substring not found

This happens even though the input files do contain the "_cds" substring in their headers. For example, one of the input FASTA files looks like this:

>lcl|SMDP01000001.1_cds_NMG56934.1_1 [locus_tag=E1H12_00005] [protein=hypothetical protein] [protein_id=NMG56934.1] [location=complement(136..555)] [gbkey=CDS]
ATGAATAACAATTTCAATATCAAAAACTTCAACGCCAACAATGCTGCCATAAACCTAGGTGGTACTGTCGAAGGCGATCA
GATTGGGACGAATCATAATCAGACCACAAATGCAGAAGTAAAACAAGCCGTCGCTGACTTGCAAGCTCTTCTCGCTGACC
TGGAGACTCAACATCCCCAGGTCAGCAGTGAACAGGAAGCCTCAGCCATTATCGAGGCTGAATTCACCGAAATTCGCGAA
ACTCCAAACCACCGACTGGCCATCCTCCGCAAGCAGATCCTTAACCCAGAACGTCACCTGCAAGCCATTAAAGCCACCTC
GATCGAAGTGGCAAAGTCTGCCTATGAAAAGAGTATCCTAGCCAAGGCGGTAATCACATACTTGGATAAACTGAGCGAAA
CCCCCGATCGCGGATTGTGA

If you have any suggestions or comments on how to resolve this issue, I would greatly appreciate it ;)

Thank you, Lianchun

hghezzi commented 7 months ago

Hi!

Thank you for providing very detailed steps of what you tried!

Could you please include a few examples of the filenames (i.e. CDS files ending in .fna) in the folders sodalinema and geitlerinemaceae_rest? Could you also share at least one FASTA header from any of these files, as it appeared before running puppy-align?
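If it helps, here is a minimal sketch for collecting that information. It only assumes that the input files end in .fna and sit directly inside the two folders named above; the helper names first_header and summarize_folder are made up for this example and are not part of PUPpy.

```python
from pathlib import Path


def first_header(fasta_path):
    """Return the first FASTA header line (the one starting with '>'), or None."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                return line.rstrip("\n")
    return None


def summarize_folder(folder):
    """Map each .fna filename in `folder` to its first FASTA header."""
    return {
        path.name: first_header(path)
        for path in sorted(Path(folder).glob("*.fna"))
    }


if __name__ == "__main__":
    # Folder names taken from the command in this thread.
    for folder in ("sodalinema", "geitlerinemaceae_rest"):
        for name, header in summarize_folder(folder).items():
            print(f"{folder}/{name}\t{header}")
```

Each printed line pairs a filename with its first header, which makes it easy to see whether "_cds" appears in the filename, in the header, or in both.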

Thanks!

Wednesdaysama commented 7 months ago

Hi Hans,

Thanks for your consideration.

They have names like GCA_020386575.1.fna, GCA_004299065.1.fna, etc. They were downloaded from NCBI (Genomic coding sequences). Every input .fna file contains multiple headers.

I've attached some of the files for you to look over.

Lianchun

geitlerinemaceae_rest.zip

hghezzi commented 7 months ago

Hi Lianchun!

Thank you again for your response and sorry for the confusion about the instructions.

It seems the issue did indeed occur because of the filenames, which should contain the string "cds". For example, your file should be named "GCA_020386575_cds.fna". You can find more details on naming requirements in the input section of the GitHub documentation. Also, based on my personal experience, I like changing the GCA... names to something like "Genus_species_cds.fna" so that the downstream outputs are even easier to interpret, but this is totally up to you ;)
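A rough sketch of how the renaming could be scripted, in case there are many files. The functions cds_name and rename_folder are hypothetical helpers written for this example, not part of PUPpy. Dropping the trailing assembly version (".1") is an assumption made only to match the example filename above.

```python
import re
from pathlib import Path


def cds_name(filename):
    """Return a name containing "_cds", e.g. GCA_020386575.1.fna -> GCA_020386575_cds.fna."""
    stem = Path(filename).stem            # drop the ".fna" extension
    if "cds" in stem:
        return filename                   # already contains the required string
    stem = re.sub(r"\.\d+$", "", stem)    # assumption: drop a trailing version such as ".1"
    return f"{stem}_cds.fna"


def rename_folder(folder, dry_run=True):
    """Rename every .fna file in `folder`; with dry_run=True only return the plan."""
    plan = []
    for path in sorted(Path(folder).glob("*.fna")):
        target = path.with_name(cds_name(path.name))
        if target == path:
            continue                      # nothing to do for this file
        if target.exists():
            # Refuse to overwrite, e.g. when two versions map to the same name.
            raise FileExistsError(f"{target} already exists")
        plan.append((path.name, target.name))
        if not dry_run:
            path.rename(target)
    return plan
```

Calling rename_folder("sodalinema") first returns the planned (old, new) pairs without touching anything; passing dry_run=False performs the renames.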

Please let me know if this fixes the issue or if we need to do more troubleshooting :)

Hans

Wednesdaysama commented 7 months ago

Hi Hans, it worked! The filenames were the problem. Thank you so much! 💯
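For anyone who lands here with the same traceback, below is a small reproduction of why the error can appear even when the FASTA headers contain "_cds". It assumes that `name` in the failing line is derived from the input filename rather than from the headers. This matches the fix in this thread, but the surrounding PUPpy code is not quoted here, so treat that as an assumption. The helpers find_cds and files_missing_cds are made up for illustration.

```python
def find_cds(name):
    """Mimic the failing line: str.index raises ValueError when "_cds" is absent."""
    return name.index("_cds")


def files_missing_cds(filenames):
    """Return the filenames that would raise ValueError in find_cds."""
    return [name for name in filenames if "_cds" not in name]


if __name__ == "__main__":
    names = ["GCA_020386575.1.fna", "GCA_020386575_cds.fna"]
    for name in names:
        try:
            print(name, "->", find_cds(name))
        except ValueError as err:
            print(name, "-> ValueError:", err)
    print("Need renaming:", files_missing_cds(names))
```

Unlike str.find, which returns -1 when nothing matches, str.index raises ValueError, so a single filename without "_cds" is enough to stop the run.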

Tropini-lab / PUPpy

puppy-align can not found "_cds" substring #10