allind / EukDetect

invalid escape sequence '\d' on get_uncomputed_taxid_per_busco #50

Open ailtonpcf opened 2 weeks ago

ailtonpcf commented 2 weeks ago

Dear Dr. Lind,

I'm generating a custom EukDetect database and I'm stuck at get_uncomputed_taxid_per_busco.py. It fails with the following message:

"""
python /home/qi47rin/proj/00-git/EukDetect/build_db/get_uncomputed_taxid_per_busco.py --speciestax cache/45-create-eukdetect-db/genomes-table/species_taxid.tsv --fasta cache/45-create-eukdetect-db/genes-repeat-filtered/buscos_cdhit99_less10perc_repeats_masked.fna --collapsed_ids cache/45-create-eukdetect-db/busco-cdhit99-renamed/buscos_cdhit99_renamed_busco_seqid_sequential_correspondence.txt --taxdb cache/45-create-eukdetect-db/taxdump/taxa.sqlite > cache/45-create-eukdetect-db/busco-taxid/busco_taxid_link.txt

Activating conda environment: cache/00-conda-env/bdf327b44096dcc3f601392a860ec146_
/home/qi47rin/proj/00-git/EukDetect/build_db/get_uncomputed_taxid_per_busco.py:27: SyntaxWarning: invalid escape sequence '\d'
  sp = re.split('-\dat\d-', '-'.join(seq.id.split('-')[1:]))[0]
/home/qi47rin/proj/00-git/EukDetect/build_db/get_uncomputed_taxid_per_busco.py:46: SyntaxWarning: invalid escape sequence '\d'
  new = re.split('-\dat\d-', '-'.join(sp.split('-')[1:]))[0]
Traceback (most recent call last):
  File "/home/qi47rin/proj/00-git/EukDetect/build_db/get_uncomputed_taxid_per_busco.py", line 79, in <module>
    main(sys.argv)
  File "/home/qi47rin/proj/00-git/EukDetect/build_db/get_uncomputed_taxid_per_busco.py", line 67, in main
    tree = ncbi.get_topology(taxids)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/qi47rin/proj/02-compost-microbes/cache/00-conda-env/bdf327b44096dcc3f601392a860ec146/lib/python3.12/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 463, in get_topology
    root = elem2node[1]
KeyError: 1
"""

I've attached the files I generated, except the taxdump because of its size. Do you know what might be happening?

Another question: in the help text of the script, when you say "Tab delimited file of species name (as encoded in busco header) and taxonomy ID", do you mean the headers in the fasta file?

Best regards,
Ailton.
[euk-db-asp3.zip](https://github.com/user-attachments/files/16051942/euk-db-asp3.zip)

allind commented 2 weeks ago

Hi Ailton,

Thanks for providing your input files. I've found several problems, and fixing them should help.

First, the file you passed to --collapsed_ids is the wrong one. You need something named like buscos_cdhit99_collapsed_seqnames; what you're using as input is for renaming RepeatMasker files. That's not what's causing this error, but it will cause problems downstream because none of the collapsed sequences will have taxids in the busco_taxid_link.txt file.

I believe this error comes from problems with the busco fasta headers and the species name file. I was able to fix it by reformatting the species names. First, the fasta headers all have a long prefix that needs to go. The leading path (cache/somethingeukdb/somethingelse/) should be removed entirely, and your busco headers should be formatted like this: [group]-[speciesname]-[busco]-[duplication status]. For example, cache/45-create-eukdetect-db/busco/fungi-tax5061-Aspergillus_niger_strain_KYF3-fungi-tax5061-Aspergillus_niger_strain_KYF3-1645187at2759-S1 should become fungi-tax5061-Aspergillus_niger_strain_KYF3-1645187at2759-S1.
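The header cleanup described above could be sketched roughly as follows. This is only an illustration, not part of EukDetect: the function name `clean_header` is hypothetical, and it assumes that a BUSCO ID always looks like `<digits>at<digits>` and that species names contain no hyphens of their own.

```python
import re

# Assumption: a BUSCO ID field looks like "<digits>at<digits>", e.g. "1645187at2759".
BUSCO_ID = re.compile(r"\d+at\d+")


def clean_header(header: str) -> str:
    """Strip a leading path and a duplicated group-species block from a header."""
    # Drop everything up to and including the last "/".
    name = header.rsplit("/", 1)[-1]
    fields = name.split("-")

    # Find the BUSCO ID field; everything before it is the group/species part.
    idx = next((i for i, f in enumerate(fields) if BUSCO_ID.fullmatch(f)), None)
    if idx is None:
        return name  # no BUSCO ID found, leave the name untouched

    prefix, rest = fields[:idx], fields[idx:]

    # If the prefix is the same block written twice, keep a single copy.
    half = len(prefix) // 2
    if half > 0 and len(prefix) % 2 == 0 and prefix[:half] == prefix[half:]:
        prefix = prefix[:half]

    return "-".join(prefix + rest)


raw = (
    "cache/45-create-eukdetect-db/busco/"
    "fungi-tax5061-Aspergillus_niger_strain_KYF3-"
    "fungi-tax5061-Aspergillus_niger_strain_KYF3-1645187at2759-S1"
)
cleaned = clean_header(raw)
print(cleaned)
```

A header that is already in the [group]-[speciesname]-[busco]-[duplication status] form passes through unchanged, so running the function twice is harmless.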

In the species_taxids.txt file, you don't want fasta headers; you just want the species portion of the fasta header. What you have is close to correct: remove the group prefix (fungi-) and you should be good. As a matter of personal preference, I would leave the taxid out of the species name in the header, but I think this should work as is.
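To see why the group prefix has to be left out of the species file, here is a small sketch modeled on the two `re.split` lines quoted in the SyntaxWarning output. It is an approximation, not the script itself: the `+` quantifiers are an assumption added so the pattern matches multi-digit BUSCO IDs, and `species_from_header` and the one-entry `sp_taxids` table are hypothetical. The pattern is written as a raw string, which is the usual way to avoid the `invalid escape sequence '\d'` warning on Python 3.12; that warning is separate from the KeyError.

```python
import re

# Raw string: "\d" reaches the regex engine unchanged, so Python 3.12 does not
# emit "SyntaxWarning: invalid escape sequence".
# Assumption: "\d+" is used here so multi-digit BUSCO IDs are matched.
BUSCO_SEP = re.compile(r"-\d+at\d+-")


def species_from_header(header: str) -> str:
    """Drop the leading group field, then cut the header at the BUSCO ID."""
    without_group = "-".join(header.split("-")[1:])
    return BUSCO_SEP.split(without_group)[0]


# Hypothetical species-to-taxid table, keyed WITHOUT the "fungi-" group prefix.
sp_taxids = {"tax5061-Aspergillus_niger_strain_KYF3": 5061}

header = "fungi-tax5061-Aspergillus_niger_strain_KYF3-1645187at2759-S1"
species = species_from_header(header)

print(species)                           # the key that gets looked up
print(species in sp_taxids)              # lookup succeeds without the prefix
print("fungi-" + species in sp_taxids)   # a prefixed key would not be found
```

Under these assumptions, the key extracted from a header never carries the group prefix, so a species file whose first column starts with `fungi-` cannot match it.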

Hope that's helpful.

ailtonpcf commented 1 week ago

Dear Dr. Lind,

Thank you very much for your support and time. I still could not move forward :/ I tried to modify my files as you instructed, and they are attached here. I noticed that only the first column (e.g. fungi-Aspergillus_SPCollapse_SP3) of "buscos_cdhit_collapsed_seqnames.txt" has matches in "buscos_cdhit99_less10perc_repeats_masked.fna". For instance, the first header (fungi-Aspergillus_niger_strain_KYF3-fungi-Aspergillus_niger_strain_KYF3-39650at2759-S1) is not found in the seqnames file. Is that expected?

Another question: is this repetition of the species name generated by the scripts, or is it something happening only on my side? For instance, "fungi-Aspergillus_niger_strain_KYF3-fungi-Aspergillus_niger_strain_KYF3...".

Best, Ailton.

"""
rule generate_uncomputed_taxid_per_busco:
    input: cache/45-create-eukdetect-db/genes-repeat-filtered-no-header-duplicates/buscos_cdhit99_less10perc_repeats_masked.fna, cache/45-create-eukdetect-db/genes-clustered-99perc/buscos_cdhit_collapsed_seqnames.txt, cache/45-create-eukdetect-db/taxdump/taxa.sqlite, cache/45-create-eukdetect-db/00-speciestbltaxid-status/fungi--Aspergillus_flavus_strain_K54A-tax5059-GCA_012896555.1_ASM1289655v1_genomic--Chromosome.done, cache/45-create-eukdetect-db/00-speciestbltaxid-status/fungi--Aspergillus_niger_strain_KYF3-tax5061-GCA_029783925.1_ASM2978392v1_genomic--Chromosome.done, cache/45-create-eukdetect-db/00-speciestbltaxid-status/fungi--Aspergillus_oryzae_strain_KBP3-tax5062-GCA_008032055.1_ASM803205v1_genomic--Chromosome.done
    output: cache/45-create-eukdetect-db/busco-taxid/busco_taxid_link.txt
    jobid: 0
    reason: Missing output files: cache/45-create-eukdetect-db/busco-taxid/busco_taxid_link.txt
    resources: mem_mb=10000, disk_mb=0, tmpdir=/scratch/qi47rin, partition=standard, qos=normal, time=3-00:00:0

    python /home/qi47rin/proj/00-git/EukDetect/build_db/get_uncomputed_taxid_per_busco.py --speciestax cache/45-create-eukdetect-db/genomes-table/species_taxid.tsv --fasta cache/45-create-eukdetect-db/genes-repeat-filtered-no-header-duplicates/buscos_cdhit99_less10perc_repeats_masked.fna --collapsed_ids cache/45-create-eukdetect-db/genes-clustered-99perc/buscos_cdhit_collapsed_seqnames.txt --taxdb cache/45-create-eukdetect-db/taxdump/taxa.sqlite > cache/45-create-eukdetect-db/busco-taxid/busco_taxid_link.txt

Activating conda environment: cache/00-conda-env/bdf327b44096dcc3f601392a860ec146_
/home/qi47rin/proj/00-git/EukDetect/build_db/get_uncomputed_taxid_per_busco.py:27: SyntaxWarning: invalid escape sequence '\d'
  sp = re.split('-\dat\d-', '-'.join(seq.id.split('-')[1:]))[0]
/home/qi47rin/proj/00-git/EukDetect/build_db/get_uncomputed_taxid_per_busco.py:46: SyntaxWarning: invalid escape sequence '\d'
  new = re.split('-\dat\d-', '-'.join(sp.split('-')[1:]))[0]
Traceback (most recent call last):
  File "/home/qi47rin/proj/00-git/EukDetect/build_db/get_uncomputed_taxid_per_busco.py", line 79, in <module>
    main(sys.argv)
  File "/home/qi47rin/proj/00-git/EukDetect/build_db/get_uncomputed_taxid_per_busco.py", line 52, in main
    taxids = [sp_taxids[sp] for sp in other_species]
KeyError: 'Aspergillus_oryzae_strain_KBP3-fungi-Aspergillus_oryzae_strain_KBP3'
"""

[euk-db-asp4-v1.1.zip](https://github.com/user-attachments/files/16176420/euk-db-asp4-v1.1.zip)

ailtonpcf commented 6 days ago

Dear Dr. Lind,

I managed to get the script working by removing the "fungi-" part. Thank you.