DessimozLab / read2tree

a tool for inferring species tree from sequencing reads
MIT License

record name cleaning #20

Closed M-Zeeb closed 1 year ago

M-Zeeb commented 1 year ago

Hi,

thanks for the great tool.

I stumbled upon a small issue while following the instructions for obtaining viral marker genes (HIV in my case). It seems that "clean_fasta_cdnacds.py" does not clean the record names sufficiently: underscores ("_") left in the names caused "KeyError"s at various downstream steps, for example when generating the references. I may have misunderstood the instructions, but after manually removing all underscores the problem was resolved.

Here is an example of the error:

Example name: "02495|KC156214.1_AGF30950.1_2 [02495]"
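The manual workaround described above (removing all underscores from the record names) could be scripted roughly as follows. This is only a sketch of that workaround, not the logic of "clean_fasta_cdnacds.py"; the helper names `strip_underscores` and `clean_fasta_lines` are hypothetical, and the sketch assumes the record ID is the first whitespace-delimited token of the header line.

```python
def strip_underscores(header: str) -> str:
    """Remove underscores from the ID token of a FASTA header line.

    Hypothetical helper: lines that are not headers are returned unchanged.
    """
    if not header.startswith(">"):
        return header  # sequence line, leave untouched
    # Split the header into the ID (first token) and the free-text description.
    record_id, sep, description = header[1:].partition(" ")
    return ">" + record_id.replace("_", "") + sep + description


def clean_fasta_lines(lines):
    """Yield FASTA lines with underscores stripped from record IDs only."""
    for line in lines:
        yield strip_underscores(line.rstrip("\n"))


example = ">02495|KC156214.1_AGF30950.1_2 [02495]"
print(strip_underscores(example))
# prints: >02495|KC156214.1AGF30950.12 [02495]
```

The same cleaning would have to be applied to both the protein and the cDNA files so that the names still match, and removing characters can in principle make two distinct IDs collide, so the result should be checked for duplicates.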

Error at reference generation (I could work around this by splitting on "OG" instead of "_" in lines 326-328 of "OGSet.py", but then I got errors at the final merging step):

```
read2tree --standalone_path marker_genes/ --reference --dna_reference all_cdna_out.fa

--- Load OGs with min 0 species from oma marker_genes - mode = marker_genes ---

Loading files for pre-filter: 100%|███████████| 9/9 [00:00<00:00, 8355.19 OGs/s]
2023-04-24 10:07:05,211 - read2tree.OGSet - INFO - 

--- Load ogs and find their corresponding DNA seq from all_cdna_out.fa ---

2023-04-24 10:07:05,211 - read2tree.OGSet - INFO - Loading all_cdna_out.fa into memory. This might take a while . . . 
Loading OGs:   0%|                                      | 0/9 [00:00<?, ? OGs/s]

Traceback (most recent call last):
  File "/Users/mz/opt/anaconda3/envs/r2t/bin/read2tree", line 16, in <module>
    main(sys.argv[1:], exe_name=exe_name(), desc=desc)
  File "/Users/mz/opt/anaconda3/envs/r2t/lib/python3.10/site-packages/read2tree/main.py", line 289, in main
    ogset = OGSet(args, oma_output=oma_output, progress=progress)  # Generate the OGs with their DNA sequences
  File "/Users/mz/opt/anaconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 79, in __init__
    self.ogs = self._load_ogs()
  File "/Users/mz/opt/anaconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 186, in _load_ogs
    ogs[name].dna = self._get_dna_records(ogs[name].aa,
  File "/Users/mz/opt/anaconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 365, in _get_dna_records
    og_cdna.append(self._get_dna_from_fasta(record, db))
  File "/Users/mz/opt/anaconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 326, in _get_dna_from_fasta
    return self._get_dna_from_REST(record)
  File "/Users/mz/opt/anaconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 282, in _get_dna_from_REST
    seq = oma_record.json()['cdna']
KeyError: 'cdna'
```
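The traceback ends in `oma_record.json()['cdna']`, reached through a fallback from `_get_dna_from_fasta` to `_get_dna_from_REST`. That suggests the record name was not found in the local cDNA file and that the fallback response then had no `cdna` field. A minimal sketch of how such a missing key could be turned into a more readable error is shown below; `extract_cdna` is a hypothetical helper, not part of read2tree, and the shape of the payload is an assumption based only on the traceback.

```python
def extract_cdna(payload: dict, record_id: str) -> str:
    """Return the cDNA string from a decoded JSON payload.

    Hypothetical helper: raises a descriptive LookupError instead of a
    bare KeyError when the payload has no 'cdna' field.
    """
    try:
        return payload["cdna"]
    except KeyError:
        raise LookupError(
            f"No cDNA found for record {record_id!r}; "
            "check that the record name matches the cDNA reference file"
        ) from None


# A payload with the field returns the sequence unchanged.
print(extract_cdna({"cdna": "ATGAAA"}, "example"))
# prints: ATGAAA
```

An error message that names the offending record would point directly at the mismatched name rather than at the REST fallback.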

Original files:

https://ftp.ncbi.nlm.nih.gov/genomes/genbank/viral/Human_immunodeficiency_virus_1/all_assembly_versions/GCA_003202495.1_ASM320249v1/GCA_003202495.1_ASM320249v1_translated_cds.faa.gz

https://ftp.ncbi.nlm.nih.gov/genomes/genbank/viral/Human_immunodeficiency_virus_1/all_assembly_versions/GCA_003202495.1_ASM320249v1/GCA_003202495.1_ASM320249v1_cds_from_genomic.fna.gz

sinamajidian commented 1 year ago

Dear @M-Zeeb

I've just updated the code, which you can download from here. The change is in the cleaning script only, so it doesn't affect the read2tree installation. I tested the new version with the provided assembly and it works. Please make sure that you remove the output from the previous run, and let me know whether it works for you. I'm sorry for the inconvenience.

Regards, Sina

M-Zeeb commented 1 year ago

Dear Sina,

thanks for the quick response! It works now.

Best, Marius