DessimozLab / read2tree

a tool for inferring species tree from sequencing reads
MIT License

cannot get dna from REST #41

Closed GYQ-form closed 10 months ago

GYQ-form commented 10 months ago

Hello and thank you first for your interesting work.

When I followed the step-by-step tutorial for analysing the coronavirus dataset, I ran into a problem at this step:

> read2tree --reference --standalone ../marker_genes/ --dna_reference ../viruses.cdna.fa.gz
--- Load OGs with min 0 species from oma ../marker_genes - mode = marker_genes ---                                                                                                                                   
Loading files for pre-filter: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 1593.05 OGs/s]
2023-09-06 05:00:23,023 - read2tree.OGSet - INFO - --- Load ogs and find their corresponding DNA seq from ../viruses.cdna.fa.gz ---
2023-09-06 05:00:23,024 - read2tree.OGSet - INFO - Loading viruses.cdna.fa.gz into memory. This might take a while . . . 
Loading OGs:   0%|                                                                                                                                                                           | 0/8 [00:07<?, ? OGs/s]
Traceback (most recent call last):                                                                        
  File "/home/gurc/mambaforge/envs/evolution/bin/read2tree", line 16, in <module>                                                                                                                                    
    main(sys.argv[1:], exe_name=exe_name(), desc=desc)
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/main.py", line 289, in main
    ogset = OGSet(args, oma_output=oma_output, progress=progress)  # Generate the OGs with their DNA sequences
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/OGSet.py", line 79, in __init__
    self.ogs = self._load_ogs()
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/OGSet.py", line 186, in _load_ogs
    ogs[name].dna = self._get_dna_records(ogs[name].aa,
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/OGSet.py", line 361, in _get_dna_records
    og_cdna.append(self._get_dna_from_fasta(record, db))
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/OGSet.py", line 322, in _get_dna_from_fasta
    return self._get_dna_from_REST(record)
  File "/home/gurc/mambaforge/envs/evolution/lib/python3.10/site-packages/read2tree/OGSet.py", line 282, in _get_dna_from_REST
    seq = oma_record.json()['cdna']
KeyError: 'cdna'

This is quite confusing, since I followed the tutorial's instructions exactly when exporting the marker genes. I also tried calling the backend API directly in the browser, but got the same 404 response (e.g. for X005000001).

The OMA database is supposed to contain coronavirus species data, since you have already run read2tree on this data. I can't figure out what went wrong.

Any help would be greatly appreciated.

Best regards, Yuqiao Gong

GYQ-form commented 10 months ago

Additionally, the mplog.log file contains:

2023-09-06 08:32:09,473 - read2tree.OGSet - INFO - --- Load ogs and find their corresponding DNA seq from ../viruses.cdna.fa.gz ---
2023-09-06 08:32:09,473 - read2tree.OGSet - INFO - Loading viruses.cdna.fa.gz into memory. This might take a while . . . 
2023-09-06 08:32:09,476 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): omabrowser.org:80
2023-09-06 08:32:11,897 - urllib3.connectionpool - DEBUG - http://omabrowser.org:80 "GET /api/protein/X006400001/ HTTP/1.1" 301 162
2023-09-06 08:32:11,899 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): omabrowser.org:443
2023-09-06 08:32:14,888 - urllib3.connectionpool - DEBUG - https://omabrowser.org:443 "GET /api/protein/X006400001/ HTTP/1.1" 404 None

sinamajidian commented 10 months ago

Hi, thanks for using read2tree. With --dna_reference, read2tree shouldn't need the REST API. Are you sure that every protein record in the marker_genes folder also appears in viruses.cdna.fa?
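
One way to run that check is sketched below. It is not part of read2tree. It assumes the files in the marker_genes folder are plain FASTA, and that a protein and its cDNA share the same identifier as the first whitespace-delimited token of the header line; read2tree's own matching logic may differ. The helper names `fasta_ids` and `missing_dna_ids` are made up for this illustration.

```python
import gzip
from pathlib import Path


def fasta_ids(path):
    """Yield the first whitespace-delimited token of every FASTA header in `path`.

    Assumption: that token is the sequence identifier. Gzipped files are
    detected by their .gz suffix.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def missing_dna_ids(marker_dir, dna_reference):
    """Return {marker file name: sorted IDs that are absent from the DNA reference}."""
    dna_ids = set(fasta_ids(dna_reference))
    missing = {}
    for marker_file in sorted(Path(marker_dir).iterdir()):
        if not marker_file.is_file():
            continue
        absent = sorted(set(fasta_ids(marker_file)) - dna_ids)
        if absent:
            missing[marker_file.name] = absent
    return missing
```

Calling `missing_dna_ids("../marker_genes", "../viruses.cdna.fa.gz")` with the paths from the command above returns an empty dict when, under these assumptions, every marker protein has a matching cDNA entry. Any ID it does list is a candidate for the REST fallback seen in the traceback.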

By the way, for the coronavirus analysis we used a specific instance of OMA: https://corona.omabrowser.org/oma/home/

GYQ-form commented 10 months ago

There is indeed a problem with my viruses.cdna.fa file. The problem has been fixed, thanks a lot~