DessimozLab / read2tree

a tool for inferring species tree from sequencing reads
MIT License

read2tree can't find corresponding CDS for each OMA group #33

Open sci-study opened 1 year ago

sci-study commented 1 year ago

I've selected a subset of 69 OMA groups (chosen because they include sequences from all genomes of interest), built from 22 genomes using the OMA standalone package. I've also made a FASTA file with the corresponding CDS sequences, using the same headers as in the OMA groups. However, I'm running into an issue that I'm finding hard to overcome.

Formatting examples:

(Marker gene)
Protein 1 [Animal 1]
DVAEKCRVL
Protein 1 [Animal 2]
DVAEKCRVL

(Corresponding CDS file)
Protein 1 [Animal 1]
ATCGATCGATCG
Protein 1 [Animal 2]
ATCGATCGATCG

I then start read2tree with the command below (all files and the test_markers folder are in the directory from which I run read2tree):

read2tree --reference --standalone ./test_markers --output_path output_v1 --dna_reference total_orths_cds.fa

I get the error:

--- Load OGs with min 0 species from oma test_markers - mode = marker_genes ---
Loading files for pre-filter: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 69/69 [00:00<00:00, 2053.57 OGs/s]
2023-07-12 15:42:14,120 - read2tree.OGSet - INFO - --- Load ogs and find their corresponding DNA seq from total_orths_cds.fa ---
2023-07-12 15:42:14,121 - read2tree.OGSet - INFO - Loading total_orths_cds.fa into memory. This might take a while . . .
Loading OGs: 0%| | 0/69 [00:00<?, ? OGs/s]
Loading OGs: 0%| | 0/69 [06:01<?, ? OGs/s]
Traceback (most recent call last):
  File "/home/youseuf/miniconda3/envs/read2tree2/bin/read2tree", line 4, in <module>
    __import__('pkg_resources').run_script('read2tree==0.1.4', 'read2tree')
  File "/home/youseuf/miniconda3/envs/read2tree2/lib/python3.8/site-packages/pkg_resources/__init__.py", line 720, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/home/youseuf/miniconda3/envs/read2tree2/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1570, in run_script
    exec(script_code, namespace, namespace)
  File "/home/youseuf/miniconda3/envs/read2tree2/lib/python3.8/site-packages/read2tree-0.1.4-py3.8.egg/EGG-INFO/scripts/read2tree", line 16, in <module>
  File "/home/youseuf/miniconda3/envs/read2tree2/lib/python3.8/site-packages/read2tree-0.1.4-py3.8.egg/read2tree/main.py", line 289, in main
  File "/home/youseuf/miniconda3/envs/read2tree2/lib/python3.8/site-packages/read2tree-0.1.4-py3.8.egg/read2tree/OGSet.py", line 79, in __init__
  File "/home/youseuf/miniconda3/envs/read2tree2/lib/python3.8/site-packages/read2tree-0.1.4-py3.8.egg/read2tree/OGSet.py", line 192, in _load_ogs
  File "/home/youseuf/miniconda3/envs/read2tree2/lib/python3.8/site-packages/read2tree-0.1.4-py3.8.egg/read2tree/OGSet.py", line 337, in _check_dna_aa_length_consistency
  File "/home/youseuf/miniconda3/envs/read2tree2/lib/python3.8/site-packages/read2tree-0.1.4-py3.8.egg/read2tree/OGSet.py", line 337, in
AttributeError: 'NoneType' object has no attribute 'id'

When I look into the mplog.log file, I see:

2023-07-12 15:42:14,120 - read2tree.OGSet - INFO - --- Load ogs and find their corresponding DNA seq from total_orths_cds.fa ---
2023-07-12 15:42:14,121 - read2tree.OGSet - INFO - Loading total_orths_cds.fa into memory. This might take a while . . .
2023-07-12 15:42:14,146 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): omabrowser.org:80
2023-07-12 15:42:14,200 - urllib3.connectionpool - DEBUG - http://omabrowser.org:80 "GET /api/protein/XP/ HTTP/1.1" 301 162
2023-07-12 15:42:14,202 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): omabrowser.org:443
2023-07-12 15:43:14,326 - urllib3.connectionpool - DEBUG - https://omabrowser.org:443 "GET /api/protein/XP/ HTTP/1.1" 504 160
2023-07-12 15:43:14,329 - read2tree.OGSet - DEBUG - DNA not found for XP_046914939.1_OG24421.
2023-07-12 15:43:14,331 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): omabrowser.org:80
2023-07-12 15:43:14,384 - urllib3.connectionpool - DEBUG - http://omabrowser.org:80 "GET /api/protein/XP/ HTTP/1.1" 301 162
2023-07-12 15:43:14,387 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): omabrowser.org:443
2023-07-12 15:44:14,524 - urllib3.connectionpool - DEBUG - https://omabrowser.org:443 "GET /api/protein/XP/ HTTP/1.1" 504 160
2023-07-12 15:44:14,526 - read2tree.OGSet - DEBUG - DNA not found for XP_027206261.1_OG24421.
2023-07-12 15:44:14,529 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): omabrowser.org:80
2023-07-12 15:44:14,583 - urllib3.connectionpool - DEBUG - http://omabrowser.org:80 "GET /api/protein/XP/ HTTP/1.1" 301 162
2023-07-12 15:44:14,586 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): omabrowser.org:443
2023-07-12 15:45:14,724 - urllib3.connectionpool - DEBUG - https://omabrowser.org:443 "GET /api/protein/XP/ HTTP/1.1" 504 160
2023-07-12 15:45:14,727 - read2tree.OGSet - DEBUG - DNA not found for XP_029824739.1_OG24421.
2023-07-12 15:45:14,935 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): omabrowser.org:80
2023-07-12 15:45:14,988 - urllib3.connectionpool - DEBUG - http://omabrowser.org:80 "GET /api/protein/XP/ HTTP/1.1" 301 162
2023-07-12 15:45:14,991 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): omabrowser.org:443
2023-07-12 15:46:15,132 - urllib3.connectionpool - DEBUG - https://omabrowser.org:443 "GET /api/protein/XP/ HTTP/1.1" 504 160
2023-07-12 15:46:15,135 - read2tree.OGSet - DEBUG - DNA not found for XP_054162837.1_OG24421.
2023-07-12 15:46:15,137 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): omabrowser.org:80
2023-07-12 15:46:15,190 - urllib3.connectionpool - DEBUG - http://omabrowser.org:80 "GET /api/protein/XP/ HTTP/1.1" 301 162
2023-07-12 15:46:15,193 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): omabrowser.org:443
2023-07-12 15:47:15,314 - urllib3.connectionpool - DEBUG - https://omabrowser.org:443 "GET /api/protein/XP/ HTTP/1.1" 504 160
2023-07-12 15:47:15,317 - read2tree.OGSet - DEBUG - DNA not found for XP_053212400.1_OG24421.

Any help would be extremely appreciated.
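
The log excerpt shows each lookup falling back to the OMA browser API and requesting `/api/protein/XP/`, while the ID that is then reported as missing is, for example, `XP_046914939.1_OG24421`. That may indicate that only the part of the ID before the first underscore is being used for the query, although the log alone does not prove it. A minimal Python sketch for pulling both pieces of information out of mplog.log could look like this; the function name `summarize_log` and the regular expressions are illustrative assumptions based only on the lines shown above, not part of read2tree.

```python
import re

# Matches the debug line shown in the log above; the trailing "." ends the
# sentence and is not part of the ID. (Assumed pattern, not a read2tree API.)
NOT_FOUND = re.compile(r"DNA not found for (\S+)\.(?=\s|$)")

# Matches the request path that urllib3 logs for the OMA browser API call.
API_PATH = re.compile(r'"GET (/api/protein/[^ ]*) HTTP')


def summarize_log(text):
    """Return (missing_ids, queried_paths) found in the log text.

    missing_ids keeps the order of appearance; queried_paths is the sorted
    list of distinct API paths that were requested.
    """
    missing_ids = NOT_FOUND.findall(text)
    queried_paths = sorted(set(API_PATH.findall(text)))
    return missing_ids, queried_paths


# Hypothetical usage, assuming mplog.log is in the working directory:
#     with open("mplog.log") as handle:
#         ids, paths = summarize_log(handle.read())
#     print(len(ids), "IDs without DNA; API paths queried:", paths)
```

If every missing ID maps to the same truncated API path, the problem is more likely in how the headers are parsed than in the sequences themselves.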

sci-study commented 1 year ago

For additional information, here is an example of a protein sequence within an OMA group and its corresponding CDS (taken from the single file containing all CDS):

CAG2184331.1 unnamed protein product, partial [oppiella_nova_GCA_905397405]
CEKCDGKCVICDSYVRPSTLVRICDECNYGSYQGRCVICGGPGVSDAYYCKECTIQEKDRDGCPKIVNLGSSKTDLFYER
KKYGFKKR

CAG2184331.1 unnamed protein product, partial [oppiella_nova_GCA_905397405]
TGCGAGAAGTGCGACGGGAAGTGCGTTATCTGCGACTCCTATGTCCGGCCCTCGACTTTGGTCCGCATCTGCGATGAGTGCAACTATGGCTCATATCAGGGCCGGTGTGTCATCTGCGGTGGTCCCGGGGTTAGTGACGCCTACTATTGCAAGGAGTGTACGATTCAGGAGAAGGACAGGGATGGCTGTCCCAAGATTGTCAACTTGGGCTCCAGTAAAACGGATCTCTTTTATGAGCGCAAGAAGTATGGCTTCAAAAAGAGGTGA
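
Before handing the files to read2tree, it can help to confirm independently that every protein in a marker file has a CDS with the same ID and that the CDS actually translates to that protein. The sketch below is one way to do that in plain Python. It is not read2tree's own consistency check, and it rests on several assumptions: the real files use `>` header lines (the examples in this thread are shown without them), the ID is the first whitespace-separated token of the header, and the standard genetic code applies. The names `parse_fasta`, `translate` and `check_pairs` are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def parse_fasta(text):
    """Return {first header token: sequence} for FASTA text with '>' headers."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def translate(dna):
    """Translate a CDS; unknown codons become 'X', one trailing stop is dropped."""
    dna = dna.upper()
    protein = "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )
    return protein[:-1] if protein.endswith("*") else protein


def check_pairs(proteins, cds):
    """Return {protein ID: problem description} for every inconsistent pair."""
    problems = {}
    for pid, aa in proteins.items():
        if pid not in cds:
            problems[pid] = "no CDS with this ID"
        elif translate(cds[pid]) != aa:
            problems[pid] = "CDS does not translate to the protein"
    return problems
```

An empty result from `check_pairs` means the IDs and sequences agree under these assumptions, which points the search for the error towards header parsing rather than the sequence data.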

sci-study commented 1 year ago

Apologies for commenting so much on my own post.

It appears the issue was similar to https://github.com/DessimozLab/read2tree/issues/20 where manual deletion of all underscores "_" fixed the issue.

Program is currently running, will update when complete.
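
For anyone who wants to script the underscore removal instead of doing it by hand, a minimal Python sketch follows. The thread does not say exactly where the underscores were deleted, so this assumes they are removed from the header lines only, that headers start with `>`, and that the same transformation is applied to both the marker gene files and the CDS file so the IDs still match. The function name `strip_underscores_from_headers` is illustrative.

```python
def strip_underscores_from_headers(fasta_text):
    """Remove every '_' from FASTA header lines, leaving sequence lines as-is.

    Assumes headers start with '>'. Apply this to the marker gene files and
    to the CDS file alike, otherwise the IDs will no longer correspond.
    """
    cleaned = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = line.replace("_", "")
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


# Hypothetical usage with the CDS file named in the command above:
#     with open("total_orths_cds.fa") as src:
#         cleaned = strip_underscores_from_headers(src.read())
#     with open("total_orths_cds.no_underscores.fa", "w") as dst:
#         dst.write(cleaned)
```

One caveat: deleting characters can make two previously distinct IDs identical, so it is worth checking that the number of unique headers is unchanged afterwards.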