DessimozLab / pyham

MIT License

pyham.Ham throws error with OMA species tree and HOGs #15

Closed Thyra closed 1 year ago

Thyra commented 2 years ago

Hi, I'm trying out pyHam and have run into an issue at the most basic level: when I try to initialise pyHam with the phyloxml species tree and the HOG orthoXML from OMA, it immediately throws an error:

# -*- coding: utf-8 -*-
import pyham

# Initialise pyHam with a phyloxml tree and orthoXML HOGs
phyloxml_path = "speciestree.phyloxml"
orthoxml_path3 = "oma-hogs.orthoXML"

pyham_analysis = pyham.Ham(phyloxml_path, orthoxml_path3, use_internal_name=True, tree_format='phyloxml')

Traceback (most recent call last):
  File "run_hog_queries.py", line 8, in <module>
    pyham_analysis = pyham.Ham(phyloxml_path, orthoxml_path3, use_internal_name=True, tree_format='phyloxml')
  File "/home/***/.local/share/virtualenvs/pyham_test-5MNqgX5x/lib/python3.7/site-packages/pyham/ham.py", line 263, in __init__
    self.top_level_hogs, self.extant_gene_map, self.external_id_mapper = self._build_hogs_and_genes(orthoxml_file, filter_object=self.filter_obj)
  File "/home/***/.local/share/virtualenvs/pyham_test-5MNqgX5x/lib/python3.7/site-packages/pyham/ham.py", line 819, in _build_hogs_and_genes
    parser.feed(line)
  File "/home/***/.local/share/virtualenvs/pyham_test-5MNqgX5x/lib/python3.7/site-packages/pyham/parsers.py", line 94, in start
    self.current_species = self.ham_object._get_extant_genome_by_name(**attrib)
  File "/home/***/.local/share/virtualenvs/pyham_test-5MNqgX5x/lib/python3.7/site-packages/pyham/ham.py", line 881, in _get_extant_genome_by_name
    .format(kwargs["name"]))
TypeError: species name 'Methanococcus maripaludis' maps to an ancestral name, not a leaf of the taxonomy

It does work with the example files (run_hog_queries.py), but I've also tried the previous two OMA releases (August and January 2020) and neither of them worked. Am I missing something here? I'm using pyham version 1.1.10 on Python 3.7.

F4llis commented 2 years ago

Dear Dennis,

Can you give me the link you downloaded the data from? Then I'll be able to reproduce the exact setup you used!

Also can you try:

Clement

Thyra commented 2 years ago

Hey Clement,

I had deleted the files in the meantime and had to set everything up again, so I'm now on pyham version 1.1.11, but the problem persists. Assuming the above code is in analysis.py, this is what I did (essentially, I downloaded the files from https://omabrowser.org/oma/current/):

pipenv install pyham
wget https://omabrowser.org/All/speciestree.phyloxml
wget https://omabrowser.org/All/oma-hogs.orthoXML.gz
gunzip oma-hogs.orthoXML.gz
pipenv run python analysis.py

Changing use_internal_name to False leads to a different but apparently related error:

Traceback (most recent call last):
  File "analysis.py", line 8, in <module>
    pyham_analysis = pyham.Ham(phyloxml_path, orthoxml_path3, use_internal_name=False, tree_format='phyloxml')
  File "/home/psaroudakis/.local/share/virtualenvs/pyham_test-5MNqgX5x/lib/python3.7/site-packages/pyham/ham.py", line 263, in __init__
    self.top_level_hogs, self.extant_gene_map, self.external_id_mapper = self._build_hogs_and_genes(orthoxml_file, filter_object=self.filter_obj)
  File "/home/psaroudakis/.local/share/virtualenvs/pyham_test-5MNqgX5x/lib/python3.7/site-packages/pyham/ham.py", line 819, in _build_hogs_and_genes
    parser.feed(line)
  File "/home/psaroudakis/.local/share/virtualenvs/pyham_test-5MNqgX5x/lib/python3.7/site-packages/pyham/parsers.py", line 94, in start
    self.current_species = self.ham_object._get_extant_genome_by_name(**attrib)
  File "/home/psaroudakis/.local/share/virtualenvs/pyham_test-5MNqgX5x/lib/python3.7/site-packages/pyham/ham.py", line 891, in _get_extant_genome_by_name
    raise KeyError('{} node(s) founded for the species name: {}'.format(len(nodes_founded), kwargs['name']))
KeyError: '0 node(s) founded for the species name: Methanococcus maripaludis'

Using the newick format (wget https://omabrowser.org/All/speciestree.nwk) led to the same errors, both with and without internal names.

Best, Dennis

alpae commented 2 years ago

Hi Dennis

I just had to solve the same problem for someone else. The trick is to use the newick tree with internal node labels together with a special flag, species_resolve_mode="OMA".

pyham_analysis = pyham.Ham(newick_path, orthoxml_path3, tree_format="newick", use_internal_name=True, species_resolve_mode="OMA")

The reason for the problem is that some extant species are labeled with an internal species name (in OMA we actually have 4 "Methanococcus maripaludis" species, which are different strains). Setting species_resolve_mode="OMA" tells pyHam to use the strain that has the default species name.

I will see if I can make something similar for the phyloxml tree. For now, only the newick tree seems to work with this.
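
To see which species names are affected before loading anything, the clash described above can be checked with a small standard-library diagnostic. This is only a sketch and not part of pyHam: the function names are made up for illustration, the Newick tokenizer handles only plain and single-quoted labels, and it assumes the orthoXML file lists its species elements before the groups element. Labels are compared verbatim, so if the tree writes spaces as underscores the names would need normalising first.

```python
import re
import xml.etree.ElementTree as ET

STRUCTURAL = set("(),:;")
# One token per match: a structural character, a quoted label, or an unquoted label.
TOKEN = re.compile(r"\s*([(),:;]|'(?:[^']|'')*'|[^(),:;']+)")


def newick_labels(newick):
    """Return (leaf_labels, internal_labels) found in a Newick string."""
    leaves, internals = set(), set()
    prev = None
    for match in TOKEN.finditer(newick):
        tok = match.group(1).strip()
        if not tok:
            continue
        if tok in STRUCTURAL:
            prev = tok
            continue
        if prev == ":":
            # The token after ':' is a branch length, not a label.
            prev = "length"
            continue
        if tok.startswith("'"):
            # Strip the surrounding quotes and unescape doubled quotes.
            tok = tok[1:-1].replace("''", "'")
        # A label directly after ')' names an internal node; any other is a leaf.
        (internals if prev == ")" else leaves).add(tok)
        prev = "label"
    return leaves, internals


def orthoxml_species_names(source):
    """Collect the name attribute of every species element in an orthoXML file.

    `source` is a path or a binary file object. Parsing stops at the first
    groups element, assuming the species sections come before it.
    """
    names = []
    for _event, elem in ET.iterparse(source, events=("start",)):
        local = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace, if any
        if local == "species":
            names.append(elem.get("name"))
        elif local == "groups":
            break
    return names


def internal_name_clashes(newick, orthoxml_source):
    """Species names from the orthoXML that label internal nodes of the tree."""
    _leaves, internals = newick_labels(newick)
    return sorted(n for n in orthoxml_species_names(orthoxml_source) if n in internals)
```

With the files from this thread, it would be called as internal_name_clashes(open("speciestree.nwk").read(), "oma-hogs.orthoXML"). Any name it returns is one that the orthoXML treats as an extant species but the tree uses for an ancestral node.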

Cheers Adrian

Thyra commented 2 years ago

Hey Adrian,

Great, thank you, I can confirm that it works now :-). Are you planning to document this anywhere? I understand pyHam is intended to work with all kinds of resources, but I'd assume the OMA datasets are one of the most common use cases? (Or do people generally use pyham.Ham(query_database=query, use_data_from='oma') in that case?)

Thanks again, Dennis