DessimozLab / pyham

MIT License

How to get a tree for current OMA release? #5

Closed olgabot closed 2 years ago

olgabot commented 4 years ago

The current OMA release has an OrthoXML file, but no species newick tree:

https://omabrowser.org/oma/current/

So when I try to use a local file:

pyham_analysis = pyham.Ham(query_database='P53_HUMAN', 
                           hog_file='/home/olga/oma-hogs.orthoXML')

This results in this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-23-83caab228914> in <module>
      1 database_to_query = 'oma'
----> 2 pyham_analysis = pyham.Ham(query_database='P53_HUMAN', hog_file=hog_file)
      3 # pyham_analysis = pyham.Ham(query_database='P53_HUMAN', use_data_from='oma')

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/ham.py in __init__(self, tree_file, hog_file, type_hog_file, filter_object, use_internal_name, orthoXML_as_string, tree_format, phyloxml_internal_name_tag, phyloxml_leaf_name_tag, use_data_from, query_database)
    271                 # This is the actual parser to build HOG/Gene and related Genomes.
    272                 with open(self.hog_file, 'r') as orthoxml_file:
--> 273                     self.top_level_hogs, self.extant_gene_map, self.external_id_mapper = self._build_hogs_and_genes(orthoxml_file, filter_object=self.filter_obj)
    274 
    275             logger.info('Parse Orthoxml: {} top level hogs and {} extant genes extract.'.format(len(self.top_level_hogs),len(self.extant_gene_map)))

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/ham.py in _build_hogs_and_genes(self, file_object, filter_object)
    827 
    828         for line in file_object:
--> 829             parser.feed(line)
    830 
    831         return factory.toplevel_hogs, factory.extant_gene_map, factory.external_id_mapper

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/parsers.py in start(self, tag, attrib)
     92 
     93         if tag == "{http://orthoXML.org/2011/}species":
---> 94             self.current_species = self.ham_object._get_extant_genome_by_name(**attrib)
     95 
     96         elif tag == "{http://orthoXML.org/2011/}gene" and self.filterObj is None:

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/ham.py in _get_extant_genome_by_name(self, **kwargs)
    888                 return extant_genome
    889         else:
--> 890             raise KeyError('{} node(s) founded for the species name: {}'.format(len(nodes_founded), kwargs['name']))
    891 
    892     def _get_ancestral_genome_by_taxon(self, tax_node):

KeyError: '0 node(s) founded for the species name: Heimdallarchaeota archaeon (strain B3-JM-08)'

Which species newick tree file do you recommend using?
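
The KeyError above is raised while the parser looks up each OrthoXML `<species>` element by its `name` attribute, so every such name has to be present in the species tree. A quick way to list the names the parser will ask for is sketched below, using only the standard library. The namespace is the one shown in the traceback; the function name and the tiny inline sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Namespace taken from the traceback above (pyham/parsers.py, line 93).
ORTHOXML_NS = "{http://orthoXML.org/2011/}"


def orthoxml_species_names(source):
    """Return the `name` attribute of every <species> element, in file order.

    `source` is a path or a binary file-like object. iterparse streams the
    input, which helps because the full OMA OrthoXML file is large.
    """
    names = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == ORTHOXML_NS + "species":
            names.append(elem.get("name"))
        elem.clear()  # drop content we no longer need to keep memory low
    return names


# Made-up miniature document, only to show the call.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<orthoXML xmlns="http://orthoXML.org/2011/" version="0.3">
  <species name="Homo sapiens"/>
  <species name="Heimdallarchaeota archaeon (strain B3-JM-08)"/>
</orthoXML>
"""

names = orthoxml_species_names(io.BytesIO(SAMPLE))
print(names)
```

Comparing this list with the leaf names of a candidate species tree shows which names are missing before pyHam is run on the full file.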

olgabot commented 4 years ago

Using this tree from ENSEMBL results in this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-26-006f031be6b0> in <module>
      1 pyham_analysis = pyham.Ham(query_database='P53_HUMAN', 
      2                            tree_file=species_tree_file,
----> 3                            hog_file='/home/olga/oma-hogs.orthoXML')
      4 # pyham_analysis = pyham.Ham(query_database='P53_HUMAN', use_data_from='oma')

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/ham.py in __init__(self, tree_file, hog_file, type_hog_file, filter_object, use_internal_name, orthoXML_as_string, tree_format, phyloxml_internal_name_tag, phyloxml_leaf_name_tag, use_data_from, query_database)
    271                 # This is the actual parser to build HOG/Gene and related Genomes.
    272                 with open(self.hog_file, 'r') as orthoxml_file:
--> 273                     self.top_level_hogs, self.extant_gene_map, self.external_id_mapper = self._build_hogs_and_genes(orthoxml_file, filter_object=self.filter_obj)
    274 
    275             logger.info('Parse Orthoxml: {} top level hogs and {} extant genes extract.'.format(len(self.top_level_hogs),len(self.extant_gene_map)))

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/ham.py in _build_hogs_and_genes(self, file_object, filter_object)
    827 
    828         for line in file_object:
--> 829             parser.feed(line)
    830 
    831         return factory.toplevel_hogs, factory.extant_gene_map, factory.external_id_mapper

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/parsers.py in start(self, tag, attrib)
     92 
     93         if tag == "{http://orthoXML.org/2011/}species":
---> 94             self.current_species = self.ham_object._get_extant_genome_by_name(**attrib)
     95 
     96         elif tag == "{http://orthoXML.org/2011/}gene" and self.filterObj is None:

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/ham.py in _get_extant_genome_by_name(self, **kwargs)
    888                 return extant_genome
    889         else:
--> 890             raise KeyError('{} node(s) founded for the species name: {}'.format(len(nodes_founded), kwargs['name']))
    891 
    892     def _get_ancestral_genome_by_taxon(self, tax_node):

KeyError: '0 node(s) founded for the species name: Heimdallarchaeota archaeon (strain B3-JM-08)'

Let me know if there is anything else I can provide! Thank you. Warmest, Olga

alpae commented 4 years ago

Dear Olga, you can also find the species tree we used to infer the HOGs on the download page of the current OMA release, just under the full hierarchical orthologous group OrthoXML file: https://omabrowser.org/oma/current/

Best wishes Adrian

olgabot commented 4 years ago

Hi Adrian, Thank you so much! Is that the field named "Species phylogeny of HOGs"? It may be helpful to add "(species tree)" for those who are going off the documentation and searching the page for a "tree". Thanks again! Warmest, Olga


olgabot commented 4 years ago

Hi Adrian, Thanks again for your help! I'm still having issues even with the new file. Here is the command:

pyham_analysis = pyham.Ham(query_database='P53_HUMAN', 
                           tree_file=species_tree_file,
                           tree_format='newick', 
#                            quoted_node_names=True,
                           hog_file='/home/olga/oma-hogs.orthoXML')

And here is the traceback:

---------------------------------------------------------------------------
NewickError                               Traceback (most recent call last)
<ipython-input-37-aac169477ea0> in <module>
      3                            tree_format='newick',
      4 #                            quoted_node_names=True,
----> 5                            hog_file='/home/olga/oma-hogs.orthoXML')
      6 # pyham_analysis = pyham.Ham(query_database='P53_HUMAN', use_data_from='oma')

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/ham.py in __init__(self, tree_file, hog_file, type_hog_file, filter_object, use_internal_name, orthoXML_as_string, tree_format, phyloxml_internal_name_tag, phyloxml_leaf_name_tag, use_data_from, query_database)
    237             raise TypeError("{} is an invalid type phyloxml tag name")
    238 
--> 239         self.taxonomy = tax.Taxonomy(self.tree_file, tree_format=tree_format, use_internal_name=use_internal_name, phyloxml_leaf_name_tag=phyloxml_leaf_name_tag, phyloxml_internal_name_tag=phyloxml_internal_name_tag)
    240         logger.info('Build taxonomy: completed.')
    241 

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/taxonomy.py in __init__(self, tree_file, tree_format, use_internal_name, phyloxml_leaf_name_tag, phyloxml_internal_name_tag)
     44 
     45         # create tree
---> 46         self.tree = self._build_tree(tree_file, tree_format)
     47 
     48         # create internal node name if required

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/taxonomy.py in _build_tree(self, tree_file, tree_format)
    249             with open(tree_file, 'r') as nwk_file:
    250                 self.tree_str = nwk_file.read()
--> 251             return ete3.Tree(self.tree_str, format=1)
    252 
    253         elif tree_format == 'phyloxml':

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/ete3/coretype/tree.py in __init__(self, newick, format, dist, support, name, quoted_node_names)
    209             self._dist = 0.0
    210             read_newick(newick, root_node = self, format=format,
--> 211                         quoted_names=quoted_node_names)
    212 
    213 

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/ete3/parser/newick.py in read_newick(newick, root_node, format, quoted_names)
    249             raise NewickError('Unexisting tree file or Malformed newick tree structure.')
    250         else:
--> 251             return _read_newick_from_string(nw, root_node, matcher, format, quoted_names)
    252 
    253     else:

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/ete3/parser/newick.py in _read_newick_from_string(nw, root_node, matcher, formatcode, quoted_names)
    307         #[leaf, leaf, ')))', leaf), leaf, 'leaf);']
    308         if subchunks[-1] != '' and not subchunks[-1].endswith(';'):
--> 309             raise NewickError('Broken newick structure at: %s' %chunk)
    310 
    311         # lets process the subchunks. Every closing parenthesis will close a

NewickError: Broken newick structure at: "Heimdallarchaeota archaeon 
You may want to check other newick loading flags like 'format' or 'quoted_node_names'.

I've tried both tree_format='newick_string' and tree_format='newick', and there doesn't seem to be an option for quoted_node_names here. What do you suggest?

Thank you! Warmest, Olga
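
The fragment in the error message starts with a double quote and stops right before the parenthesis in `(strain B3-JM-08)`. That suggests the tree stores this name as a quoted label and that the parser, reading it unquoted, treats the parenthesis as tree structure. A small standard-library sketch to check a Newick string for such labels follows; the function name and the sample tree are made up for illustration, and escaped quotes inside labels are not handled.

```python
import re

# A quoted Newick label: "..." or '...'. Simplification: no escaped quotes.
QUOTED_LABEL = re.compile(r"""(["'])(.*?)\1""")

# Characters that carry structural meaning when a label is read unquoted.
STRUCTURAL = set("(),:;")


def risky_quoted_labels(newick):
    """Return quoted labels containing characters that break unquoted parsing."""
    risky = []
    for match in QUOTED_LABEL.finditer(newick):
        label = match.group(2)
        if STRUCTURAL & set(label):
            risky.append(label)
    return risky


# Made-up two-leaf tree, only to show the call.
sample = '("Homo sapiens","Heimdallarchaeota archaeon (strain B3-JM-08)")root;'
print(risky_quoted_labels(sample))
```

If this returns any labels, the tree needs a quote-aware reader. The traceback shows that `ete3.Tree` accepts a `quoted_node_names` argument, but pyham 1.1.7 calls `ete3.Tree(self.tree_str, format=1)` without passing it.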

olgabot commented 4 years ago

Additionally, the PhyloXML formatted species tree doesn't work here either:

! wget https://omabrowser.org/All/speciestree.phyloxml
species_tree_phyloxml = 'speciestree.phyloxml'

pyham_analysis = pyham.Ham(query_database='P53_HUMAN', 
                           tree_file=species_tree_phyloxml,
#                            tree_file=species_tree_file,
                           tree_format='phyloxml', 
#                            quoted_node_names=True,
                           hog_file='/home/olga/oma-hogs.orthoXML')

Here is the traceback:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-43-f4aa9e502496> in <module>
      4                            tree_format='phyloxml',
      5 #                            quoted_node_names=True,
----> 6                            hog_file='/home/olga/oma-hogs.orthoXML')
      7 # pyham_analysis = pyham.Ham(query_database='P53_HUMAN', use_data_from='oma')

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/ham.py in __init__(self, tree_file, hog_file, type_hog_file, filter_object, use_internal_name, orthoXML_as_string, tree_format, phyloxml_internal_name_tag, phyloxml_leaf_name_tag, use_data_from, query_database)
    237             raise TypeError("{} is an invalid type phyloxml tag name")
    238 
--> 239         self.taxonomy = tax.Taxonomy(self.tree_file, tree_format=tree_format, use_internal_name=use_internal_name, phyloxml_leaf_name_tag=phyloxml_leaf_name_tag, phyloxml_internal_name_tag=phyloxml_internal_name_tag)
    240         logger.info('Build taxonomy: completed.')
    241 

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/taxonomy.py in __init__(self, tree_file, tree_format, use_internal_name, phyloxml_leaf_name_tag, phyloxml_internal_name_tag)
     44 
     45         # create tree
---> 46         self.tree = self._build_tree(tree_file, tree_format)
     47 
     48         # create internal node name if required

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/taxonomy.py in _build_tree(self, tree_file, tree_format)
    263                 # assign name to extant species
    264                 if node.is_leaf():
--> 265                     node.name = self._get_name_phyloxml(node, self.phyloxml_leaf_name_tag)
    266 
    267                 # assign name to ancestral species

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/taxonomy.py in _get_name_phyloxml(self, node, phyloxml_species_name_tag)
    225 
    226         elif phyloxml_species_name_tag == 'taxonomy_scientific_name':
--> 227             if node.phyloxml_clade.taxonomy[0].scientific_name != '':
    228                 return node.phyloxml_clade.taxonomy[0].scientific_name
    229             else:

IndexError: list index out of range

Can you help me understand what I'm doing wrong here? My PyHam version is 1.1.7.

Thank you! Warmest, Olga
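
The IndexError comes from `node.phyloxml_clade.taxonomy[0]`, so at least one leaf clade appears to have no `<taxonomy>` element for the `taxonomy_scientific_name` lookup to read. A rough standard-library check for such leaves is sketched below. It ignores XML namespaces and assumes the usual phyloXML layout of nested `<clade>` elements; the function names and the sample document are hypothetical and do not reflect the contents of the actual OMA file.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def leaves_without_scientific_name(root):
    """Return the <name> text (or None) of leaf clades that lack
    a non-empty taxonomy/scientific_name."""
    missing = []
    for clade in root.iter():
        if local(clade.tag) != "clade":
            continue
        children = {local(child.tag): child for child in clade}
        if "clade" in children:  # internal node, not a leaf
            continue
        taxonomy = children.get("taxonomy")
        sci = None
        if taxonomy is not None:
            sci = next(
                (c.text for c in taxonomy if local(c.tag) == "scientific_name"),
                None,
            )
        if not sci:
            name = children.get("name")
            missing.append(name.text if name is not None else None)
    return missing


# Made-up miniature tree, only to show the call.
SAMPLE = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <name>LUCA</name>
      <clade>
        <name>Homo sapiens</name>
        <taxonomy><scientific_name>Homo sapiens</scientific_name></taxonomy>
      </clade>
      <clade>
        <name>Heimdallarchaeota archaeon (strain B3-JM-08)</name>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>"""

print(leaves_without_scientific_name(ET.fromstring(SAMPLE)))
```

Any leaf reported here would trigger the same `list index out of range` when the scientific name is used as the leaf name tag.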

F4llis commented 4 years ago

Hello Olga,

  1. If you want to use the HOGs directly from the public database (using query_database), you don't need to provide a species tree or anything else; just do
my_gene_query = 'P53_RAT'
pyham_analysis = pyham.Ham(query_database=my_gene_query, use_data_from='oma')

It will automatically load the required files and settings!

  2. Regarding the Newick error from the ete3 library, there is no such option in the pyHam library. The problem seems to be in the species tree file. Can you please send it to me so I can check whether there is a problem in it?

  3. For the phyloxml format, please read this https://zoo.cs.ucl.ac.uk/doc/pyham/index.html#how-to-use-pyham-on-my-dataset and make sure the requirements for phyloxml are met!

If you have any other questions, please ask. Clement

olgabot commented 4 years ago

Hi Clement, Thank you so much for your response! I've responded inline below.

  1. If you want to use the HOGs directly from the public database (using query_database), you don't need to provide a species tree or anything else; just do
my_gene_query = 'P53_RAT'
pyham_analysis = pyham.Ham(query_database=my_gene_query, use_data_from='oma')

It will automatically load the required files and settings!

This is great! I wanted to use local files because I wanted to minimize latency for querying each item. I am trying to get all HOGs for three species (mouse, mouse lemur, and human - https://github.com/DessimozLab/pyham/issues/4) and the only way I can think of doing it from the documentation is to query each gene one at a time. Doing this over the web will take a long time so if I can download the files ahead of time, it will at least save the latency, and then I don't need to add a time.sleep(1) to not get rate-limited by the API. If you have a better way of doing this, I am open to suggestions!
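
For reference, the one-gene-at-a-time approach with a pause between requests could look roughly like this. It is only a sketch: `fetch` stands in for whatever performs a single query (for example, a function that builds a `pyham.Ham` object for one gene), and the helper name and default delay are assumptions rather than anything pyHam provides.

```python
import time


def query_with_pause(fetch, gene_ids, pause_seconds=1.0):
    """Call fetch(gene_id) for each id, sleeping between calls.

    Failures are collected instead of aborting the whole run, so one
    failed request does not lose the results gathered so far.
    """
    results, errors = {}, {}
    for index, gene_id in enumerate(gene_ids):
        if index > 0:
            time.sleep(pause_seconds)  # crude rate limiting between requests
        try:
            results[gene_id] = fetch(gene_id)
        except Exception as exc:  # broad on purpose: this is a batch driver
            errors[gene_id] = exc
    return results, errors


# Stand-in fetch function so the sketch runs without network access.
def fake_fetch(gene_id):
    if gene_id == "BAD_ID":
        raise ValueError("unknown gene")
    return gene_id.lower()


results, errors = query_with_pause(
    fake_fetch, ["P53_HUMAN", "BAD_ID", "P53_RAT"], pause_seconds=0
)
print(results, list(errors))
```

With local OrthoXML and tree files the pause becomes unnecessary, which is the motivation for getting the local setup to work.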

  2. Regarding the Newick error from the ete3 library, there is no such option in the pyHam library. The problem seems to be in the species tree file. Can you please send it to me so I can check whether there is a problem in it?

This is the species tree I was using: https://omabrowser.org/All/speciestree.nwk

  3. For the phyloxml format, please read this https://zoo.cs.ucl.ac.uk/doc/pyham/index.html#how-to-use-pyham-on-my-dataset and make sure the requirements for phyloxml are met!

This is the phyloxml tree I was using: https://omabrowser.org/All/speciestree.phyloxml

Thank you so much!

olgabot commented 4 years ago

Hi Clement, Thanks again for your help. Using the suggested code you provided above:

my_gene_query = 'P53_RAT'
pyham_analysis = pyham.Ham(query_database=my_gene_query, use_data_from='oma')

Gets me the following 500 error:

---------------------------------------------------------------------------
ErrorMessage                              Traceback (most recent call last)
<ipython-input-46-0ce71051f92a> in <module>
----> 1 pyham_analysis = pyham.Ham(query_database=my_gene_query, use_data_from='oma')

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/pyham/ham.py in __init__(self, tree_file, hog_file, type_hog_file, filter_object, use_internal_name, orthoXML_as_string, tree_format, phyloxml_internal_name_tag, phyloxml_leaf_name_tag, use_data_from, query_database)
    184                 }
    185 
--> 186                 open_tax = client.action(schema, action_phy, params=params_phy)
    187 
    188                 self.tree_file = 'taxonomy_from_oma_open_at_{}.phyloxml'.format(top_level)

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/coreapi/client.py in action(self, document, keys, params, validate, overrides, action, encoding, transform)
    176         # Perform the action, and return a new document.
    177         transport = determine_transport(self.transports, link.url)
--> 178         return transport.transition(link, self.decoders, params=params, link_ancestors=link_ancestors)

~/miniconda3/envs/tabula-microcebus-v2/lib/python3.7/site-packages/coreapi/transports/http.py in transition(self, link, decoders, params, link_ancestors, force_codec)
    384 
    385         if isinstance(result, Error):
--> 386             raise exceptions.ErrorMessage(result)
    387 
    388         return result

ErrorMessage: <Error: 500 Internal Server Error>
    message: "<h1>Server Error (500)</h1>"

Is the server down?

ethanbass commented 4 years ago

Hi Olga, I'm not sure if you're still following, but I added the quoted strings option in a fork (https://github.com/ethanbass/pyham). I still can't get the import to work, however, since I'm running into more problems downstream of importing the tree.

alpae commented 3 years ago

@olgabot @ethanbass could you check whether the situation has changed for you? I've rewritten the part of the package that fetches data from OMA.

alpae commented 2 years ago

Closing this issue, as it seems to be solved.