Knowledge-Graph-Hub / kg-idg

A Knowledge Graph to Illuminate the Druggable Genome
https://knowledge-graph-hub.github.io/kg-idg/
BSD 3-Clause "New" or "Revised" License

uPheno transform yields empty file on Docker but not locally #65

Open caufieldjh opened 2 years ago

caufieldjh commented 2 years ago

Describe the bug

As of build 25 on master, the merge step fails:

...
11:26:23  [KGX][cli_utils.py][        parse_source] INFO: Processing source 'tcrd-protein'
11:26:23  [KGX][cli_utils.py][        parse_source] INFO: Processing source 'string'
11:32:29  [KGX][cli_utils.py][        parse_source] INFO: Processing source 'upheno2'
11:32:56  multiprocessing.pool.RemoteTraceback: 
11:32:56  """
11:32:56  Traceback (most recent call last):
11:32:56    File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
11:32:56      result = (True, func(*args, **kwds))
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 806, in parse_source
11:32:56      transformer.transform(input_args)
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/transformer.py", line 275, in transform
11:32:56      self.process(source_generator, sink)
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/transformer.py", line 315, in process
11:32:56      for rec in source:
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/source/tsv_source.py", line 165, in parse
11:32:56      file_iter = pd.read_csv(
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
11:32:56      return func(*args, **kwargs)
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 586, in read_csv
11:32:56      return _read(filepath_or_buffer, kwds)
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 482, in _read
11:32:56      parser = TextFileReader(filepath_or_buffer, **kwds)
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 811, in __init__
11:32:56      self._engine = self._make_engine(self.engine)
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1040, in _make_engine
11:32:56      return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
11:32:56      self._reader = parsers.TextReader(self.handles.handle, **kwds)
11:32:56    File "pandas/_libs/parsers.pyx", line 549, in pandas._libs.parsers.TextReader.__cinit__
11:32:56  pandas.errors.EmptyDataError: No columns to parse from file
11:32:56  """
11:32:56  
11:32:56  The above exception was the direct cause of the following exception:
11:32:56  
11:32:56  Traceback (most recent call last):
11:32:56    File "run.py", line 167, in <module>
11:32:56      cli()
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
11:32:56      return self.main(*args, **kwargs)
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1053, in main
11:32:56      rv = self.invoke(ctx)
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
11:32:56      return _process_result(sub_ctx.command.invoke(sub_ctx))
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
11:32:56      return ctx.invoke(self.callback, **ctx.params)
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/click/core.py", line 754, in invoke
11:32:56      return __callback(*args, **kwargs)
11:32:56    File "run.py", line 86, in merge
11:32:56      load_and_merge(yaml, processes)
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/kg_idg/merge_utils/merge_kg.py", line 36, in load_and_merge
11:32:56      merged_graph = merge(yaml_file, processes=processes)
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 681, in merge
11:32:56      stores = [r.get() for r in results]
11:32:56    File "/var/lib/jenkins/workspace/nowledge-graph-hub_kg-idg_master/gitrepo/venv/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 681, in <listcomp>
11:32:56      stores = [r.get() for r in results]
11:32:56    File "/usr/lib/python3.8/multiprocessing/pool.py", line 771, in get
11:32:56      raise self._value
11:32:56  pandas.errors.EmptyDataError: No columns to parse from file

As the error says, the uPheno TSV (nodes, edges, or both) is empty, so it doesn't load.

To Reproduce

python run.py download
python run.py transform
# or for upheno alone:
# python3 run.py transform -s UPhenoTransform
python run.py merge

or see build 25

Expected behavior

The input for the merge, as defined in merge.yaml, should not be empty.
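A guard along these lines could run between the transform and merge steps so that an unusable input is reported by name, rather than surfacing as an `EmptyDataError` from inside a worker process. This is only a sketch: `has_header` and `unparseable_inputs` are illustrative names, not part of kg-idg or KGX, and the check (at least one non-blank line) is an assumption about what counts as a usable TSV.

```python
from pathlib import Path
from typing import Iterable, List


def has_header(path: Path) -> bool:
    """Return True if the file exists and contains at least one non-blank line."""
    if not path.is_file():
        return False
    with path.open("r", encoding="utf-8", errors="replace") as handle:
        # Stops at the first line that has any non-whitespace content.
        return any(line.strip() for line in handle)


def unparseable_inputs(paths: Iterable[str]) -> List[str]:
    """List the inputs that are missing, empty, or made up of blank lines only."""
    return [p for p in paths if not has_header(Path(p))]
```

Calling `unparseable_inputs` with the paths listed under `filename:` in merge.yaml and stopping the run if the returned list is non-empty would make the failing source explicit in the build log.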

Version

d21ccb9dd46af0acd0773674fad1e3a1e71bb8c8

caufieldjh commented 2 years ago

Just ran a fresh transform on uPheno2 here, and the results are non-empty:

~/kg-idg/data/transformed/upheno2$ ls -lh
total 137M
-rw-r--r-- 1 harry harry 72M Dec 16 13:29 upheno2_edges.tsv
-rw-r--r-- 1 harry harry 66M Dec 16 13:29 upheno2_nodes.tsv
$ head upheno2_edges.tsv 
id      subject predicate       object  category        relation        knowledge_source        logical_interpretation
HP:3000070-biolink:subclass_of-UPHENO:0075973   HP:3000070      biolink:subclass_of     UPHENO:0075973          rdfs:subClassOf upheno_all_with_relations.owl
urn:uuid:8ccea3e4-0906-4eb0-bfec-231e4c82244a   HP:3000070      biolink:affects UBERON:0008595  biolink:Association     UPHENO:0000001  upheno_all_with_relations.owl   owlstar:AllSomeInterpretation
HP:3000070-biolink:related_to-UBERON:0008595    HP:3000070      biolink:related_to      UBERON:0008595          UPHENO:0000003  upheno_all_with_relations.owl
HP:3000070-biolink:subclass_of-HP:0430019       HP:3000070      biolink:subclass_of     HP:0430019              rdfs:subClassOf upheno_all_with_relations.owl
UPHENO:0075973-biolink:subclass_of-UPHENO:0002908       UPHENO:0075973  biolink:subclass_of     UPHENO:0002908          rdfs:subClassOf upheno_all_with_relations.owl
ZP:0000059-biolink:subclass_of-ZP:0005188       ZP:0000059      biolink:subclass_of     ZP:0005188              rdfs:subClassOf upheno_all_with_relations.owl
ZP:0000059-biolink:related_to-OBO:ZFA_0001249   ZP:0000059      biolink:related_to      OBO:ZFA_0001249         UPHENO:0000003  upheno_all_with_relations.owl
ZP:0005188-biolink:subclass_of-UPHENO:0081752   ZP:0005188      biolink:subclass_of     UPHENO:0081752          rdfs:subClassOf upheno_all_with_relations.owl
ZP:0005188-biolink:subclass_of-ZP:0138517       ZP:0005188      biolink:subclass_of     ZP:0138517              rdfs:subClassOf upheno_all_with_relations.owl
$ head upheno2_nodes.tsv 
id      category        name    description     xref    provided_by     synonym 0000233 0000424 0000425 0000426 0000589 0100001 :http://geneontology.org/formats/oboInOwl#created_by     :http://purl.obolibrary.org/obo/chebi/charge    :http://purl.obolibrary.org/obo/chebi/formula   :http://purl.obolibrary.org/obo/chebi/inchi     :http://purl.obolibrary.org/obo/chebi/inchikey   :http://purl.obolibrary.org/obo/chebi/mass      :http://purl.obolibrary.org/obo/chebi/monoisotopicmass  :http://purl.obolibrary.org/obo/chebi/smiles     :http://purl.org/spar/cito/citesAsAuthority     :http://www.w3.org/2004/02/skos/core#closeMatch :ttp://www.geneontology.org/formats/oboInOwl#created_by  BFO_CLIF_specification_label    BFO_OWL_specification_label     Date    alternative_term        author  cl#created_by   comment consider        contributor     created_by       creation_date   creator curator_note    definition_source       depicted_by     deprecated      editor_note     editor_preferred_term   elucidation     example_of_usage fypo#usage      has_alternative_id      has_associated_axiom(fol)       has_associated_axiom(nl)        has_db_xref     has_o_b_o_namespace     hsapdv#end_dpf   hsapdv#end_mpb  hsapdv#end_ypb  hsapdv#start_dpf        hsapdv#start_mpb        hsapdv#start_ypb        imported_from   in_subset       is_a_defining_property_chain_axiom       is_class_level  is_cyclic       is_defined_by   is_metadata_tag is_transitive   logical_macro_assertion_on_an_annotation_property       m_g_p_o_0002032 m_g_p_o_0002037  m_g_p_o_0002098 note    page    present_in_taxon        see_also        shorthand       source  term_editor     type    u_b_p_r_o_p_0000001     u_b_p_r_o_p_0000002      u_b_p_r_o_p_0000003     u_b_p_r_o_p_0000004     u_b_p_r_o_p_0000005     u_b_p_r_o_p_0000006     u_b_p_r_o_p_0000007     u_b_p_r_o_p_0000008     u_b_p_r_o_p_0000009      u_b_p_r_o_p_0000010     u_b_p_r_o_p_0000011     u_b_p_r_o_p_0000012     u_b_p_r_o_p_0000013     u_b_p_r_o_p_0000014     
u_b_p_r_o_p_0000015     u_b_p_r_o_p_0000103      u_b_p_r_o_p_0000104     u_b_p_r_o_p_0000105     u_b_p_r_o_p_0000106     u_b_p_r_o_p_0000107     u_b_p_r_o_p_0000108     u_b_p_r_o_p_0000111     u_b_p_r_o_p_0000112
HP:3000070      biolink:NamedThing      Abnormality of levator anguli oris (HPO)        An abnormality of a levator anguli oris.                upheno_all_with_relations.owl                                                                                                                                                                    vasilevs 2015-08-07T03:38:48Z                                                                                                            UMLS:C4073277   human_phenotype owl:Class
UPHENO:0075973  biolink:NamedThing      abnormal levator anguli oris    Abnormality of levator anguli oris.             upheno_all_with_relations.owl   abnormality of levator anguli oris                                                                                                                                                       owl:Class
ZP:0000059      biolink:NamedThing      exocrine pancreas mislocalised anteriorly, abnormal (ZPO)       Abnormal(ly) mislocalised anteriorly (of) exocrine pancreas.    upheno_all_with_relations.owl                                                                                                                                            owl:Class
ZP:0005188      biolink:NamedThing      exocrine pancreas mislocalised, abnormal (ZPO)  Abnormal(ly) mislocalised (of) exocrine pancreas.               upheno_all_with_relations.owl                                                                                                                                                            owl:Class
FYPO:0004937    biolink:NamedThing      decreased RNA level during meiosis I (FYPO)     A cell phenotype in which the amount of RNA measured in a cell is lower than normal during the first meiotic nuclear division. Total RNA or a specific RNA may be affected.              upheno_all_with_relations.owl   decreased RNA accumulation during meiosis I|decreased transcript level during meiosis I|decreased RNA level during first meiotic nuclear division|reduced RNA level during meiosis I                     Consider annotating to a term describing abnormal transcription or abnormal regulation of transcription, but note that changes in RNA levels may result from changes in RNA stability as well as changes in transcription. We recommend noting which RNA(s) were used in the assay when annotating to this term.                  midori  2015-10-14T13:19:59Z                                                                                                                     fission_yeast_phenotype                 owl:Class
FYPO:0002959    biolink:NamedThing      decreased RNA level during meiosis (FYPO)       A cell phenotype in which the amount of RNA measured in a cell is lower than normal during one or both meiotic nuclear divisions. Total RNA or a specific RNA may be affected.           upheno_all_with_relations.owl   decreased RNA accumulation during meiosis|reduced RNA level during meiosis|decreased RNA level during meiotic nuclear division|decreased transcript level during meiosis                                 Consider annotating to a term describing abnormal transcription or abnormal regulation of transcription, but note that changes in RNA levels may result from changes in RNA stability as well as changes in transcription. We recommend noting which RNA(s) were used in the assay when annotating to this term.                  midori  2013-12-04T15:55:26Z                                                                                                                     fission_yeast_phenotype                 owl:Class
MP:0020311      biolink:NamedThing      decreased hydroxymethylbilane synthase activity (MPO)   reduced ability of to catalyze the reaction: H(2)O + 4 porphobilinogen = hydroxymethylbilane + 4 NH(4)(+).               upheno_all_with_relations.owl   decreased uroporphyrinogen synthetase activity|decreased (4-[2-carboxyethyl]-3-[carboxymethyl]pyrrol-2-yl)methyltransferase (hydrolysing)|decreased pre-uroporphyrinogen synthase activity|decreased HMB synthase activity|decreased uroporphyrinogen I synthase activity|decreased uroporphyrinogen synthase activity|decreased porphobilinogen:(4-[2-carboxyethyl]-3-[carboxymethyl]pyrrol-2-yl)methyltransferase (hydrolysing)|decreased porphobilinogen deaminase activity|decreased (4-(2-carboxyethyl)-3-(carboxymethyl)pyrrol-2-yl)methyltransferase (hydrolyzing) activity|decreased uroporphyrinogen I synthetase activity|decreased HMB-synthase activity|decreased porphobilinogen ammonia-lyase (polymerizing)                                                                  owl:Class
MP:0020310      biolink:NamedThing      abnormal hydroxymethylbilane synthase activity (MPO)    altered ability of to catalyze the reaction: H(2)O + 4 porphobilinogen = hydroxymethylbilane + 4 NH(4)(+).               upheno_all_with_relations.owl   abnormal pre-uroporphyrinogen synthase activity|abnormal (4-(2-carboxyethyl)-3-(carboxymethyl)pyrrol-2-yl)methyltransferase (hydrolyzing) activity|abnormal porphobilinogen deaminase activity|abnormal (4-[2-carboxyethyl]-3-[carboxymethyl]pyrrol-2-yl)methyltransferase (hydrolysing)|abnormal uroporphyrinogen synthase activity|abnormal HMB-synthase activity|abnormal porphobilinogen ammonia-lyase (polymerizing)|abnormal uroporphyrinogen I synthetase activity|abnormal uroporphyrinogen synthetase activity|abnormal porphobilinogen:(4-[2-carboxyethyl]-3-[carboxymethyl]pyrrol-2-yl)methyltransferase (hydrolysing)|abnormal HMB synthase activity|abnormal uroporphyrinogen I synthase activity                                                                              owl:Class
ZP:0018110      biolink:NamedThing      cellular amino acid metabolic process disrupted, abnormal (ZPO) Abnormal(ly) disrupted (of) cellular amino acid metabolic process.               upheno_all_with_relations.owl                                                                                                                           owl:Class

caufieldjh commented 2 years ago

A minimal merge with the following config:

---
configuration:
  output_directory: data/merged
  checkpoint: false
merged_graph:
  name: IDG graph
  source:
    upheno2:
      name: "uPheno2"
      input:
        format: tsv
        filename:
          - data/transformed/upheno2/upheno2_edges.tsv
          - data/transformed/upheno2/upheno2_nodes.tsv
  operations:
    - name: kgx.graph_operations.summarize_graph.generate_graph_stats
      args:
        graph_name: IDG Graph
        filename: merged_graph_stats.yaml
        node_facet_properties:
          - provided_by
        edge_facet_properties:
          - provided_by
  destination:
    merged-kg-tsv:
      format: tsv
      compression: tar.gz
      filename: merged-kg-test-upheno

does not produce an empty file:

$ head merged-kg-test-upheno_edges.tsv 
id      subject predicate       object  category        relation        knowledge_source        logical_interpretation
XPO:0128846-biolink:subclass_of-XPO:0128007     XPO:0128846     biolink:subclass_of     XPO:0128007             rdfs:subClassOf Graph
XPO:0128846-biolink:subclass_of-XPO:0128037     XPO:0128846     biolink:subclass_of     XPO:0128037             rdfs:subClassOf Graph
XPO:0128846-biolink:subclass_of-XPO:0102052     XPO:0128846     biolink:subclass_of     XPO:0102052             rdfs:subClassOf Graph
XPO:0128846-biolink:related_to-OBO:XAO_0000277  XPO:0128846     biolink:related_to      OBO:XAO_0000277         UPHENO:0000003  Graph
XPO:0128007-biolink:subclass_of-UPHENO:0068971  XPO:0128007     biolink:subclass_of     UPHENO:0068971          rdfs:subClassOf Graph
XPO:0128007-biolink:subclass_of-UPHENO:0012541  XPO:0128007     biolink:subclass_of     UPHENO:0012541          rdfs:subClassOf Graph
XPO:0128007-biolink:subclass_of-XPO:0101638     XPO:0128007     biolink:subclass_of     XPO:0101638             rdfs:subClassOf Graph
XPO:0128007-biolink:related_to-OBO:XAO_0003000  XPO:0128007     biolink:related_to      OBO:XAO_0003000         UPHENO:0000003  Graph
XPO:0128037-biolink:subclass_of-UPHENO:0068971  XPO:0128037     biolink:subclass_of     UPHENO:0068971          rdfs:subClassOf Graph
...
$ wc -l merged-kg-test-upheno_*.tsv 
   506778 merged-kg-test-upheno_edges.tsv
   185445 merged-kg-test-upheno_nodes.tsv
   692223 total

caufieldjh commented 2 years ago

Running the identical process locally, i.e., the following:

python run.py download
python run.py transform
python run.py merge

does not raise any pandas errors, and the resulting graph appears to contain the uPheno2 contents as expected.

This issue appears to be completely specific to the Jenkins build.

caufieldjh commented 2 years ago

A minimal config that downloads, transforms, and merges uPheno only (3df4b918dea20a9af8495e38bd404992cd10294a) also breaks on Jenkins.

caufieldjh commented 2 years ago

This error also happens with a minimal run (again, with 3df4b91) in the Docker image alone (caufieldjh/ubuntu20-python-3-8-5-dev:4-with-dbs-v6), i.e., when the container is run locally rather than in Jenkins. So something isn't happening properly in the Docker environment.

caufieldjh commented 2 years ago

The issue isn't with the merge, it's with the transform: in Docker, the transform for upheno2 produces node and edge files that differ from those produced when the transform is run locally (i.e., not in the container). Locally:

$ ls -l data/transformed/upheno2/
total 139848
-rw-r--r-- 1 harry harry 74755108 Dec 22 15:57 upheno2_edges.tsv
-rw-r--r-- 1 harry harry 68447459 Dec 22 15:57 upheno2_nodes.tsv

In the container:

# ls -l data/transformed/upheno2/
total 73472
-rw-r--r-- 1 root root 75043894 Dec 23 17:56 upheno2_edges.tsv
-rw-r--r-- 1 root root   185447 Dec 23 17:56 upheno2_nodes.tsv

The major issue is that the nodes file consists entirely of newlines. So is this something KGX is doing differently with this file in this environment?

# python3
Python 3.8.5 (default, May 27 2021, 13:30:53)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from kgx.cli.cli_utils import transform
RDFLib Version: 5.0.0
>>> transform(inputs=["upheno_all_with_relations.owl"],input_format='owl',output="upheno",output_format='tsv')

This produces the same output, complete with a newlines-only nodes file.
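One way to confirm that a file really contains nothing but newlines, rather than judging by `head` output, is to count its distinct bytes. The sketch below does that with the standard library only; `byte_histogram` and `is_newlines_only` are hypothetical helper names, and treating carriage returns as newline bytes is an assumption.

```python
from collections import Counter
from pathlib import Path
from typing import Dict


def byte_histogram(path: str, chunk_size: int = 1 << 20) -> Dict[bytes, int]:
    """Count how often each distinct byte value occurs in a file."""
    counts: Counter = Counter()
    with Path(path).open("rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            counts.update(chunk)  # iterating over bytes yields ints 0-255
    # Re-key by single-byte strings so the result is easier to read.
    return {bytes([value]): n for value, n in counts.items()}


def is_newlines_only(path: str) -> bool:
    """True if the file is non-empty and holds nothing but LF/CR bytes."""
    seen = set(byte_histogram(path))
    return bool(seen) and seen <= {b"\n", b"\r"}
```

For a file like the container's upheno2_nodes.tsv, a histogram with a single `b"\n"` key whose count equals the file size in bytes would confirm the observation above.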

caufieldjh commented 2 years ago

Currently avoiding this (as of 9217ff2266d67a57e0ca40f9ad1d54789f251b54) by disabling the uPheno2 ETL steps entirely. Will leave this issue open for now, as discrepancies between environments that yield different transform output are likely to come up again.
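Since the cause was not identified in this thread, a first step for any future discrepancy could be to record the same facts about each environment and compare them. The sketch below is not a diagnosis: the fields it collects (Python version, platform, encodings, installed package versions) are guesses at what might differ, and `environment_snapshot` and `diff_packages` are illustrative names.

```python
import locale
import platform
import sys
from importlib import metadata
from typing import Dict, Optional, Tuple


def environment_snapshot() -> Dict[str, object]:
    """Collect basic facts about the running Python environment."""
    packages: Dict[str, str] = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            packages[name.lower()] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "packages": dict(sorted(packages.items())),
    }


def diff_packages(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (version_in_a, version_in_b)} for every mismatch."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}
```

Running `environment_snapshot()` both locally and in the container, saving each result (for example with `json.dumps`), and passing the two `packages` dictionaries to `diff_packages` gives a short list of version mismatches to investigate.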