Open caufieldjh opened 2 years ago
Just ran a fresh transform on uPheno2 here, and the results are non-empty:
~/kg-idg/data/transformed/upheno2$ ls -lh
total 137M
-rw-r--r-- 1 harry harry 72M Dec 16 13:29 upheno2_edges.tsv
-rw-r--r-- 1 harry harry 66M Dec 16 13:29 upheno2_nodes.tsv
$ head upheno2_edges.tsv
id subject predicate object category relation knowledge_source logical_interpretation
HP:3000070-biolink:subclass_of-UPHENO:0075973 HP:3000070 biolink:subclass_of UPHENO:0075973 rdfs:subClassOf upheno_all_with_relations.owl
urn:uuid:8ccea3e4-0906-4eb0-bfec-231e4c82244a HP:3000070 biolink:affects UBERON:0008595 biolink:Association UPHENO:0000001 upheno_all_with_relations.owl owlstar:AllSomeInterpretation
HP:3000070-biolink:related_to-UBERON:0008595 HP:3000070 biolink:related_to UBERON:0008595 UPHENO:0000003 upheno_all_with_relations.owl
HP:3000070-biolink:subclass_of-HP:0430019 HP:3000070 biolink:subclass_of HP:0430019 rdfs:subClassOf upheno_all_with_relations.owl
UPHENO:0075973-biolink:subclass_of-UPHENO:0002908 UPHENO:0075973 biolink:subclass_of UPHENO:0002908 rdfs:subClassOf upheno_all_with_relations.owl
ZP:0000059-biolink:subclass_of-ZP:0005188 ZP:0000059 biolink:subclass_of ZP:0005188 rdfs:subClassOf upheno_all_with_relations.owl
ZP:0000059-biolink:related_to-OBO:ZFA_0001249 ZP:0000059 biolink:related_to OBO:ZFA_0001249 UPHENO:0000003 upheno_all_with_relations.owl
ZP:0005188-biolink:subclass_of-UPHENO:0081752 ZP:0005188 biolink:subclass_of UPHENO:0081752 rdfs:subClassOf upheno_all_with_relations.owl
ZP:0005188-biolink:subclass_of-ZP:0138517 ZP:0005188 biolink:subclass_of ZP:0138517 rdfs:subClassOf upheno_all_with_relations.owl
$ head upheno2_nodes.tsv
id category name description xref provided_by synonym 0000233 0000424 0000425 0000426 0000589 0100001 :http://geneontology.org/formats/oboInOwl#created_by :http://purl.obolibrary.org/obo/chebi/charge :http://purl.obolibrary.org/obo/chebi/formula :http://purl.obolibrary.org/obo/chebi/inchi :http://purl.obolibrary.org/obo/chebi/inchikey :http://purl.obolibrary.org/obo/chebi/mass :http://purl.obolibrary.org/obo/chebi/monoisotopicmass :http://purl.obolibrary.org/obo/chebi/smiles :http://purl.org/spar/cito/citesAsAuthority :http://www.w3.org/2004/02/skos/core#closeMatch :ttp://www.geneontology.org/formats/oboInOwl#created_by BFO_CLIF_specification_label BFO_OWL_specification_label Date alternative_term author cl#created_by comment consider contributor created_by creation_date creator curator_note definition_source depicted_by deprecated editor_note editor_preferred_term elucidation example_of_usage fypo#usage has_alternative_id has_associated_axiom(fol) has_associated_axiom(nl) has_db_xref has_o_b_o_namespace hsapdv#end_dpf hsapdv#end_mpb hsapdv#end_ypb hsapdv#start_dpf hsapdv#start_mpb hsapdv#start_ypb imported_from in_subset is_a_defining_property_chain_axiom is_class_level is_cyclic is_defined_by is_metadata_tag is_transitive logical_macro_assertion_on_an_annotation_property m_g_p_o_0002032 m_g_p_o_0002037 m_g_p_o_0002098 note page present_in_taxon see_also shorthand source term_editor type u_b_p_r_o_p_0000001 u_b_p_r_o_p_0000002 u_b_p_r_o_p_0000003 u_b_p_r_o_p_0000004 u_b_p_r_o_p_0000005 u_b_p_r_o_p_0000006 u_b_p_r_o_p_0000007 u_b_p_r_o_p_0000008 u_b_p_r_o_p_0000009 u_b_p_r_o_p_0000010 u_b_p_r_o_p_0000011 u_b_p_r_o_p_0000012 u_b_p_r_o_p_0000013 u_b_p_r_o_p_0000014 u_b_p_r_o_p_0000015 u_b_p_r_o_p_0000103 u_b_p_r_o_p_0000104 u_b_p_r_o_p_0000105 u_b_p_r_o_p_0000106 u_b_p_r_o_p_0000107 u_b_p_r_o_p_0000108 u_b_p_r_o_p_0000111 u_b_p_r_o_p_0000112
HP:3000070 biolink:NamedThing Abnormality of levator anguli oris (HPO) An abnormality of a levator anguli oris. upheno_all_with_relations.owl vasilevs 2015-08-07T03:38:48Z UMLS:C4073277 human_phenotype owl:Class
UPHENO:0075973 biolink:NamedThing abnormal levator anguli oris Abnormality of levator anguli oris. upheno_all_with_relations.owl abnormality of levator anguli oris owl:Class
ZP:0000059 biolink:NamedThing exocrine pancreas mislocalised anteriorly, abnormal (ZPO) Abnormal(ly) mislocalised anteriorly (of) exocrine pancreas. upheno_all_with_relations.owl owl:Class
ZP:0005188 biolink:NamedThing exocrine pancreas mislocalised, abnormal (ZPO) Abnormal(ly) mislocalised (of) exocrine pancreas. upheno_all_with_relations.owl owl:Class
FYPO:0004937 biolink:NamedThing decreased RNA level during meiosis I (FYPO) A cell phenotype in which the amount of RNA measured in a cell is lower than normal during the first meiotic nuclear division. Total RNA or a specific RNA may be affected. upheno_all_with_relations.owl decreased RNA accumulation during meiosis I|decreased transcript level during meiosis I|decreased RNA level during first meiotic nuclear division|reduced RNA level during meiosis I Consider annotating to a term describing abnormal transcription or abnormal regulation of transcription, but note that changes in RNA levels may result from changes in RNA stability as well as changes in transcription. We recommend noting which RNA(s) were used in the assay when annotating to this term. midori 2015-10-14T13:19:59Z fission_yeast_phenotype owl:Class
FYPO:0002959 biolink:NamedThing decreased RNA level during meiosis (FYPO) A cell phenotype in which the amount of RNA measured in a cell is lower than normal during one or both meiotic nuclear divisions. Total RNA or a specific RNA may be affected. upheno_all_with_relations.owl decreased RNA accumulation during meiosis|reduced RNA level during meiosis|decreased RNA level during meiotic nuclear division|decreased transcript level during meiosis Consider annotating to a term describing abnormal transcription or abnormal regulation of transcription, but note that changes in RNA levels may result from changes in RNA stability as well as changes in transcription. We recommend noting which RNA(s) were used in the assay when annotating to this term. midori 2013-12-04T15:55:26Z fission_yeast_phenotype owl:Class
MP:0020311 biolink:NamedThing decreased hydroxymethylbilane synthase activity (MPO) reduced ability of to catalyze the reaction: H(2)O + 4 porphobilinogen = hydroxymethylbilane + 4 NH(4)(+). upheno_all_with_relations.owl decreased uroporphyrinogen synthetase activity|decreased (4-[2-carboxyethyl]-3-[carboxymethyl]pyrrol-2-yl)methyltransferase (hydrolysing)|decreased pre-uroporphyrinogen synthase activity|decreased HMB synthase activity|decreased uroporphyrinogen I synthase activity|decreased uroporphyrinogen synthase activity|decreased porphobilinogen:(4-[2-carboxyethyl]-3-[carboxymethyl]pyrrol-2-yl)methyltransferase (hydrolysing)|decreased porphobilinogen deaminase activity|decreased (4-(2-carboxyethyl)-3-(carboxymethyl)pyrrol-2-yl)methyltransferase (hydrolyzing) activity|decreased uroporphyrinogen I synthetase activity|decreased HMB-synthase activity|decreased porphobilinogen ammonia-lyase (polymerizing) owl:Class
MP:0020310 biolink:NamedThing abnormal hydroxymethylbilane synthase activity (MPO) altered ability of to catalyze the reaction: H(2)O + 4 porphobilinogen = hydroxymethylbilane + 4 NH(4)(+). upheno_all_with_relations.owl abnormal pre-uroporphyrinogen synthase activity|abnormal (4-(2-carboxyethyl)-3-(carboxymethyl)pyrrol-2-yl)methyltransferase (hydrolyzing) activity|abnormal porphobilinogen deaminase activity|abnormal (4-[2-carboxyethyl]-3-[carboxymethyl]pyrrol-2-yl)methyltransferase (hydrolysing)|abnormal uroporphyrinogen synthase activity|abnormal HMB-synthase activity|abnormal porphobilinogen ammonia-lyase (polymerizing)|abnormal uroporphyrinogen I synthetase activity|abnormal uroporphyrinogen synthetase activity|abnormal porphobilinogen:(4-[2-carboxyethyl]-3-[carboxymethyl]pyrrol-2-yl)methyltransferase (hydrolysing)|abnormal HMB synthase activity|abnormal uroporphyrinogen I synthase activity owl:Class
ZP:0018110 biolink:NamedThing cellular amino acid metabolic process disrupted, abnormal (ZPO) Abnormal(ly) disrupted (of) cellular amino acid metabolic process. upheno_all_with_relations.owl owl:Class
A minimal merge with the following config:
---
configuration:
  output_directory: data/merged
  checkpoint: false

merged_graph:
  name: IDG graph
  source:
    upheno2:
      name: "uPheno2"
      input:
        format: tsv
        filename:
          - data/transformed/upheno2/upheno2_edges.tsv
          - data/transformed/upheno2/upheno2_nodes.tsv
  operations:
    - name: kgx.graph_operations.summarize_graph.generate_graph_stats
      args:
        graph_name: IDG Graph
        filename: merged_graph_stats.yaml
        node_facet_properties:
          - provided_by
        edge_facet_properties:
          - provided_by
  destination:
    merged-kg-tsv:
      format: tsv
      compression: tar.gz
      filename: merged-kg-test-upheno
does not produce an empty file:
$ head merged-kg-test-upheno_edges.tsv
id subject predicate object category relation knowledge_source logical_interpretation
XPO:0128846-biolink:subclass_of-XPO:0128007 XPO:0128846 biolink:subclass_of XPO:0128007 rdfs:subClassOf Graph
XPO:0128846-biolink:subclass_of-XPO:0128037 XPO:0128846 biolink:subclass_of XPO:0128037 rdfs:subClassOf Graph
XPO:0128846-biolink:subclass_of-XPO:0102052 XPO:0128846 biolink:subclass_of XPO:0102052 rdfs:subClassOf Graph
XPO:0128846-biolink:related_to-OBO:XAO_0000277 XPO:0128846 biolink:related_to OBO:XAO_0000277 UPHENO:0000003 Graph
XPO:0128007-biolink:subclass_of-UPHENO:0068971 XPO:0128007 biolink:subclass_of UPHENO:0068971 rdfs:subClassOf Graph
XPO:0128007-biolink:subclass_of-UPHENO:0012541 XPO:0128007 biolink:subclass_of UPHENO:0012541 rdfs:subClassOf Graph
XPO:0128007-biolink:subclass_of-XPO:0101638 XPO:0128007 biolink:subclass_of XPO:0101638 rdfs:subClassOf Graph
XPO:0128007-biolink:related_to-OBO:XAO_0003000 XPO:0128007 biolink:related_to OBO:XAO_0003000 UPHENO:0000003 Graph
XPO:0128037-biolink:subclass_of-UPHENO:0068971 XPO:0128037 biolink:subclass_of UPHENO:0068971 rdfs:subClassOf Graph
...
$ wc -l merged-kg-test-upheno_*.tsv
506778 merged-kg-test-upheno_edges.tsv
185445 merged-kg-test-upheno_nodes.tsv
692223 total
Running the identical process locally (i.e., the following:
python run.py download
python run.py transform
python run.py merge
)
does not raise any pandas errors, and the resulting graph appears to contain the uPheno2 contents as expected.
This issue appears to be completely specific to the Jenkins build.
A minimal config that downloads, transforms, and merges uPheno only (3df4b918dea20a9af8495e38bd404992cd10294a) also breaks on Jenkins.
This error also happens with a minimal run (again, with 3df4b91) on the Docker image alone (caufieldjh/ubuntu20-python-3-8-5-dev:4-with-dbs-v6), i.e., when the image is run locally rather than in Jenkins. So there's something that isn't happening properly in the Docker environment.
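One way to narrow down what differs between the host and the Docker image is to print the interpreter and package versions in both places and diff the results. The following is only a sketch: the package list (kgx, rdflib, pandas) is an assumption about what is relevant here, and the function name `environment_report` is made up for illustration.

```python
import platform
import sys
from importlib import metadata  # in the standard library since Python 3.8


def environment_report(packages):
    """Return a dict of interpreter details and installed package versions."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so both reports stay comparable.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # The package list is an assumption; extend it as needed.
    for key, value in environment_report(["kgx", "rdflib", "pandas"]).items():
        print(f"{key}\t{value}")
```

Running this once on the host and once inside the container, then diffing the two outputs, should show whether any of these versions differ.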
The issue isn't with the merge, it's with the transform: in Docker, the transform for upheno2 produces node and edge files that differ from those produced when the transform is run locally (i.e., not in the container). Locally:
$ ls -l data/transformed/upheno2/
total 139848
-rw-r--r-- 1 harry harry 74755108 Dec 22 15:57 upheno2_edges.tsv
-rw-r--r-- 1 harry harry 68447459 Dec 22 15:57 upheno2_nodes.tsv
in container:
# ls -l data/transformed/upheno2/
total 73472
-rw-r--r-- 1 root root 75043894 Dec 23 17:56 upheno2_edges.tsv
-rw-r--r-- 1 root root 185447 Dec 23 17:56 upheno2_nodes.tsv
The major issue is that the nodesfile is entirely newlines. So is this something KGX is doing differently with this file in this environment?
# python3
Python 3.8.5 (default, May 27 2021, 13:30:53)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from kgx.cli.cli_utils import transform
RDFLib Version: 5.0.0
>>> transform(inputs=["upheno_all_with_relations.owl"],input_format='owl',output="upheno",output_format='tsv')
This produces the same output, complete with a newlines-only nodesfile.
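To confirm the newlines-only symptom without opening the file by hand, a small check like the one below can summarize a transform output. It is a rough sketch using only the standard library; the helper names `summarize_tsv` and `is_newlines_only` are hypothetical and not part of KGX.

```python
import os


def summarize_tsv(path):
    """Return size in bytes, total line count, and non-blank line count."""
    total = 0
    non_blank = 0
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            total += 1
            # Lines holding only whitespace (including tabs) count as blank.
            if line.strip():
                non_blank += 1
    return {"bytes": os.path.getsize(path), "lines": total, "non_blank": non_blank}


def is_newlines_only(path):
    """True when the file has at least one line but none with any content."""
    stats = summarize_tsv(path)
    return stats["lines"] > 0 and stats["non_blank"] == 0
```

Calling `summarize_tsv("data/transformed/upheno2/upheno2_nodes.tsv")` in both environments gives numbers that can be compared directly, and `is_newlines_only` flags the broken case.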
Currently avoiding this (as of 9217ff2266d67a57e0ca40f9ad1d54789f251b54) by disabling the uPheno2 ETL steps entirely. Will leave this issue open for now, as discrepancies between environments that yield different transform output are likely to come up again.
Describe the bug
As of build 25 on master, the merge step fails. As the error says, the uPheno TSV (node, edge, or both) is empty, so it doesn't load.
To Reproduce
or see build 25
Expected behavior
The input for the merge, as defined in merge.yaml, should not be empty.
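A guard along these lines could run before the merge to fail early with a clearer message when an input is unusable. This is a sketch under assumptions: the paths are passed in explicitly rather than parsed from merge.yaml, "usable" is taken to mean a header plus at least one data row, and `check_merge_inputs` is a hypothetical helper, not something KGX or run.py provides.

```python
def check_merge_inputs(paths):
    """Raise ValueError if any TSV lacks a header plus at least one data row."""
    problems = []
    for path in paths:
        non_blank = 0
        try:
            with open(path, "r", encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    if line.strip():
                        non_blank += 1
                        if non_blank >= 2:
                            break  # header and one data row are enough
        except OSError as err:
            problems.append(f"{path}: cannot read ({err})")
            continue
        if non_blank < 2:
            problems.append(f"{path}: only {non_blank} non-blank line(s)")
    if problems:
        raise ValueError("Unusable merge input: " + "; ".join(problems))
```

For example, calling `check_merge_inputs(["data/transformed/upheno2/upheno2_edges.tsv", "data/transformed/upheno2/upheno2_nodes.tsv"])` before `python run.py merge` would stop the run at the transform output rather than at the pandas load.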
Version
d21ccb9dd46af0acd0773674fad1e3a1e71bb8c8