Knowledge-Graph-Hub / kg-idg

A Knowledge Graph to Illuminate the Druggable Genome
https://knowledge-graph-hub.github.io/kg-idg/
BSD 3-Clause "New" or "Revised" License

Human Protein Atlas edges not present in merged KG #64

Closed caufieldjh closed 2 years ago

caufieldjh commented 2 years ago

Describe the bug

Edges of type biolink:expressed_in appear to be correctly produced by the ProteinAtlasTransform, but are not present in the merged graph.

To Reproduce

python run.py download
python run.py transform
# or, to run a single transform:
# python run.py transform -s ProteinAtlasTransform
python run.py merge
$ grep biolink:expressed_in data/merged/merged-kg_edges.tsv | wc -l
0
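
A per-predicate tally can be more informative than a single grep count, since it also shows which predicates did make it into the merged file. The following is a minimal sketch, assuming the edge file is tab-separated with a header row containing a `predicate` column (as in the KGX TSV output shown in this issue). The function name `count_predicates` is hypothetical and not part of kg-idg or KGX.

```python
import csv
from collections import Counter


def count_predicates(edges_path):
    """Count edges per predicate in a tab-separated edge file.

    Assumes a header row with a 'predicate' column; rows lacking a
    value for that column are tallied under '<missing>'.
    """
    counts = Counter()
    with open(edges_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            counts[row.get("predicate") or "<missing>"] += 1
    return counts
```

Calling `count_predicates("data/merged/merged-kg_edges.tsv")["biolink:expressed_in"]` should return 0 in the failing case. Unlike grep, this only looks at the predicate column, so a match elsewhere on the line is not counted.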

Transformed edges look like this:

uuid:92b106b2-5e89-11ec-bbc2-00155d00d735       UniProtKB:Q3KRB8        biolink:expressed_in    GO:0031982      biolink:GeneToExpressionSiteAssociation|biolink:Association     RO:0002206              Human Protein Atlas

and transformed nodes:

GO:0031982      biolink:AnatomicalEntity|biolink:NamedThing     vesicle         Human Protein Atlas

GO terms are present in the merged graph, but GO is its own ingest, so their presence does not confirm that the HPA nodes were merged.

Expected behavior

Transformed edges such as that above should be present in the merged graph.
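
To check this expectation directly, the transformed and merged edge files can be compared. This is a rough sketch, assuming both files are tab-separated with `subject`, `predicate`, and `object` columns, and assuming that a (subject, predicate, object) triple is enough to identify an edge (edge ids are ignored). The helper names `edge_triples` and `missing_edges` are hypothetical.

```python
import csv


def edge_triples(edges_path):
    """Return the set of (subject, predicate, object) triples in an edge file."""
    triples = set()
    with open(edges_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            triples.add((row["subject"], row["predicate"], row["object"]))
    return triples


def missing_edges(transformed_path, merged_path):
    """Return triples present in the transformed file but absent from the merged one."""
    return edge_triples(transformed_path) - edge_triples(merged_path)
```

For example, `missing_edges("data/transformed/hpa/hpa-data_edges.tsv", "data/merged/merged-kg_edges.tsv")` should return an empty set once the merge behaves as expected.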

Version

d21ccb9dd46af0acd0773674fad1e3a1e71bb8c8

caufieldjh commented 2 years ago

The intended association class is GeneToExpressionSiteAssociation

caufieldjh commented 2 years ago

A minimal merge with the following merge.yaml:

---
configuration:
  output_directory: data/merged
  checkpoint: false
merged_graph:
  name: IDG graph
  source:
    hpa:
      name: "Human Protein Atlas"
      input:
        format: tsv
        filename:
          - data/transformed/hpa/hpa-data_nodes.tsv
          - data/transformed/hpa/hpa-data_edges.tsv
  operations:
    - name: kgx.graph_operations.summarize_graph.generate_graph_stats
      args:
        graph_name: IDG Graph
        filename: merged_graph_stats.yaml
        node_facet_properties:
          - provided_by
        edge_facet_properties:
          - provided_by
  destination:
    merged-kg-tsv:
      format: tsv
      compression: tar.gz
      filename: merged-kg-test

appears to correctly add the expressed_in edges:

$ head merged-kg-test_edges.tsv 
id      subject predicate       object  category        relation        provided_by     knowledge_source        source
uuid:88b84f62-5e89-11ec-bbc2-00155d00d735       UniProtKB:O43657        biolink:expressed_in    GO:0005829      biolink:GeneToExpressionSiteAssociation|biolink:Association      RO:0002206              Graph   Human Protein Atlas
uuid:88b8b33a-5e89-11ec-bbc2-00155d00d735       UniProtKB:Q8IZE3        biolink:expressed_in    GO:0005874      biolink:GeneToExpressionSiteAssociation|biolink:Association      RO:0002206              Graph   Human Protein Atlas
uuid:88b8e03a-5e89-11ec-bbc2-00155d00d735       UniProtKB:Q9NSG2        biolink:expressed_in    GO:0005739      biolink:GeneToExpressionSiteAssociation|biolink:Association      RO:0002206              Graph   Human Protein Atlas

caufieldjh commented 2 years ago

This issue appears to have resolved itself:

$ grep biolink:expressed_in merged-kg_edges.tsv | wc -l
12762
$ grep "Human Protein Atlas" merged-kg_edges.tsv | wc -l
12762
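
The two grep counts agree, which suggests that the expressed_in edges and the Human Protein Atlas edges are the same rows, but equal counts alone do not prove it. A small sketch for confirming the overlap, assuming a tab-separated file whose header includes a `predicate` column; like the second grep, the source text is matched in any field. The function name `overlap_counts` is hypothetical.

```python
import csv


def overlap_counts(edges_path,
                   predicate="biolink:expressed_in",
                   source_text="Human Protein Atlas"):
    """Count rows with the predicate, rows mentioning the source, and rows with both."""
    counts = {"predicate": 0, "source": 0, "both": 0}
    with open(edges_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header is None:
            return counts  # empty file: nothing to count
        pred_idx = header.index("predicate")
        for fields in reader:
            has_pred = len(fields) > pred_idx and fields[pred_idx] == predicate
            has_src = any(source_text in field for field in fields)
            counts["predicate"] += has_pred
            counts["source"] += has_src
            counts["both"] += has_pred and has_src
    return counts
```

If all three values returned for `merged-kg_edges.tsv` are equal, every expressed_in edge is attributed to Human Protein Atlas and vice versa.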