biolink / kgx

KGX is a Python library for exchanging Knowledge Graphs
https://kgx.readthedocs.io
BSD 3-Clause "New" or "Revised" License
114 stars 26 forks source link

Obsolete/Deprecated property is not populated when converting from obojson #470

Closed kevinschaper closed 10 months ago

kevinschaper commented 10 months ago

Looking at kg-phenio and kg-obo, (and digging into the kg-phenio build), it looks like kgx doesn't produce a deprecated column when converting from obojson.

Related to / inspired by: https://github.com/monarch-initiative/monarch-app/issues/421

caufieldjh commented 10 months ago

Looks like it's parsed from obograph: https://github.com/biolink/kgx/blob/4e2d410fbc9b130932f3ad064239ee22044e84c1/kgx/source/obograph_source.py#L334-L336 but the equivalent JSON and OWL parsers don't explicitly parse the field.

kevinschaper commented 10 months ago

@caufieldjh It seems like that means that there isn't exactly a bug in kgx, but we do need to figure out how to pass the node properties into

            transform(
                inputs=[data_file_json],
                input_format="obojson",
                output=os.path.join(self.output_dir, name),
                output_format="tsv",
                stream=False,
            )

Looking at the method declaration, it seems like transform_config is the only way, but since we don't have a config yaml currently, maybe we'd want to add the option to pass in node_properties?

def transform(
    inputs: Optional[List[str]],
    input_format: Optional[str] = None,
    input_compression: Optional[str] = None,
    output: Optional[str] = None,
    output_format: Optional[str] = None,
    output_compression: Optional[str] = None,
    stream: bool = False,
    node_filters: Optional[List[Tuple[str, str]]] = None,
    edge_filters: Optional[List[Tuple[str, str]]] = None,
    transform_config: str = None,
    source: Optional[List] = None,
    knowledge_sources: Optional[List[Tuple[str, str]]] = None,
    # this parameter doesn't get used, but I leave it in
    # for now, in case it signifies an unimplemented concept
    # destination: Optional[List] = None,
    processes: int = 1,
    infores_catalog: Optional[str] = None,
)
caufieldjh commented 10 months ago

I agree - just need the extra param for the transform function, though this does suggest that using a config yaml instead may be a good idea. Is that feasible for the Monarch builds?

kevinschaper commented 10 months ago

The transform call I pasted in above is from kg-phenio, I think that's where we'd need to add config yaml

caufieldjh commented 10 months ago

Ah, well in that case it should be easy to test. I'll open an issue over there and start a PR.

caufieldjh commented 10 months ago

It also looks like the way the deprecated flag is parsed in obograph_source misses many potential instances, since it only looks at meta. If I go from owl -> json -> tsv (as in kg-phenio or other kgs) then this appears to miss the majority of the deprecated flags. They just don't end up in the output tsv nodes.

RichardBruskiewich commented 10 months ago

I think this issue may now be partially resolved by my PR https://github.com/biolink/kgx/pull/475 - which is still a draft PR for the moment.

However, it may be helpful @caufieldjh and @kevinschaper to have you give me a small sample file of the source data from which you feel KGX is not propagating the 'deprecated' flag.

Better yet, if you can, please send me a small snippet of Python KGX transform code which resembles the code what you are currently doing in your (phenio?) code base.

In that way, I can use both the sample data and your snippet to add another unit test for the owl -> json -> tsv transitive transformation, then use it to check if there are any other loose ends in the KGX code that need fixing.

caufieldjh commented 10 months ago

Thanks @RichardBruskiewich ! The source data (from phenio.json) looks like this:

 {
      "id" : "http://purl.obolibrary.org/obo/GO_0051370",
      "lbl" : "obsolete ZASP binding",
      "type" : "INDIVIDUAL",
      "meta" : {
        "definition" : {
          "val" : "OBSOLETE. Binding to Z-band alternatively spliced PDZ motif protein (ZASP). ZASP is a Z-band protein specifically expressed in heart and skeletal muscle. This protein contains N-terminal PDZ domain and C-terminal LIM domain.",
          "xrefs" : [ "PMID:10427098", "PMID:11699871" ]
        },
        "comments" : [ "This term was made obsolete because it represents binding to an individual protein." ],
        "synonyms" : [ {
          "pred" : "hasExactSynonym",
          "val" : "Z-band alternatively spliced PDZ-motif protein binding"
        }, {
          "pred" : "hasExactSynonym",
          "val" : "ZASP binding"
        } ],
        "basicPropertyValues" : [ {
          "pred" : "http://purl.obolibrary.org/obo/IAO_0100001",
          "val" : "GO:0008092"
        }, {
          "pred" : "http://www.geneontology.org/formats/oboInOwl#hasOBONamespace",
          "val" : "molecular_function"
        } ],
        "deprecated" : true
      }
caufieldjh commented 10 months ago

The transform kg-phenio uses is this one: https://github.com/Knowledge-Graph-Hub/kg-phenio/blob/master/kg_phenio/transform_utils/phenio/phenio_transform.py with the specific obojson -> kgx tsv transform code as this: https://github.com/Knowledge-Graph-Hub/kg-phenio/blob/c3f7d0bc0f38ccd60a2724510edebe4cab63b659/kg_phenio/transform_utils/phenio/phenio_transform.py#L122-L129

RichardBruskiewich commented 10 months ago

I assume that the phenio.json data is normally properly wrapped within a suitable KGX-like objson-like wrapper, something like:

{
  "graphs": [
    {
      "nodes": [
        {
          "id": "http://purl.obolibrary.org/obo/GO_0051370",
          "lbl": "obsolete ZASP binding",
          ...etc,
            "deprecated": true
          }
        }
      ],
      "edges": [],
      "id": "http://purl.obolibrary.org/obo/phenio.owl",
      "meta": {
        "subsets": [],
        "xrefs": [],
        "basicPropertyValues": []
      }
    }
  ]
}

Using this basic structure, I confirm that the obograph source parser detects the deprecated property in the data.

I will now iterate towards a unit test to detect transitive conservation of the deprecated status across various transformations.

caufieldjh commented 10 months ago

I assume that the phenio.json data is normally properly wrapped within a suitable KGX-like objson-like wrapper, something like:

Yes, it is - apologies for the rough snippet

RichardBruskiewich commented 10 months ago

Hi @caufieldjh and @kevinschaper , I think the PR https://github.com/biolink/kgx/pull/475 is close to completed. I added a unit test along the lines of the 'transform' snippet above, with the phenio.json, and it seems to pass.

I suppose I also now need to check the owl <-> (json or tsv) path too?

RichardBruskiewich commented 10 months ago

Fully resolved by https://github.com/biolink/kgx/pull/475 which is now released in KGX release v2.2.5

caufieldjh commented 10 months ago

Thanks again @RichardBruskiewich

RichardBruskiewich commented 10 months ago

Mainstream use of 'deprecate' flag on instances (not classes) of biolink:Entity now supported by PR https://github.com/biolink/biolink-model/pull/1421; now supported in Biolink Model release 3.6.0