biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

biolink-model RDF is unusable #667

Open balhoff opened 3 years ago

balhoff commented 3 years ago

Hi,

I opened issues related to this before (e.g. #301 more than a year ago) but I think I made a mistake in closing that when opening #397. The simplified prefix.yaml proposal is still desired, but it doesn't fix the big problem that biolink-model.ttl is filled with incorrect prefix expansions. Currently I use this file in a hacky way by manually changing a bunch of identifiers using sed, but it isn't a 100% fix, and it's really inconvenient.

sierra-moxon commented 3 years ago

@balhoff - confirming I understand, this bug is related to: https://github.com/biolink/biolink-model/issues/652 (Review and curate Biolink Model prefix/URI namespaces for internet resolvability)? :)

balhoff commented 3 years ago

@sierra-moxon unfortunately not. :-) It's a very twisted problem related to the JSON-LD context limitations described farther down in #301, and caused by some surprise incompatible changes in JSON-LD 1.0. One fix would be to use new features in JSON-LD 1.1 contexts to force allowing prefixes ending in an underscore. But linkml doesn't have a JSON-LD 1.1 processor right now.

So I wanted to open this to make sure the severity of this problem wasn't lost. Maybe there is something else that can be done without waiting for JSON-LD 1.1 support.
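
For readers unfamiliar with the restriction: as I read the JSON-LD 1.1 spec, a term is only usable as a prefix if it has a simple (string) definition whose IRI ends in a URI gen-delims character, or an expanded definition that sets `@prefix` to true. OBO-style namespaces end in `_`, so they miss the first case. Below is a minimal sketch of that rule. The function name `usable_as_prefix` and the sample context are illustrative assumptions, not part of linkml or any JSON-LD library.

```python
# RFC 3986 gen-delims: a simple term whose IRI ends in one of these can act as a prefix.
GEN_DELIMS = set(":/?#[]@")


def usable_as_prefix(definition):
    """Approximate the JSON-LD 1.1 rule for when a term may serve as a CURIE prefix."""
    if isinstance(definition, str):
        # Simple term definition: only the final character of the IRI matters.
        return definition[-1:] in GEN_DELIMS
    if isinstance(definition, dict):
        # Expanded term definition: needs an explicit boolean "@prefix": true.
        return definition.get("@prefix") is True
    return False


# Hypothetical context entries, for illustration only.
context = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "RO": "http://purl.obolibrary.org/obo/RO_",
    "RO_fixed": {"@id": "http://purl.obolibrary.org/obo/RO_", "@prefix": True},
}

for term, definition in context.items():
    print(term, usable_as_prefix(definition))
```

Under this reading, `skos` and `RO_fixed` qualify as prefixes while the plain-string `RO` entry does not.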

cmungall commented 3 years ago

very sorry about this.

can someone make a minimal example that replicates the problem and make a linkml ticket?

from #397 it seems you had someone working on this @balhoff ... but I don't see any open PRs?

balhoff commented 3 years ago

> from #397 it seems you had someone working on this @balhoff ... but I don't see any open PRs?

@hsolbrig merged https://github.com/biolink/biolinkml/pull/262 after a couple of fixes. That PR updated jsonldcontextgen to include the new "@prefix": "true" entries (which are required for JSON-LD 1.1 to handle prefixes ending with underscores). It also adds a new command prefixmapgen which generates a simple YAML file containing a prefix dictionary (it doesn't seem like biolink-model is using this yet).

I'm not sure how close linkml is to having a JSON-LD 1.1-powered prefix expansion system. @hsolbrig what do you think? I'm wondering if in the interim some more hacky approach should be taken, because the current RDF just contains incorrect IRIs.
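
One shape such an interim approach could take is to expand CURIEs directly from a flat prefix dictionary, without going through a JSON-LD processor at all. The sketch below assumes a mapping like the one `prefixmapgen` is described as producing. The `PREFIX_MAP` entries and the `expand_curie` helper are illustrative assumptions, not linkml code.

```python
# Hypothetical prefix dictionary; in practice it would be loaded from the generated YAML.
PREFIX_MAP = {
    "RO": "http://purl.obolibrary.org/obo/RO_",
    "CHEBI": "http://purl.obolibrary.org/obo/CHEBI_",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def expand_curie(curie, prefix_map=PREFIX_MAP):
    """Expand 'PREFIX:local' by plain concatenation; fail loudly on unknown prefixes."""
    prefix, sep, local = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefix_map:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefix_map[prefix] + local


print(expand_curie("RO:0002432"))
# prints http://purl.obolibrary.org/obo/RO_0002432
```

Raising on an unknown prefix, rather than passing the CURIE through unchanged, is a deliberate choice in this sketch: a silently unexpanded CURIE would end up in the RDF as a malformed IRI.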

sierra-moxon commented 3 years ago

https://github.com/biolink/biolink-model/issues/394 is another issue that will be aided by work done on this ticket, closing #394 as a duplicate.

cmungall commented 3 years ago

apparently the upstream issue is fixed, so does this now work?

balhoff commented 3 years ago

One indicator will be that this line:

https://github.com/biolink/biolink-model/blob/802a45b7bba95abc9458f96256b6f70a5477e0d1/biolink-model.ttl#L5613

says:

skos:exactMatch <http://purl.obolibrary.org/obo/RO_0002432> ;

balhoff commented 3 years ago

I would like to keep this open until it's fixed in Biolink.

cmungall commented 3 years ago

that was bizarre: it looks like @deepakunni3 accidentally closed this via his fork, which included my change that prematurely closed this....

and apologies this is taking so long!!!

cmungall commented 3 years ago

any update on this?

balhoff commented 3 years ago

Many improvements, but still a few issues, e.g., term IRIs in here:

<https://w3id.org/biolink/vocab/SequenceVariant> a linkml:ClassDefinition ;
    OIO:inSubset <https://w3id.org/biolink/vocab/model_organism_database> ;
    skos:altLabel "allele" ;
    skos:broadMatch <https://w3id.org/biolink/vocab/SO:0001060> ;
    skos:definition "An allele that varies in its sequence from what is considered the reference allele at that locus." ;
    skos:exactMatch <https://w3id.org/biolink/vocab/GENO:0000002>,
        <https://w3id.org/biolink/vocab/SIO:010277>,
        <https://w3id.org/biolink/vocab/SO:0001059>,
        <vmc:Allele>,
        <wikidata:Q15304597> ;

balhoff commented 2 years ago

One thing I wanted to note is that there are possibly two different kinds of prefix expansion issues here:

balhoff commented 1 year ago

This has gotten worse in several cases:

https://github.com/biolink/biolink-model/blob/c3e290a6bff926be729c4f880c598a3fcdaf5cb0/biolink-model.ttl#L1399-L1402

https://github.com/biolink/biolink-model/blob/c3e290a6bff926be729c4f880c598a3fcdaf5cb0/biolink-model.ttl#L10607

https://github.com/biolink/biolink-model/blob/c3e290a6bff926be729c4f880c598a3fcdaf5cb0/biolink-model.ttl#L10647-L10648

and many more. These prefixes are changed case, not expanded, and then turned into protocols for malformed IRIs.
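
A rough way to find these cases is to scan the Turtle text for IRIs whose scheme is not one you expect. This is only a sketch: the regex does not parse Turtle (it would also match `<...>` inside string literals), and the `EXPECTED_SCHEMES` allowlist, the `suspicious_iris` helper, and the sample text are assumptions made for illustration.

```python
import re

# Schemes considered legitimate; anything else is probably an unexpanded CURIE prefix.
EXPECTED_SCHEMES = {"http", "https", "urn", "mailto"}

# Matches <scheme:rest>, capturing the full IRI (group 1) and its scheme (group 2).
IRI_PATTERN = re.compile(r"<(([A-Za-z][A-Za-z0-9+.\-]*):[^<>\s]*)>")


def suspicious_iris(turtle_text):
    """Return IRIs whose scheme is not in EXPECTED_SCHEMES, in order of first appearance."""
    found = []
    for iri, scheme in IRI_PATTERN.findall(turtle_text):
        if scheme.lower() not in EXPECTED_SCHEMES and iri not in found:
            found.append(iri)
    return found


sample = """
    skos:exactMatch <https://w3id.org/biolink/vocab/GENO:0000002>,
        <vmc:Allele>,
        <wikidata:Q15304597> ;
"""

print(suspicious_iris(sample))
# prints ['vmc:Allele', 'wikidata:Q15304597']
```

Note that this check only catches prefixes that became IRI schemes. It does not flag CURIEs appended to the Biolink vocabulary base, such as `<https://w3id.org/biolink/vocab/GENO:0000002>`, which would need a separate test.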

balhoff commented 1 year ago

Interestingly, the prefixes are correctly expanded in biolink-model.owl.ttl (that file didn't use to have mappings in it).

Compare:

biolink-model.ttl:

<https://w3id.org/biolink/vocab/EnvironmentalFoodContaminant> a linkml:ClassDefinition ;
    skos:inScheme <https://w3id.org/biolink/biolink-model> ;
    skos:relatedMatch <chebi:78299> ;

biolink-model.owl.ttl:

biolink:EnvironmentalFoodContaminant a owl:Class ;
    rdfs:label "environmental food contaminant" ;
    rdfs:subClassOf biolink:ChemicalEntity ;
    skos:relatedMatch <http://purl.obolibrary.org/obo/CHEBI_78299> .

balhoff commented 12 months ago

I took a look at the 3.6.0 version of biolink-model.ttl, and it looks like this problem has gotten a lot worse. I can't find this file for the 4.0.0 release.

https://github.com/biolink/biolink-model/blob/ce4f70988e4141b50fe9e1161d696483094fe192/biolink-model.ttl#L23244-L23263

Edit: maybe this is not exactly the same problem as before, but it is related in that CURIEs from biolink-model.yaml seem to have a lot of trouble being correctly expanded. Rather than being expanded incorrectly, the ones in the link above are not expanded at all, and are turned into invalid IRIs.