atomicdata-dev / atomic-data-docs

Atomic Data is a specification to make it easier to exchange data.
https://docs.atomicdata.dev
MIT License

Mapping to existing RDF ontologies (e.g. Schema.org) #94

joepio opened this issue 2 years ago

joepio commented 2 years ago

Existing RDF ontologies have some problems that Atomic Data solves.

Read more about Atomic Data & RDF.

But there are many ontologies in existence, and these describe various domains quite accurately. It would be great if we could still get the benefits of atomic data, without losing the information stored in these existing ontologies.

Some thoughts / challenges:

Implementation

Add original-url property to Property class

This original-url would be the URL of the RDF predicate. When serializing to RDF, we could opt in to using this URL. Conversely, when importing RDF, we could search for Properties having that predicate as their original-url, and conform to the Atomic Data constraints (namely, that they must resolve to JSON-AD Properties).
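For illustration, a Property carrying such a mapping could look roughly like this in JSON-AD. The @id and the original-url property URL are made-up examples; the atomicdata.dev URLs are the existing core properties:

```json
{
  "@id": "https://example.com/properties/name",
  "https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Property"],
  "https://atomicdata.dev/properties/shortname": "name",
  "https://atomicdata.dev/properties/datatype": "https://atomicdata.dev/datatypes/string",
  "https://atomicdata.dev/properties/description": "The name of a thing, imported from schema.org.",
  "https://example.com/properties/original-url": "https://schema.org/name"
}
```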

However, this would come with a challenge: if a server has multiple Properties with the same original-url value, the server can't decide which one should be used. Malicious agents might even inject resources into the server to mess up mappings.

If we have an explicit mapping resource, we can prevent this.

Mapping resource

A resource that contains a bunch of mappings. This can be referred to while importing RDF.
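One possible shape for such a mapping resource, purely illustrative: all example.com URLs are invented, and the exact datatype of the mapping value is hand-waved here.

```json
{
  "@id": "https://example.com/mappings/schema-org",
  "https://atomicdata.dev/properties/description": "Maps schema.org predicates to Atomic Data Properties for RDF import.",
  "https://example.com/properties/predicateMap": {
    "https://schema.org/name": "https://example.com/properties/name",
    "https://schema.org/description": "https://atomicdata.dev/properties/description"
  }
}
```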

see lenses #102

Credits to @hoijui for sharing many ideas on this topic

hoijui commented 2 years ago

So the way this would optimally work, in my head, is like this:

Somewhere under an AD sub-domain, there are AD proxies for the ~100 most common RDF/OWL ontologies out there. These would have to be auto-generated as much as possible, and the rest should be done in a semi-automated way. For example:

https://github.com/schemaorg/schemaorg/blob/main/data/schema.ttl

would be fed into a script (rdf2ad), together with another file which contains a list of propertyName -> dataType mappings. If any property is missing from that mapping, rdf2ad will print an error message and exit 1; the missing mapping then has to be added manually (see the sketch after the URL example below). Doing it this way, we need relatively little manual work, yet can still deal with changes/different versions of the RDF ontologies pretty well. After converting the RDF ontology to AD, it will be hosted under a URL resembling the original URL, for example:

https://github.com/schemaorg/schemaorg/blob/main/data/schema.ttl

-- converts to -->

# using the source-file URL:
https://rdf-mirror.atomicdata.dev/ontologies/github.com/schemaorg/schemaorg/blob/main/data/schema.ttl.ad

# or the original schema IRI (makes more sense, I think -> easier conversion)
https://rdf-mirror.atomicdata.dev/ontologies/schema.org.ad
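
A minimal sketch of that fail-fast check in Rust. The rdf2ad script is hypothetical, as are the property names in the example data; only the Atomic Data datatype URL is real:

```rust
use std::collections::HashMap;
use std::process::exit;

/// Fail-fast check at the heart of the (hypothetical) rdf2ad script: every
/// property found in the ontology must have an entry in the manually
/// maintained propertyName -> dataType mapping, otherwise exit with code 1.
fn check_mapping(ontology_properties: &[&str], mapping: &HashMap<&str, &str>) {
    let missing: Vec<&str> = ontology_properties
        .iter()
        .filter(|p| !mapping.contains_key(*p))
        .copied()
        .collect();
    if !missing.is_empty() {
        eprintln!("rdf2ad: no dataType mapping for: {missing:?}");
        exit(1);
    }
}

fn main() {
    // Invented example data: two schema.org property names, one unmapped.
    let mapping = HashMap::from([("name", "https://atomicdata.dev/datatypes/string")]);
    check_mapping(&["name", "dateCreated"], &mapping);
}
```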
hoijui commented 1 year ago

To start brainstorming ideas for how to practically go about getting there, I will outline the roadmap I have in my head right now:

  1. Write a script (Bash/Python/Rust?) that creates a table of the most commonly used RDF/OWL(2) ontologies, with one line per released version of each ontology, each line containing at least: IRI, version, raw-data-download-URL

  2. Write a script that syncs the raw-data-download-URLs to the local file-system (a minimal sketch follows this list).

  3. Write a script that collects statistical data over all these ontologies, e.g.:

    • which ontology links to which other
    • ... and how many times
    • both the above in a format suitable to generate a visual graph
    • ...
  4. Start writing a tool (Rust) that converts an RDF/OWL(2) ontology into an AtomicData one. At first, it will only contain classes, properties, and their connections.

  5. Test the tool in that state on all the ontologies.

  6. Write a tool/script to convert a "user" data-set (i.e. the OKH-LOSH data) to AtomicData, in a very much simplified form.

  7. ... and back. -> PoC done!

  8. Improve the tools from steps 4, 6 and 7.
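
For step 2, a minimal sketch of the sync script in Rust, assuming the step-1 table is already available in memory and using the reqwest crate for downloads; the single table row is a placeholder, not a curated entry:

```rust
// Cargo.toml (assumed): reqwest = { version = "0.11", features = ["blocking"] }
use std::fs;
use std::path::Path;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // (IRI, version, raw-data-download-URL) rows as produced by step 1.
    let table = [(
        "https://schema.org/",
        "26.0",
        "https://raw.githubusercontent.com/schemaorg/schemaorg/main/data/schema.ttl",
    )];
    fs::create_dir_all("ontologies")?;
    for (iri, version, url) in table {
        let body = reqwest::blocking::get(url)?.bytes()?;
        // Derive a stable local file name from the IRI and version.
        let safe_iri = iri.replace(|c: char| !c.is_ascii_alphanumeric(), "_");
        let name = format!("{}_{}.ttl", safe_iri, version);
        fs::write(Path::new("ontologies").join(name), &body)?;
    }
    Ok(())
}
```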

joepio commented 1 year ago

Some ideas on how to tackle nr. 4 (assuming you're using schema.org as the .ttl source, and prefer writing stuff in Rust). Consider the following as substeps; a rough end-to-end sketch follows the list.

  1. Parse the owl data in a graph using an RDF library.
  2. Iterate over all Properties, and create Atomic Data Properties from them. You can create JSON-AD strings, but it's perhaps better to use the atomic_lib::Resource struct with the .set_propval + save methods. See example.rs or browse through the tests for inspiration.
  3. Then iterate over the Classes. Do something similar as above.
  4. Now, export your data. You'll get a JSON-AD file with the generated Atomic Data classes and properties.
  5. Convert the URLs to something like atomicdata.dev/ontologies/schema/something/ID
  6. Open a PR for atomic-data-rust that contains a JSON-AD file with these ontologies.
  7. ???
  8. Profit!
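
A rough end-to-end sketch of substeps 1-2, under these assumptions: the rio_turtle crate for parsing (one of several RDF library options, not necessarily the one the project would pick), plain JSON-AD strings rather than atomic_lib::Resource, and invented shortname/description choices:

```rust
// Cargo.toml (assumed): rio_api = "0.8", rio_turtle = "0.8", serde_json = "1"
use rio_api::model::{Subject, Term};
use rio_api::parser::TriplesParser;
use rio_turtle::{TurtleError, TurtleParser};
use serde_json::json;
use std::fs::File;
use std::io::BufReader;

const RDF_TYPE: &str = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
const RDF_PROPERTY: &str = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property";

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = BufReader::new(File::open("schema.ttl")?);
    // Substep 1: parse the ontology. Substep 2: emit a JSON-AD Property
    // for every subject that is typed as rdf:Property.
    TurtleParser::new(reader, None).parse_all(&mut |t| {
        if let (Subject::NamedNode(s), Term::NamedNode(o)) = (t.subject, t.object) {
            if t.predicate.iri == RDF_TYPE && o.iri == RDF_PROPERTY {
                // Shortname: last IRI segment, lowercased (an assumption).
                let shortname = s
                    .iri
                    .rsplit(|c| c == '/' || c == '#')
                    .next()
                    .unwrap_or("unnamed")
                    .to_lowercase();
                let property = json!({
                    "@id": s.iri,
                    "https://atomicdata.dev/properties/isA": ["https://atomicdata.dev/classes/Property"],
                    "https://atomicdata.dev/properties/shortname": shortname,
                    "https://atomicdata.dev/properties/description": "Imported from schema.org",
                    // Fall back to string until a real datatype mapping exists.
                    "https://atomicdata.dev/properties/datatype": "https://atomicdata.dev/datatypes/string"
                });
                println!("{property}");
            }
        }
        Ok(()) as Result<(), TurtleError>
    })?;
    Ok(())
}
```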
hoijui commented 1 year ago

I did some initial research for lists of ontologies, and there seem to be some good options! :-)

  1. https://archivo.dbpedia.org/list This is basically perfect. The only thing left to do is choose which ones we want, or define a filter that does that for us. I am pretty sure we would not want an ontology that is not available there, as it scans and crawls the web every 8h, and thus catches most of what is out there.
  2. https://lov.linkeddata.es/dataset/lov/sparql Similarly useful to 1., but older, with fewer vocabularies/ontologies, no filtering options, less other meta-data, and no info about how often it updates. Allows fetching the data as a dump, a single *.tar.gz file.
  3. https://github.com/zazuko/rdf-vocabularies/tree/master/ontologies A set of data-files of the most common ontologies, created for a Node.js library, and regularly updated. This would be even easier to get and use. It is explicitly made so it would be easy for other projects to use that data as well.
  4. https://github.com/ruby-rdf/rdf-vocab/blob/develop/lib/rdf/vocab.rb Similar to 3., just for a Ruby library instead, and a bit harder to use/extract the data from.
jonassmedegaard commented 1 year ago

Perl-based RDF libraries commonly use one of these methods for stable references to ontologies: a) a dump of http://prefix.cc/popular/all at a fixed point in time, or b) a manually curated subset of a).

Since it sounds like you want to restrict by certain qualities (e.g. "OWL-based", or "reasonably popular"), I suggest that you do b) - and then, if it turns out that prefix.cc does not cover some ontologies you fancy, there is nothing stopping you from changing the rules of your curating to include non-prefix.cc ontologies (but you might then also consider simply registering your pet ontologies at prefix.cc and bumping your fixed time to a moment after your registration).
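A tiny sketch of method b) in Rust; the three dump entries below merely stand in for a real pinned snapshot of prefix.cc:

```rust
use std::collections::HashMap;

fn main() {
    // A pinned dump of http://prefix.cc/popular/all at a fixed point in time
    // (method a); these three entries stand in for the full snapshot.
    let dump: HashMap<&str, &str> = HashMap::from([
        ("schema", "https://schema.org/"),
        ("foaf", "http://xmlns.com/foaf/0.1/"),
        ("dcterms", "http://purl.org/dc/terms/"),
    ]);
    // Method b): a manually curated subset of that dump.
    let curated = ["schema", "foaf"];
    for prefix in curated {
        if let Some(namespace) = dump.get(prefix) {
            println!("{prefix}: {namespace}");
        }
    }
}
```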

jonassmedegaard commented 1 year ago

All of prefix.cc is currently ~3000 ontologies.

The perl module RDF::NS::Curated provides a curated set of ~65 ontologies (as I recall it is simply "the most popular at prefix.cc at the time" but if curious you/I can simply ask Kjetil).

hoijui commented 1 year ago

Ohhh perfect, thank you Jonas! :-) Sounds like I'll try that ... maybe those same ~65 then.

hoijui commented 9 months ago

I applied for funding to do this on my own about a year ago with NLnet, and got refused sometime in Q1 this year. Since then, no new attempts from my side.

I still very much would like to have this mapping capability. Right now, I am starting a new project with Lynn from VF, creating an ontology for OSH. I would love to do it in AD instead of RDF, but can't because this is missing.

hoijui commented 9 months ago

@joepio Do you have an idea for how to write an RDF ontology so that it would be easily mappable to AD, in a future-proof way, once this mapping is implemented?

Thinking of: things to avoid using in RDF, and maybe extra properties to favor, or to be used on all classes/properties in the ontology. I guess the main thing would be the validation/data-type part, right? Asking, of course, so we can take it into consideration now that we are starting to write our ontology. (We already did start, but it is still very small and completely mold-able.) It would also be good to know so that we eventually have a few such AD-mapping-ready RDF ontologies, ready to test a mapping implementation once development starts on it. In the best case, such extra properties for AD would even be usable/make sense disregarding AD, but that is less important.

joepio commented 9 months ago

Good question @hoijui!

I think most RDF ontologies / SHACL shapes should be mappable to Atomic Data.

Some things to keep in mind:

hoijui commented 9 months ago

What about data validation... would AD data validation map to SHACL, or to an RDF property specially made for this (e.g. admapping:datatype)?

(something went wrong with the link in your comment)
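
To make the admapping:datatype idea concrete, a purely hypothetical Turtle sketch: the admapping and osh namespaces are invented, and only the Atomic Data datatype URL is real.

```turtle
@prefix rdf:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:      <http://www.w3.org/2000/01/rdf-schema#> .
@prefix admapping: <https://example.com/admapping#> .
@prefix osh:       <https://example.com/osh#> .

osh:versionOf a rdf:Property ;
    rdfs:label "version of" ;
    # Explicit Atomic Data datatype, so a mapper would not need a manual entry:
    admapping:datatype <https://atomicdata.dev/datatypes/atomicURL> .
```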

hoijui commented 9 months ago

This gives some hints, I guess: https://docs.atomicdata.dev/interoperability/rdf.html?highlight=language%20tags#convert-atomic-data-to-rdf

So language tags work differently... is that really an issue when they are used, when there is software (the code doing the mapping) in between? I am not talking about making RDF valid AD, just about it having the necessary data to map it to AD.

joepio commented 9 months ago

Fixed the link!

Yeah gotcha.

I think we can probably map pretty much everything at some point, like, we can always fall back to the 'string' datatype.

hoijui commented 9 months ago

ok.. I guess.. I'll not do anything special for now then. thanks!

joepio commented 9 months ago

gotcha :)