eea / eea.elasticsearch.river.rdf

EEA ElasticSearch RDF River Plugin
GNU General Public License v2.0

JSONLD support #4

Open rancas opened 10 years ago

rancas commented 10 years ago

Hello, I'd like to thank you for the great job you are doing on this project. It would be very interesting if the Elasticsearch index could be populated with JSONLD (http://json-ld.org/). Perhaps a Java version of the JSONLD library (https://github.com/jsonld-java/jsonld-java) could be used to serialize the RDF triples automatically, and an external context could be set up in the JSONLD.

demarant commented 10 years ago

@rancas Thank you for the suggestion.

In fact, in my initial proposal for this product I had mentioned that we should check the JSON-LD format. See the original idea ticket http://taskman.eionet.europa.eu/issues/15265, where I intentionally wrote "...we could check if JSON-LD works fine with ES..."

We did not have the time to look into JSON-LD for the first release of the RDF river plugin. We are quite new to ElasticSearch as well, so there are a lot of new things to learn. Unfortunately, SPARQL endpoints do not return JSON-LD directly, so the transformation has to be done on the river plugin side, which will add to the performance cost and complexity of the plugin. We also do not know how well JSON-LD works in ES and what restrictions it will impose on indexing. So there are many things to evaluate and test before going for JSON-LD.

At the moment I have not found a real use case for using JSON-LD in ES ... maybe one will come up in the future.

In any case, we are quite open to external contributions. So you are welcome to contribute JSON-LD import to this plugin :)

rancas commented 10 years ago

@demarant we are experimenting with the use of JSONLD in Elasticsearch and, if it is useful, I can report the results of our experience to you. Right now there seems to be no problem at all using JSONLD, because it is just standard JSON with a "@context" key where you can put your semantic mappings. Currently we are going the other way around: starting with JSONLD in Elasticsearch and then going to the triple store with a simple serialization. Below you can see some example code in Python using the RDFLib JSONLD implementation (https://github.com/RDFLib/rdflib-jsonld):

from rdflib import Graph
# With the rdflib-jsonld plugin installed we can easily transform JSONLD into N3
jld = """
{
  "@context": {
    "ical": "http://www.w3.org/2002/12/cal/ical#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "ical:dtstart": {
      "@type": "xsd:dateTime"
    }
  },
  "ical:summary": "Lady Gaga Concert",
  "ical:location": "New Orleans Arena, New Orleans, Louisiana, USA",
  "ical:dtstart": "2011-04-09T20:00Z"
}
"""
# Parse the JSONLD string into an RDF graph
g = Graph().parse(data=jld, format='json-ld')
print("\n\n\n")
print("######################### N3 #########")
# Serialize the same graph as N3
print(g.serialize(format='n3', indent=4))

This also works in reverse: when you have a graph and want to serialize it as JSONLD, you can use something like:

print(g.serialize(format='json-ld', indent=4))

I already took a look at the private method addModelToES of class Harvester, where the transformation of the graph into JSON takes place (if I understand correctly). It seems to me that an analogous method to create JSONLD would, in Python terms, be something like model.serialize(format='json-ld', indent=4), followed by the ES indexing stuff.
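To make the idea a bit more concrete without touching the plugin's Java code, here is a minimal, library-free Python sketch of what a JSONLD-flavoured counterpart of addModelToES might do: group triples by subject and emit one JSON document per resource, with compact keys and a shared "@context". The function names, the example.org subject IRI, and the prefix table are all hypothetical, and the hand-rolled compaction is only an illustration; a real implementation would rely on a JSONLD library.

```python
import json
from collections import defaultdict

# Hypothetical prefix table; it is copied into each document as its "@context".
CONTEXT = {
    "ical": "http://www.w3.org/2002/12/cal/ical#",
}

def compact(iri, context):
    """Shorten an IRI to prefix:local form when a known namespace matches."""
    for prefix, namespace in context.items():
        if iri.startswith(namespace):
            return prefix + ":" + iri[len(namespace):]
    return iri

def triples_to_jsonld_docs(triples, context):
    """Group (subject, predicate, literal) triples into one dict per subject."""
    grouped = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        grouped[subject][compact(predicate, context)].append(obj)

    docs = []
    for subject, props in grouped.items():
        doc = {"@context": dict(context), "@id": subject}
        for key, values in props.items():
            # Single values stay scalar, repeated predicates become lists.
            doc[key] = values[0] if len(values) == 1 else values
        docs.append(doc)
    return docs

# Illustrative triples only; the subject IRI is made up.
triples = [
    ("http://example.org/event/1",
     "http://www.w3.org/2002/12/cal/ical#summary", "Lady Gaga Concert"),
    ("http://example.org/event/1",
     "http://www.w3.org/2002/12/cal/ical#dtstart", "2011-04-09T20:00Z"),
]

docs = triples_to_jsonld_docs(triples, CONTEXT)
print(json.dumps(docs[0], indent=4))
```

Each resulting dict is plain JSON, so it could be handed to whatever ES indexing call the plugin already uses.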

Hope it helps. Perhaps I can give you a more concrete contribution next month :-)

demarant commented 10 years ago

@rancas thank you for the further information and good news :) I will also have a look in due time with the main developer of this plugin.

Some extra thoughts: I like the shorter form of JSONLD, where @context specifies the prefixes and compact IRIs. However, that could create some issues for the ES mapping configuration of each field. In theory two fields could look the same to ES but have different contexts (different types). For example, there could be one document with namespace1:myfield and another document with namespace1:myfield which have different @context values. Assume one is a Date type and the other is a string type. ES does not know that they have different contexts associated (ES does not understand the semantics of JSONLD, of course), so the two conceptually different fields will be treated the same, with the same mapping, because they are in the same index and have the same name. So basically you can't give different ES index mappings to those two fields, because they look identical to ES, and you can't freely mix documents in the same ES index: you must make sure all the JSONLD documents in the same index have no clashes and use consistent schema prefixes (dct, ical, foaf etc.).

Maybe the chance of this kind of "prefix clash" is not so high if one makes sure all namespace prefixes used are unique and consistent within the same index. In that case it will work fine in ES, I guess. Moreover, JSONLD also allows you to use the full IRI for each field instead of the namespace/shorthand form; this is still valid JSON and reduces the issues stated above.

We are very grateful for your tests and analysis and look forward to a possible contribution to this plugin :-)
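The clash and the full-IRI workaround can be illustrated with a small Python sketch. It assumes two made-up documents whose "ns1" prefix points to different example.org vocabularies, and the helper expand_keys is hypothetical: it only handles simple string prefix mappings on top-level keys and is not the full JSONLD expansion algorithm.

```python
def expand_keys(doc):
    """Replace prefix:local keys with full IRIs using the doc's own @context.

    Simplified on purpose: only string prefix mappings and top-level keys
    are handled; "@"-keywords such as "@context" are dropped.
    """
    context = doc.get("@context", {})
    prefixes = {k: v for k, v in context.items() if isinstance(v, str)}
    expanded = {}
    for key, value in doc.items():
        if key.startswith("@"):
            continue
        prefix, sep, local = key.partition(":")
        if sep and prefix in prefixes:
            key = prefixes[prefix] + local
        expanded[key] = value
    return expanded

# Two made-up documents: same compact field name, different meaning.
doc_a = {
    "@context": {"ns1": "http://example.org/vocab-a#"},
    "ns1:myfield": "2011-04-09T20:00Z",   # intended as a date
}
doc_b = {
    "@context": {"ns1": "http://example.org/vocab-b#"},
    "ns1:myfield": "just a string",       # intended as plain text
}

# As stored, both documents expose the same field name to the index.
shared_before = (set(doc_a) & set(doc_b)) - {"@context"}

# After expansion the field names differ, so they could get separate mappings.
expanded_a = expand_keys(doc_a)
expanded_b = expand_keys(doc_b)
shared_after = set(expanded_a) & set(expanded_b)

print(sorted(shared_before))
print(sorted(expanded_a), sorted(expanded_b))
print(sorted(shared_after))
```

One caveat worth testing before relying on this: full IRIs contain dots, and how ES handles dots in field names has differed between versions.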

iulia-pasov commented 10 years ago

@rancas although we have considered using the JSON-LD format, we did not find any advantages over classic JSON. ES does not require semantically relevant information, only resources that can be indexed. However, since the RDF graph is already created, a JSON-LD export from that form is possible, but the features would need to be adjusted, since they are currently processed on a Map. Since at this point there are no performance advantages to using JSON-LD, we have decided to wait for it to be available in Jena. JSON-LD will therefore be considered in the future.

However, we do appreciate any advice or contributions.

stain commented 9 years ago

FYI: Jena now has JSON-LD support built in.