Open dhaemer opened 4 months ago
Currently we can use two components to translate JSON to LD: the JsonToLdAdapter and the RmlAdapter.
The JsonToLdAdapter checks whether the JSON is an array or an object. If it is an array, it is automatically split and each item is pushed individually into the pipeline. If it is an object, it is placed in the pipeline as a whole, because we obviously do not know where the list of items is located or whether the other information in the message is needed further down the pipeline. In the latter case, as a first step you can use a SparqlConstructTransformer to put each item in its own graph so that the rest of the pipeline processes each item individually. This is useful for (very) large JSON messages because memory usage stays under control and the rest of the pipeline works with smaller triple sets internally.
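The array-vs-object decision described above can be sketched as follows. This is illustrative only: the real JsonToLdAdapter is a Java component, and the function name here is hypothetical.

```python
import json

def push_items(message: str) -> list[str]:
    """Sketch of the JsonToLdAdapter split behaviour (hypothetical name).

    A JSON array is split so that each item enters the pipeline on its
    own; a JSON object is forwarded whole, since we cannot know where
    the item collection sits or whether the rest of the message matters.
    """
    data = json.loads(message)
    if isinstance(data, list):
        # Array: push each item individually into the pipeline.
        return [json.dumps(item) for item in data]
    # Object: pass through as a single message.
    return [json.dumps(data)]
```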
The RmlAdapter can also process JSON messages, in addition to CSV messages. In an RML mapping you need to provide a JSON path indicating where to find the items. For RML we use the Carml library, which gives us one large triple set. We have already experimented a lot but cannot get it to return individual items (it may have to do with it using RDF4J internally). Anyway, here too the solution is to use a SparqlConstructTransformer and put each item in its own graph so that the rest of the pipeline works with individual items. The RmlAdapter can handle (large) CSV messages in the same way: we get one large triple set that we split manually using graphs.
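The effect of the SparqlConstructTransformer workaround — turning one large triple set into per-item graphs — can be illustrated with a minimal sketch. In a real pipeline this is done with a SPARQL CONSTRUCT query, not Python; here triples are plain (subject, predicate, object) tuples, and grouping by subject is a simplification (a real query might also pull in blank-node closures).

```python
from collections import defaultdict

def split_by_subject(triples):
    """Group a flat triple set into per-subject 'graphs' so downstream
    steps can process each item independently.

    Illustrative only: approximates what the SparqlConstructTransformer
    achieves with named graphs.
    """
    graphs = defaultdict(list)
    for s, p, o in triples:
        graphs[s].append((s, p, o))
    return dict(graphs)
```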
Unfortunately, passing the message as a whole to either the JsonToLdAdapter or the RmlAdapter consumes a lot of memory and hurts performance, especially with very large messages (e.g. a CSV with 5 million lines, or a JSON message containing a few tens of thousands of items).
The solution IMHO is to provide the following:
JsonToLdAdapter: add an option (a JSON path) that selects a JSON array (the item collection) from a JSON object and then pushes each individual item into the pipeline, similar to what is already done for JSON messages containing an array. Note: typically, anything surrounding the item collection (i.e. everything except the item collection) is not needed. However, we must document clearly that if you use the JSON path to select the item collection, everything else is dropped.
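A minimal sketch of the proposed option, assuming a simple slash-separated path for illustration (the real option would presumably accept a proper JSON path expression, and the function name is hypothetical):

```python
import json

def items_at(message: str, path: str) -> list[str]:
    """Select an item collection from a JSON object via a simple
    slash-separated path (e.g. "data/items") and emit each item
    individually. Everything outside the selected array is dropped,
    as the proposal notes.
    """
    node = json.loads(message)
    for key in path.strip("/").split("/"):
        node = node[key]
    if not isinstance(node, list):
        raise ValueError(f"path {path!r} does not select an array")
    return [json.dumps(item) for item in node]
```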
RmlAdapter: RML can handle multiple formats, but typically we only use CSV, JSON (when the JsonToLdAdapter cannot be used) and, to a lesser extent, XML. Based on the mime type of the received message, we should preprocess the message to split it into pieces and hand those to the Carml library. Obviously there should be an implementation per message type (CSV, JSON, XML).
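For the CSV case, the preprocessing step could look roughly like this sketch: split a large CSV message into smaller CSV messages that each keep the header row, so the mapping library only ever sees small inputs. The function names and the chunk size are hypothetical; the real implementation would live in Java alongside Carml.

```python
import csv
import io

def split_csv(message: str, chunk_size: int = 2) -> list[str]:
    """Split one large CSV message into smaller CSV messages,
    repeating the header in each chunk (illustrative sketch;
    chunk_size would be configurable)."""
    reader = csv.reader(io.StringIO(message))
    header = next(reader)
    chunks, buf = [], []
    for row in reader:
        buf.append(row)
        if len(buf) == chunk_size:
            chunks.append(_emit(header, buf))
            buf = []
    if buf:
        chunks.append(_emit(header, buf))
    return chunks

def _emit(header, rows):
    # Serialize a header plus a batch of rows back into CSV text.
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()
```

A JSON or XML preprocessor would follow the same pattern: detect the mime type, split the payload into item-sized pieces, and feed each piece to the mapper separately.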
@dhaemer if this answer is sufficient, please close this ticket
@Yalz It is unclear to me whether the solutions suggested by @rorlic will be implemented by the current team in the coming sprints. From a call we had, it sounded like there was a possibility of this happening. If that is the case, they will certainly be sufficient.
To import datasets from one of our APIs (JSON data), it would be nice to be able to read the file and import it without having to write extra SPARQL CONSTRUCT queries to split up the data.
If we could pass a JSON path and have the data read from that path, it could save a lot of time when implementing the LDES server. I would guess more people will need this.