Informatievlaanderen / VSDS-Linked-Data-Interactions

https://informatievlaanderen.github.io/VSDS-Linked-Data-Interactions/
European Union Public License 1.2

Can we get more options to input larger data sets from JSON/CSV/... sources #588

Open dhaemer opened 4 months ago

dhaemer commented 4 months ago

To import datasets from one of our APIs (JSON data), it would be nice to be able to read the file and import it without having to write extra SPARQL CONSTRUCT queries to split up the data.

If we could pass a JSON path and the data at that path were used, it could save a lot of time when implementing the LDES server. I would guess more people will have a need for this.
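To make the request concrete, a minimal sketch of the idea, in Python rather than the project's Java codebase: the user supplies a path (here a hypothetical dot-separated form such as "data.items"), the component descends to that path in the parsed JSON, and each item found there is emitted individually instead of the document being passed through as one message.

```python
import json

def split_at_path(payload: str, path: str):
    """Parse a JSON document and yield each item found at the given
    dot-separated path, so downstream steps can process records one
    by one instead of as a single large message."""
    node = json.loads(payload)
    for key in path.split("."):
        node = node[key]  # descend into the nested object
    if not isinstance(node, list):
        raise ValueError(f"expected an array at '{path}'")
    yield from node

# Example: an API response that wraps its records in "data.items"
doc = '{"data": {"items": [{"id": 1}, {"id": 2}, {"id": 3}]}}'
records = list(split_at_path(doc, "data.items"))  # three separate records
```

The path syntax and function name are assumptions for illustration only; an actual implementation would likely accept standard JSONPath expressions.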

rorlic commented 4 months ago

Currently we can use two components to translate JSON to LD: the JsonToLdAdapter and the RmlAdapter.

The JsonToLdAdapter checks whether the JSON is an array or an object. If it is an array, it is automatically split and each item is pushed into the pipeline individually. If it is an object, it is placed in the pipeline as a whole, because we obviously cannot know where the list of items is located or whether the other information in the message is needed further down the pipeline. In the latter case, you can use a SparqlConstructTransformer as a first step to put each item in its own graph, so that the rest of the pipeline processes each item individually. This is useful for (very) large JSON messages, because memory usage stays better under control and the rest of the pipeline can work with smaller triple sets internally.
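The array-versus-object behaviour described above can be sketched as follows (a Python illustration of the logic, not the adapter's actual Java code): a top-level array is split into one message per element, while a top-level object is passed through whole.

```python
import json

def json_to_members(payload: str):
    """Mimic the splitting behaviour of a JSON-to-LD adapter:
    a top-level array becomes one pipeline message per element,
    while a top-level object goes through as a single message,
    since we cannot know where its item list lives."""
    node = json.loads(payload)
    if isinstance(node, list):
        return node    # split: each element is processed individually
    return [node]      # no split: the whole object is one message

# An array is split into two messages; a wrapping object is not.
array_msgs = json_to_members('[{"id": 1}, {"id": 2}]')
object_msgs = json_to_members('{"items": [{"id": 1}, {"id": 2}]}')
```

In the second case the splitting has to happen later in the pipeline, which is exactly where the SparqlConstructTransformer graph trick comes in.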

The RmlAdapter can also process JSON messages, in addition to CSV messages. In an RML mapping you need to provide a JSON path indicating where to find the items. For RML we use the Carml library, which gives us one large triple set. We have already experimented a lot but cannot get it to return individual items (it may have to do with it using RDF4J internally). Anyway, here too the solution is to use a SparqlConstructTransformer and put each item in its own graph so that the rest of the pipeline works with individual items. The RmlAdapter handles (large) CSV messages in the same way: we get one large triple set that we split manually using graphs.

Unfortunately, passing the message as a whole to either the JsonToLdAdapter or the RmlAdapter consumes a lot of memory and hurts performance, especially with very large messages (e.g. a CSV with 5 million lines, a JSON document containing tens of thousands of items, etc.).
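The memory problem comes from materialising the whole message before splitting it. A streaming alternative, sketched here in Python with the standard csv module purely for illustration, reads one row at a time so memory usage stays flat regardless of how many lines the source has:

```python
import csv
import io

def stream_rows(csv_text: str):
    """Yield CSV rows one at a time instead of loading the whole
    file, keeping memory usage constant even for millions of lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield row  # each row can be mapped and pushed downstream individually

data = "id,name\n1,a\n2,b\n3,c\n"
first = next(stream_rows(data))  # only one row is held in memory at a time
```

A comparable record-at-a-time approach on the adapter side would avoid building the single large triple set that currently has to be split afterwards.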

The solution IMHO is to provide the following:

Yalz commented 3 months ago

@dhaemer if this answer is sufficient, please close this ticket

dhaemer commented 3 months ago

@Yalz It is unclear to me whether the solutions suggested by @rorlic will be implemented by the current team in the coming sprints. From a call we had, it sounded like there was a possibility of this happening. If that is the case, they will certainly be sufficient.