RMLio / rmlmapper-java

The RMLMapper executes RML rules to generate high quality Linked Data from multiple originally (semi-)structured data sources
http://rml.io
MIT License

Change default output store #108

Closed marioscrock closed 3 years ago

marioscrock commented 3 years ago

The current implementation uses the SimpleQuadStore as the default output store if no output format option is provided (when running RMLMapper via the CLI, link to code). This leads to a very inefficient removeDuplicates procedure if the corresponding option is enabled (link to code), which can be avoided by selecting a different output format. I think performance could easily be improved by using the RDF4JStore as the default when the duplicate removal option is enabled (no separate duplicate removal step is needed, and an N-Quads writer is available in the RDF4J library). Is there any other reason why SimpleQuadStore is used by default?
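To illustrate the cost difference described above, here is a minimal sketch. It assumes a store that keeps quads in a plain list and removes duplicates in a separate pass, compared with a set-backed store that rejects duplicates on insert. The `Quad` class and both methods are hypothetical stand-ins, not the actual SimpleQuadStore or RDF4JStore code.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

class DedupSketch {
    // Hypothetical quad type for illustration; not the RMLMapper Quad class.
    static final class Quad {
        final String s, p, o, g;

        Quad(String s, String p, String o, String g) {
            this.s = s; this.p = p; this.o = o; this.g = g;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Quad)) return false;
            Quad q = (Quad) other;
            return Objects.equals(s, q.s) && Objects.equals(p, q.p)
                && Objects.equals(o, q.o) && Objects.equals(g, q.g);
        }

        @Override
        public int hashCode() {
            return Objects.hash(s, p, o, g);
        }
    }

    // List-backed approach: adding is cheap, but duplicate removal needs a
    // linear scan per quad, so the whole pass is quadratic in the input size.
    static List<Quad> removeDuplicatesFromList(List<Quad> quads) {
        List<Quad> unique = new ArrayList<>();
        for (Quad q : quads) {
            if (!unique.contains(q)) {
                unique.add(q);
            }
        }
        return unique;
    }

    // Set-backed approach: duplicates are dropped on insert via hashing,
    // so no separate removal pass is needed.
    static Set<Quad> collectIntoSet(List<Quad> quads) {
        return new LinkedHashSet<>(quads);
    }

    public static void main(String[] args) {
        List<Quad> quads = new ArrayList<>();
        quads.add(new Quad("ex:a", "ex:p", "ex:b", null));
        quads.add(new Quad("ex:a", "ex:p", "ex:b", null)); // duplicate
        quads.add(new Quad("ex:a", "ex:p", "ex:c", null));
        System.out.println(removeDuplicatesFromList(quads).size());
        System.out.println(collectIntoSet(quads).size());
    }
}
```

Both approaches yield the same unique quads; the difference is only in how the work scales with the number of generated quads.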

DylanVanAssche commented 3 years ago

Hi @marioscrock !

Thanks for using RML! In an upcoming release, the changes for Logical Target [1] will be included. We will probably switch to an RDF4JStore by default there.

The reason for using SimpleQuadStore is that adding triples to it is faster than adding them to an RDF4JStore. However, this has drawbacks, as you mentioned.

Once released, we will notify you in this issue and you can try out the changes.

[1] https://dylanvanassche.be/publications/#icwe2021

marioscrock commented 3 years ago

Hi @DylanVanAssche ! Thank you for your answer and for referencing Chimera in the publication, looking forward to the next releases! In the meantime, I would suggest adding at least a brief disclaimer in the Remarks section of the README. A user enabling the --duplicates option can experience very different performance depending on the selected serialization format, and the reason is hard to grasp without looking at the code.

cc: @dachafra

DylanVanAssche commented 3 years ago

@marioscrock

Thanks for your suggestion!

a brief disclaimer in the Remarks section of the README.

Something like this?

Performance depends on the serialization format (--serialization <format>) and on whether duplicate removal is enabled (--duplicates).

DylanVanAssche commented 3 years ago

@marioscrock This is addressed in v4.10.0, please let us know if this issue is fixed with the new release :)

marioscrock commented 3 years ago

@DylanVanAssche thank you for the update! I see that the code selecting the output store based on the format is still present in the Main class (here), but if I understand correctly, RDF4J is used as the default output store in the v5 API anyway. If this is the case, and given that you also added the remark to the README, I think we can consider the issue closed :)

DylanVanAssche commented 3 years ago

@marioscrock The CLI uses the SimpleQuadStore in some cases, that's true. The RDF4JStore is now used as the default when no store is provided. I guess you use the library version, right? You will have the RDF4JStore then.

marioscrock commented 3 years ago

My suggestion would be to change the default CLI behaviour when the --duplicates option is provided. In my experience, users running the RMLMapper jar via the CLI may conclude that the "performance issue" is caused by duplicate removal itself, while it can be solved by simply adding a --serialization option (which makes the CLI use the RDF4JStore). This is not very "intuitive", so I think it is useful for users to have a remark in the README until the behaviour is changed ;)
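The suggested change can be sketched as a small selection rule. This is a simplification based only on what this thread describes: it assumes the CLI picks SimpleQuadStore when no serialization is given and RDF4JStore otherwise. The enum and method names are hypothetical, and this is not the code of the Main class.

```java
class StoreChoiceSketch {
    // Hypothetical labels for the two store types discussed in this issue.
    enum Store { SIMPLE_QUAD_STORE, RDF4J_STORE }

    // Behaviour as described in this thread (simplified assumption):
    // no --serialization option means the SimpleQuadStore is used.
    static Store describedBehaviour(String serialization) {
        return serialization == null ? Store.SIMPLE_QUAD_STORE : Store.RDF4J_STORE;
    }

    // Suggested behaviour: when --duplicates is set, prefer the store that
    // needs no separate duplicate removal pass, whatever the serialization.
    static Store suggestedBehaviour(String serialization, boolean removeDuplicates) {
        if (removeDuplicates) {
            return Store.RDF4J_STORE;
        }
        return describedBehaviour(serialization);
    }

    public static void main(String[] args) {
        // --duplicates without --serialization: the case this issue is about.
        System.out.println(describedBehaviour(null));
        System.out.println(suggestedBehaviour(null, true));
    }
}
```

With such a rule, a user passing only --duplicates would no longer need to know that adding --serialization changes the store.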