The current method of harvesting dump files parses the incoming dump in a streaming fashion, emitting statements as it finds them, and for each one generates a SPARQL DELETE query to remove any statements already in the database about that subject. This is horribly inefficient for two reasons:
1. Many of the incoming statements are about the same subject, so far fewer queries would be needed to do this work.
2. SPARQL DELETE can remove statements about multiple subjects in a single query, so some sort of batch delete could work here (sketched below).
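For illustration only, here is roughly the kind of per-subject delete the harvester presumably emits today, next to a batched equivalent that covers several subjects in one update. The subject URIs are made up:

    # one delete per subject, as the current streaming approach does (hypothetical URI)
    DELETE WHERE { <http://example.org/resource/1> ?p ?o . }

    # a batched equivalent: a single update removing statements about several subjects
    DELETE { ?s ?p ?o }
    WHERE {
      VALUES ?s { <http://example.org/resource/1>
                  <http://example.org/resource/2>
                  <http://example.org/resource/3> }
      ?s ?p ?o .
    }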
The consequence of this inefficiency is processing time. Importing a large graph (> 100k triples) takes tens of minutes and possibly hours (I've never let it finish). Since we'd like the updating process to happen in a timely fashion, this is too slow.
Work on modifying how importing is done so it's faster.
Right now, I could do this by putting the graph to be imported into the database in a temporary repo/context, running a SPARQL query using DISTINCT to grab the unique subjects, and then generating one big SPARQL query to delete all statements about those subjects.
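A rough sketch of that two-step approach, assuming the imported data sits in a temporary named graph (the graph and subject URIs below are made up):

    # step 1: pull the distinct subjects out of the temporary context
    SELECT DISTINCT ?s
    WHERE { GRAPH <urn:temp:import> { ?s ?p ?o } }

    # step 2: feed those subjects into one big delete against the existing data
    DELETE { ?s ?p ?o }
    WHERE {
      VALUES ?s { <http://example.org/resource/1>
                  <http://example.org/resource/2> }   # ... subjects from step 1
      ?s ?p ?o .
    }

Depending on the store, the two steps could probably be collapsed into a single DELETE whose WHERE clause joins against the temporary graph directly, which would avoid round-tripping the subject list through the client.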