Open abrokenjester opened 3 years ago
@abrokenjester Is this still on your radar? I am seeing some pretty poor performance when loading very large files with `conn.add(File, Resource...)`, and a lot of the time is spent in the `MemoryOverflowModel`:
You should be able to bypass it by starting a transaction with the `NONE` isolation level, as long as you don't need transaction isolation.
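A minimal sketch of that workaround (hypothetical class and paths; this assumes a recent RDF4J where `IsolationLevels` lives in `org.eclipse.rdf4j.common.transaction` — older releases have it in `org.eclipse.rdf4j.IsolationLevels`):

```java
import java.io.File;

import org.eclipse.rdf4j.common.transaction.IsolationLevels;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.sail.nativerdf.NativeStore;

public class BulkLoad {
	// Bulk-load a Turtle file into a native store with transaction isolation
	// disabled, so the upload skips the isolation bookkeeping that is backed
	// by MemoryOverflowModel.
	public static void load(File dataDir, File dataFile) throws Exception {
		Repository repo = new SailRepository(new NativeStore(dataDir));
		try (RepositoryConnection conn = repo.getConnection()) {
			// NONE: no isolation guarantees, fine for a single bulk writer
			conn.begin(IsolationLevels.NONE);
			conn.add(dataFile, dataFile.toURI().toString(), RDFFormat.TURTLE);
			conn.commit();
		} finally {
			repo.shutDown();
		}
	}
}
```

Note that with `NONE` concurrent readers may observe the partially loaded data, which is usually acceptable for a one-off bulk import.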
It's kind of dropped off the radar a bit, to be honest. @hmottestad's suggestion to disable txn isolation for bulk uploads is a good workaround, but for fixing the underlying issue we could use some fresh perspectives or an extra pair of hands.
I'd be interested in potentially contributing a bit more to the project, but would definitely need to familiarize myself more with the codebase.
EDIT: removed message and made bug
@abrokenjester As far as I can see, `contains(s, p, o, c...)` is only used for tracking the size of the `SailSourceModel` and may be omitted. The challenge is then that `add(...)` and `remove(...)` return a boolean value to indicate whether a statement was actually added or removed. But in the specific case investigated here, the return value is not used by `Changeset`: neither here https://github.com/eclipse/rdf4j/blob/39a07105167201124c75bb9decc9749a143cc7e7/core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/Changeset.java#L418 nor here https://github.com/eclipse/rdf4j/blob/39a07105167201124c75bb9decc9749a143cc7e7/core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/Changeset.java#L449
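To make the cost concrete, here is a self-contained sketch (plain Java with a hypothetical `SizeTrackingModel` class, not the actual `SailSourceModel` code) of the pattern described above: because the backing sink gives no feedback on whether a statement was new, `add(...)` has to call `contains(...)` just to keep the size counter and its boolean return value correct.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the size-tracking pattern; the real
// SailSourceModel delegates to a SailSource/SailSink rather than a Set.
class SizeTrackingModel {
	private final Set<String> sink = new HashSet<>(); // stands in for the disk-backed sink
	private long size = 0;

	boolean add(String statement) {
		// contains() is needed both for the size counter and for the
		// boolean contract of add(...). On a disk-backed sink this
		// lookup is the expensive part.
		boolean isNew = !sink.contains(statement);
		sink.add(statement);
		if (isNew) {
			size++;
		}
		return isNew;
	}

	boolean remove(String statement) {
		boolean existed = sink.contains(statement);
		sink.remove(statement);
		if (existed) {
			size--;
		}
		return existed;
	}

	long size() {
		return size;
	}
}
```

If callers such as `Changeset` ignore the boolean return value, both the `contains(...)` lookup and the size counter could be skipped on those code paths.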
see also https://github.com/eclipse/rdf4j/pull/4557
One of the main reasons for the poor performance is that `MemoryOverflowModel` switches to a disk-based model too late.
In the `MemoryOverflowModel` used by the native store to handle data-upload transactions, a big performance penalty has been observed whenever the model triggers syncing data to disk. On the mailing list, Arjohn Kampman provided this analysis:
We should look into fixing this. Possible fixes include somehow eliminating one of the mentioned txn-update calls. A more drastic fix might be to use a disk sync that is not transactional: we currently use a separate native store as the "backup", which has overhead. Something like MapDB or Ehcache for persistence might give better performance.
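As an illustration of the non-transactional disk-sync idea, a minimal sketch using MapDB (hypothetical `MapDbOverflow` class, assumes MapDB 3 on the classpath; a real fix would need proper `Statement` serialization rather than the N-Triples strings used here):

```java
import java.util.Set;

import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;

public class MapDbOverflow {
	// Spill the given statements into a temp-file-backed MapDB set (no
	// transactions, unlike a second NativeStore) and return how many
	// distinct statements ended up on disk.
	public static long spill(Iterable<String> statements) {
		DB db = DBMaker.tempFileDB()
				.fileMmapEnableIfSupported() // use mmap where the platform supports it
				.make();
		try {
			Set<String> overflow = db.hashSet("overflow", Serializer.STRING).create();
			for (String st : statements) {
				overflow.add(st);
			}
			return overflow.size();
		} finally {
			db.close();
		}
	}
}
```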