MemoryOverflowModel transactional disk sync performs very poorly

abrokenjester commented 3 years ago

In the MemoryOverflowModel used by the native store to handle data upload transactions, a big performance penalty has been observed whenever the model triggers syncing data to disk. On the mailinglist, Arjohn Kampman provided this analysis:

The reason that syncing the data to disk when the overflow is triggered takes so long is also related to a call to dataset(). When overflowToDisk() calls disk.addAll(memory), this triggers a call to SailSourceModel.add(s,p,o,c...) for each statement. This method then calls both contain(s,p,o,c...) and sink().approve(s,p,o.c) for for each statement. The latter call starts a new transaction and updates the txn-status file, but the contains() call then commits the transaction for the previous statement via a call to dataset(), again updating the txn-status file. So for every cached statement, rdf4j does two I/O calls. On a spinning disk with a an average write time of 10 ms, this limits the overflow process to at most 50 statements per second.

We should look into fixing this. Possible fixes include somehow eliminating one of the mentioned txn-update calls. A more drastic fix might be looking at using a disk sync that is not transactional: we currently use separate native store as the "backup", which has overhead. Something like MapDB or Ehcache for persistence might give better performance.

daltontc commented 1 year ago

@abrokenjester Is this still on your guys' radar? I am seeing some pretty poor performance when loading in very large files with conn.add(File, Resource...) and a lot of it is being stuck in the MemoryOverflowModel:

hmottestad commented 1 year ago

You should be able to bypass it if you start a transaction with the NONE isolation level. As long as you don't need transaction isolation.

abrokenjester commented 1 year ago

It's kind of dropped off the radar a bit to be honest. @hmottestad 's suggestion to disable txn isolation for bulk uploads is a good workaround, but for fixing the underlying issue, we could use some fresh perspectives or an extra pair of hands.

daltontc commented 1 year ago

I'd be interested in potentially contributing a bit more to the project, but would definitely need to familiarize myself more with the codebase.

EDIT: removed message and made bug

kenwenzel commented 1 year ago

@abrokenjester As far as I can see contains(s,p,o,c...) is only used for tracking the size of the SailSourceModel and may be omitted. The challenge is then that add(...) and remove(...) return a boolean value to indicate that a statement is actually added or removed. But in the specific case investigated here the return value is not used by ChangeSet - neither here https://github.com/eclipse/rdf4j/blob/39a07105167201124c75bb9decc9749a143cc7e7/core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/Changeset.java#L418 nor here https://github.com/eclipse/rdf4j/blob/39a07105167201124c75bb9decc9749a143cc7e7/core/sail/base/src/main/java/org/eclipse/rdf4j/sail/base/Changeset.java#L449

kenwenzel commented 1 year ago

One of the main reasons for the bad performance is that MemoryOverflowModel switches too late to a disk-based model.

eclipse-rdf4j / rdf4j

MemoryOverflowModel transactional disk sync performs very poorly #2998