clingen-data-model / genegraph

Presents an RDF triplestore of gene information using GraphQL APIs

RocksDB memory still grows vastly beyond JVM and configured rocksdb buffer limits #776

Closed: theferrit32 closed this issue 7 months ago

theferrit32 commented 1 year ago

Okay, I think I know what's going on with the RocksDB memory now. The native RocksDB C++ objects are destroyed through the JNI interface in the Java finalize() method, which runs when the wrapper object is garbage collected. However, the C++ object, and the process memory it is using, is not visible to the JVM (it doesn't even show up under -XX:NativeMemoryTracking). The JVM will eventually clean up the native C++ objects, but it has no signal for when it needs to: if many of these wrappers are created while producing very little garbage on the JVM heap itself, the JVM doesn't realize the process is rapidly consuming memory beyond its limit and that it should run a GC. So the JVM happily keeps running and allocating, thinking it is still below the memory limit even when it isn't.

This generates a few MB of JVM garbage, but several GB of native memory allocations:

(doseq [i (range (long 1e8))]
  ;; allocate a native-backed Options and never close it
  (doto (org.rocksdb.Options.)))

Those native objects then sit in memory for a long time, because the JVM doesn't think it needs to run a GC (one will eventually happen, but possibly not for a while).

Adding an explicit .close invokes the cleanup on the C++ object, destroying it immediately. The JVM wrapper object remains and will be GC'd later, but the C++ object is already freed, so the problem largely goes away. Technically, any not-yet-closed RocksDB objects are still using a bit more memory than the JVM is aware of, but as long as there's enough free memory this isn't a problem.

(doseq [i (range (long 1e8))]
  (doto (org.rocksdb.Options.)
    ;; free the native C++ object immediately
    (.close)))

This third-party copy of the RocksJava wiki explains the memory-management model: https://github.com/EighteenZi/rocksdb_wiki/blob/master/RocksJava-Basics.md#memory-management

larrybabb commented 1 year ago

@theferrit32 definitely want a status on this. I think you've fixed it, but I didn't read the whole thread. In any case, let's make sure this is up to date and categorized properly.