clingen-data-model / genegraph

Presents an RDF triplestore of gene information using GraphQL APIs

Investigate RocksDB memory usage #763

Closed by theferrit32 1 year ago

theferrit32 commented 1 year ago

There are some buffers the rocksdb library uses internally in open database objects, beyond just the write queue, to reduce write load and perform some pre-write optimizations. Apparently they can get quite large, and since they aren't managed by the JVM, reducing -Xmx has no effect on the size or flushing of those buffers.

I just ran a snapshot job for the clinvar data with the java option -Xmx512m and a kubernetes container memory limit of 2Gi, and it was OOMKilled about 75% of the way through the job. That is a pretty high ratio of host memory to JVM heap for the process to still exceed the container limit and get OOM killed by kubernetes. I don't see any glaring issues in the clojure code, like sequence head retention, that would cause a memory leak, so the next thing I'd look at is the rocksdb objects and the buffers they hold. Memory usage in those buffers could grow significantly now that we have many more rocksdb objects, one for each data type.
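
As a quick way to see how little of the resident set the JVM heap actually accounts for, something like the following could be logged alongside the container's memory metrics. This is a minimal sketch, not code in the repo; RocksDB's native allocations never show up in these numbers, which is exactly why RSS can exceed -Xmx by gigabytes.

;; Hypothetical helper, not part of genegraph: print JVM heap figures so they
;; can be compared with the container's resident memory.
(defn log-jvm-memory []
  (let [rt (Runtime/getRuntime)
        mb #(quot % (* 1024 1024))]
    (println (format "heap used=%dMB committed=%dMB max=%dMB"
                     (mb (- (.totalMemory rt) (.freeMemory rt)))
                     (mb (.totalMemory rt))
                     (mb (.maxMemory rt))))))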

https://github.com/facebook/rocksdb/issues/280

https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
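
For reference, the memtable (write buffer) sizes discussed in the wiki page are tunable when a database is opened. The following is only a sketch of the relevant RocksJava options with illustrative sizes, not a change genegraph currently makes:

(import '(org.rocksdb Options RocksDB))

;; Sketch: cap each memtable at 64MB and allow at most two per database, so
;; the write-side native memory for one db is roughly bounded at 128MB.
;; The sizes are illustrative, not recommendations.
(defn open-with-bounded-write-buffers [path]
  (let [opts (doto (Options.)
               (.setCreateIfMissing true)
               (.setWriteBufferSize (* 64 1024 1024))
               (.setMaxWriteBufferNumber 2))]
    (RocksDB/open opts path)))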

theferrit32 commented 1 year ago

Trying a workaround: update the kubernetes version to 1.25+ and the ubuntu version in the genegraph docker image to 22.04+ (jammy) in order to enable cgroups v2, which might enable better handling of the host kernel page cache for kubernetes containers with a memory limit defined.

theferrit32 commented 1 year ago

Trying an additional tweak to the options used when opening the RocksDB object: loading index blocks under the same size quota as the regular block cache, which reduces resident memory. For our databases the index sizes are substantial (for clinical_assertion in the February 10 release they approach 400MB), and by default RocksDB loads all of them into memory to improve random-access response times. Since the index blocks are stored compressed on disk, loading one from the kernel page cache isn't free; it must be decompressed into the process block cache.

Setting setCacheIndexAndFilterBlocks to true on the BlockBasedTableConfig attached to the Options object corresponds to the cache_index_and_filter_blocks option in the native library.

(https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB#indexes-and-filter-blocks)

This almost completely flattens memory usage during stream loading, since memory usage in the rocksdb library is now capped. The read/write buffers in the library are bounded by default; only the index cache was unbounded.

(defn open [db-name]
  ;; create-db-path and the io alias (clojure.java.io) come from the
  ;; surrounding genegraph namespace.
  (let [full-path (create-db-path db-name)
        opts (doto (Options.)
               (.setTableFormatConfig
                (doto (org.rocksdb.BlockBasedTableConfig.)
                  ;; Cache index and filter blocks in the block cache so they
                  ;; count against the same size quota as data blocks instead
                  ;; of being held in memory without bound.
                  (.setCacheIndexAndFilterBlocks true)))
               (.setCreateIfMissing true))]
    (io/make-parents full-path)
    ;; todo add logging...
    (RocksDB/open opts full-path)))

theferrit32 commented 1 year ago

Unfortunately this doesn't solve the issue; several hours in there is still a rapid increase in memory usage that exceeds the available memory.

(Screenshot, 2023-04-16 7:10 PM: memory usage graph)

One feature I want to add to genegraph is periodic logging of memory usage. Estimated usage for the block cache and index blocks can be queried via properties on the RocksDB object.

This outputs a list of vectors of the RocksDB var and the property value:

(->> (mount/running-states)
     ;; running-states returns state names as strings; resolve them to vars
     (map read-string)
     (map eval)
     (map #(vector % (var-get %)))
     ;; keep only the states whose value is an open RocksDB handle
     (filter #(instance? org.rocksdb.RocksDB (second %)))
     (map #(vector (first %)
                   (-> % second (.getProperty "rocksdb.estimate-table-readers-mem")))))

Example:

'([#'genegraph.transform.clinvar.variation/variation-data-db "110171728"] …)
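
A rough sketch of what that periodic logging could look like, assuming a map of names to open org.rocksdb.RocksDB handles; the property names are standard RocksDB properties, while the helper names and loop details are hypothetical:

(defn db-memory-stats
  "Estimated native memory usage (bytes, as strings) for one open RocksDB."
  [^org.rocksdb.RocksDB db]
  {:table-readers (.getProperty db "rocksdb.estimate-table-readers-mem")
   :block-cache   (.getProperty db "rocksdb.block-cache-usage")
   :pinned-blocks (.getProperty db "rocksdb.block-cache-pinned-usage")
   :memtables     (.getProperty db "rocksdb.cur-size-all-mem-tables")})

;; Hypothetical periodic logger: every interval-ms print the stats for each
;; db in dbs (a map of name -> open RocksDB handle). Returns a future that
;; can be stopped with future-cancel.
(defn start-memory-logger [dbs interval-ms]
  (future
    (while true
      (doseq [[db-name db] dbs]
        (println db-name (db-memory-stats db)))
      (Thread/sleep interval-ms))))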

The next thing I will do is look at potential issues with iterators keeping blocks pinned in memory.

https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB#blocks-pinned-by-iterators

theferrit32 commented 1 year ago

Issue resolved by replacing ga4gh/latest-versions-seq with a new implementation, ga4gh/latest-versions-seq-all. The new implementation does not use any prefix iterators, does not skip or reverse the top-level iterator, and does all the filtering and latest-version selection with plain clojure collection functions over the sequential elements of the full db iterator. It also requires the caller to manage the open reference to the RocksIterator object used.

https://github.com/clingen-data-model/genegraph/blob/3b9232931e4ebf94ada4b056ab47753667dc221e/src/genegraph/transform/clinvar/ga4gh.clj#L210-L230
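
The actual implementation is at the link above; the sketch below only illustrates the shape of the approach: walk the whole database as a lazy seq of entries, keep the latest version per id with ordinary collection functions, and leave the RocksIterator's lifetime to the caller. The record accessors (:id, :version) and decode-fn are hypothetical stand-ins for the real record structure.

(defn rocks-entry-seq
  "Lazy seq of [key-bytes value-bytes] pairs from an open RocksIterator.
  The caller owns the iterator and must close it after the seq is consumed."
  [^org.rocksdb.RocksIterator it]
  (.seekToFirst it)
  (letfn [(step []
            (lazy-seq
             (when (.isValid it)
               (let [entry [(.key it) (.value it)]]
                 (.next it)
                 (cons entry (step))))))]
    (step)))

(defn latest-versions-sketch
  "Sketch only: db is an open RocksDB; decode-fn turns a [key value] pair of
  byte arrays into a map with :id and :version (hypothetical field names).
  doall forces the seq before the iterator is closed."
  [^org.rocksdb.RocksDB db decode-fn]
  (let [it (.newIterator db)]
    (try
      (->> (rocks-entry-seq it)
           (map decode-fn)
           (group-by :id)
           vals
           (map #(apply max-key :version %))
           doall)
      (finally (.close it)))))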

theferrit32 commented 1 year ago

Process resident memory stayed below 2.5GB, and after peaking there it settled around 1.5-1.75GB. There was a small uptick when the database loading finished and the database iteration to write the snapshots started, around 5:22 AM.

(Screenshot, 2023-04-18 3:14 PM: process resident memory graph)

theferrit32 commented 1 year ago

Closed in #768

May need to be revisited to guide future changes to how we manage open dbs. Maybe changing it to use a single rocksdb instance with multiple column families (one per data type) would be easier to manage than splitting data across multiple rocksdb instances, each with a single default column family. I'm also not sure about the memory implications, i.e. whether the buffers are per column family or per db.
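
If that direction is explored, RocksJava does support opening one database with several column families. A minimal sketch, with illustrative names; whether memtables and block cache are shared across column families would still need to be verified:

(import '(org.rocksdb RocksDB DBOptions ColumnFamilyDescriptor ColumnFamilyHandle))

;; Sketch: one RocksDB instance with a column family per data type, instead
;; of one database per type. Memory behavior across column families would
;; need to be confirmed before adopting this layout.
(defn open-with-column-families [path cf-names]
  (let [descriptors (into [(ColumnFamilyDescriptor. RocksDB/DEFAULT_COLUMN_FAMILY)]
                          (map #(ColumnFamilyDescriptor. (.getBytes ^String %)) cf-names))
        handles (java.util.ArrayList.)
        opts (doto (DBOptions.)
               (.setCreateIfMissing true)
               (.setCreateMissingColumnFamilies true))
        db (RocksDB/open opts path descriptors handles)]
    {:db db
     :handles (zipmap (cons "default" cf-names) handles)}))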