Fluree crash leaves Athens in bad state

alexandergunnarson commented 2 years ago

Problem

Fluree crashes with OOM on a 4GB AWS instance (t1a.medium) with a smallish graph, and Athens (apparently) doesn't try to reconnect. docker-compose apparently doesn't try to bring it back up automatically. docker-compose restart fixes the problem.

Granted, I'm using docker-compose up -d athens to avoid using nginx, so it may have something to do with it, but doubtful.

Screenshots/Demo

fluree_1  | 2022-01-25 18:54:20,953 ERROR f.db.ledger.transact - Fatal error, after an error processing a block an unexpected error happened trying to remove the involved transactions from raft state: ("503d895ee4aed8a0dc1d0e0a918f36e633ada861510d0343af8ecca23d684d28") - clojure.lang.ExceptionInfo: Command timed out.\n    at fluree.raft.events$register_callback_event$fn__64081$state_machine__5237__auto____64094$fn__64097.invoke(events.clj:130)\n at fluree.raft.events$register_callback_event$fn__64081$state_machine__5237__auto____64094.invoke(events.clj:122)\n   at clojure.core.async.impl.ioc_macros$run_state_machine.invokeStatic(ioc_macros.clj:978)\n      at clojure.core.async.impl.ioc_macros$run_state_machine.invoke(ioc_macros.clj:977)\n  at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invokeStatic(ioc_macros.clj:982)\n    at clojure.core.async.impl.ioc_macros$run_state_machine_wrapped.invoke(ioc_macros.clj:980)\n  at clojure.core.async$ioc_alts_BANG_$fn__5466.invoke(async.clj:421)\n   at clojure.core.async$do_alts$fn__5405$fn__5408.invoke(async.clj:288)\n       at clojure.core.async.impl.channels.ManyToManyChannel$fn__797.invoke(channels.clj:265)\n        at clojure.lang.AFn.run(AFn.java:22)\n        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n        at clojure.core.async.impl.concurrent$counted_thread_factory$reify__635$fn__636.invoke(concurrent.clj:29)\n   at clojure.lang.AFn.run(AFn.java:22)\n  at java.base/java.lang.Thread.run(Thread.java:829)\n
fluree_1  | 2022-01-25 18:54:55,705 INFO  fluree.db.server - SHUTDOWN Start - 
fluree_1  | 2022-01-25 18:55:26,356 INFO  fluree.db.ledger.stats - Memory:  {"used":"0.7 GB","committed":"1.6 GB","max":"2.0 GB","init":"1.0 GB","time":"2022-01-25T18:55:26.204571Z"} - 
...
fluree_1  | 2022-01-25 19:01:23,192 INFO  fluree.db.ledger.stats - Group state:  {"version":3,"leases":{"servers":{"myserver":{"id":"myserver","expire":1643137282289}}},"_work":{"networks":{"events":"myserver"}},"networks":{"events":{"dbs":{"log":{"status":"ready","block":1147,"index":742,"indexes":{"1":1642711235209,"353":1642803692749,"742":1642832535977}}}}}} - 
fluree_1  | #
fluree_1  | # There is insufficient memory for the Java Runtime Environment to continue.
fluree_1  | # Native memory allocation (mmap) failed to map 16384 bytes for committing reserved memory.
fluree_1  | # An error report file with more information is saved as:
fluree_1  | # /opt/fluree/hs_err_pid1.log
fluree_1  | [thread 52 also had an error]
fluree_1  | Java version 11.
...
athens_1  | 19:04:40.676 WARN  [async-dispatch-3] fluree.db.util.log - "Server contact error: " "xhttp error - http://fluree:8090/fdb/health - Don't know how to convert  into class java.lang.String" {:url "http://fluree:8090/fdb/health", :error :xhttp/unknown-error}
athens_1  | 19:05:48.132 WARN  [async-dispatch-4] fluree.db.util.log - "Connection has gone stale. Perhaps network conditions are poor. Disconnecting socket."
...
fluree_1  | 2022-01-25 19:04:43,188 INFO  fluree.db.server - JVM arguments:  {:jvm "OpenJDK 64-Bit Server VM", :input ["-Xmx2g" "-Xms1g" "-XX:+UseG1GC" "-XX:MaxGCPauseMillis=50" "-Dfdb-storage-file-root=/var/lib/fluree/" "-Dfdb-group-log-directory=/var/lib/fluree/group/" "-Dfdb.properties.file=./fluree_sample.properties" "-Dfdb.log.ansi" "-Dlogback.configurationFile=./logback.xml"]} - 
fluree_1  | 2022-01-25 19:04:43,202 INFO  fluree.db.server - Memory Info:  {:used 0.3 GB, :committed 1.7 GB, :max 2.0 GB, :init 1.0 GB, :time 2022-01-25T19:04:43.194182Z} -

# While SSH'ed into the machine
curl localhost:3010
# =>
# curl: (7) Failed to connect to localhost port 3010: Connection refused

Athens Version

v2.0.0-beta.12

filipesilva commented 2 years ago

I think what's happening here is:

1) the java process in the fluree container says it doesn't have enough memory, and kills itself
2) the athens process in the athens container tries to connect to fluree, but can't, and just hangs there indefinitely
3) the fluree container tries to restart its java process continuously, maybe succeeding, maybe failing
4) the athens process doesn't try to connect again

We've seen a similar problem in our server when we were indeed out of memory due to other things running in the background. So I think the way forward for you is to either increase the memory on that server (we use 8gb in ours, but we have more data too I think), or to check if there's something else eating up the memory in that server.
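
Since the diagnosis above is that Athens never retries after Fluree goes away, a wrapper that polls Fluree's health endpoint is one possible workaround. The following is only a minimal sketch, not anything Athens or Fluree ships: `retry_with_backoff` and `fluree_healthy` are hypothetical helper names, and the URL is the one that appears in the Athens log above, which presumably only resolves inside the compose network.

```shell
#!/bin/sh
# Hypothetical helper: retry a command with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Run the command; stop as soon as it succeeds.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))   # double the wait between attempts
  done
}

# Hypothetical check: succeeds only if the health endpoint answers with a 2xx status.
# The URL is copied from the Athens log; adjust host and port for your setup.
fluree_healthy() {
  curl --silent --fail --max-time 5 http://fluree:8090/fdb/health >/dev/null
}

# Example (not executed here): retry_with_backoff 5 1 fluree_healthy
```

If the check still fails after the last attempt, the caller could fall back to the `docker-compose restart` that the original report says fixes the problem.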

alexandergunnarson commented 2 years ago

Makes total sense and lines up with what I was seeing. There’s nothing else on the server so we’ll have to bump memory. Strange though, because not only are we not running nginx, but we don’t have much data yet. I suppose 2GB each for fluree and Athens is pretty paltry for Clojure.

filipesilva commented 2 years ago

To be honest, it really surprises me that you're running into memory problems on a small graph.

Our team graph only showed those problems after several months of use, and because we were using several gigs of memory in other background processes in that machine.

Fluree itself needs about 1gb to run (but this can be adjusted I think) and the Athens server needs about 2gb (we've spent 0 effort optimising this yet).

alexandergunnarson commented 2 years ago

Also surprised! Yeah, we don't have any background processes other than those required to run Linux and Docker. I wonder if it's either 1) native memory being used in addition to heap memory, or 2) the machine having less memory than advertised: htop says there's 3.8 GB of total memory, not 4 GB, and grep MemTotal /proc/meminfo says 3989320 kB, which seems right. Perhaps we can try leaving -Xmx a buffer of, say, 10%.
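
The arithmetic behind that suspicion can be sketched in a few lines of shell. The MemTotal figure and Fluree's `-Xmx2g` come from this thread; the 2 GiB Athens heap is an assumption based on the "2GB each" remark above, and `kb_to_mib` and `heap_with_buffer` are hypothetical helper names.

```shell
#!/bin/sh
# Back-of-the-envelope memory budget for the 4GB instance discussed above.

# Convert a kB figure (as printed by /proc/meminfo) to whole MiB.
kb_to_mib() {
  echo $(( $1 / 1024 ))
}

# Shrink a heap size (in MiB) by a percentage buffer.
heap_with_buffer() {
  echo $(( $1 * (100 - $2) / 100 ))
}

total_mib=$(kb_to_mib 3989320)   # MemTotal quoted in the comment above
fluree_heap_mib=2048             # -Xmx2g, from the Fluree JVM arguments in the log
athens_heap_mib=2048             # ASSUMPTION: Athens also runs with a 2 GiB max heap

echo "physical memory: ${total_mib} MiB"
echo "combined max heap: $(( fluree_heap_mib + athens_heap_mib )) MiB"
echo "each heap with a 10% buffer: $(heap_with_buffer 2048 10) MiB"
```

Under these assumptions the two maximum heaps add up to 4096 MiB, which is already more than the 3895 MiB the kernel reports, before any native (non-heap) JVM memory, Docker, or the OS is counted. A 10% buffer brings each heap down to 1843 MiB, which fits but leaves little headroom.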

visika commented 1 year ago

My server happened to crash; I tried the solutions listed, but fluree keeps being unhealthy, and I hadn't set up the backup yet. At this point it appears the backup utility can't connect to the fluree database and can't produce the backup. Is my data lost forever?

athensresearch / athens

Fluree crash leaves Athens in bad state #2001