GenSpectrum / LAPIS

An API, a query engine, and a database schema for genomic sequences; currently with a focus on SARS-CoV-2
https://lapis-three.vercel.app
GNU Affero General Public License v3.0
19 stars 5 forks source link

OutOfMemoryError causes LAPIS to stop working without crashing #812

Closed chaoran-chen closed 1 month ago

chaoran-chen commented 3 months ago

I think that sometimes, LAPIS can get an OutOfMemoryError and stop working but the program doesn't crash so Docker won't restart it. We should ensure that LAPIS terminates with a non-zero exit if it enters a non-working state (or avoid ending up in such a state in the first place).

fengelniederhammer commented 2 months ago

How can we reproduce this issue? How did this come up? How did you discover it?

pflanze commented 2 months ago

A question is why the JVM can't allocate memory: we have swap now. Either it allocates a lot of unused virtual memory; or there are some memory limits configured (for the Docker container?). Given enough Swap and no memory limits, there will be no memory allocation errors, but instead the OOM killer will choose a program to kill when actual memory (RAM + swap) is used up.

fengelniederhammer commented 2 months ago

We can try to reproduce it by limiting the memory to a quite low value and sending a query that needs a lot of memory (details? mutations with min proportion 0?)

Analyze what happens and understand the issue.

Then think about action, ideal outcome: We never end up in a non-responsive state. Either LAPIS fully crashes (so that e.g. Kubernetes knows that it needs to be restarted), or it survives and simiply returns an error.

JonasKellerer commented 2 months ago

My findings so far: I started LAPIS and SILO with data from mpox (715 entries). For LAPIS I added the following jvm options: JVM_OPTS=-XX:+UseContainerSupport -XX:+UseG1GC -XX:MaxHeapFreeRatio=5 -XX:MinHeapFreeRatio=2 -XX:MaxGCPauseMillis=100 and added a resource limit of 230M for the docker container. I started SILO and its preprocessing without any extra options.

Then I ran:

curl -X 'GET' \                                                                                                                                                   130 ↵ ──(Do,Jun13)─┘
  'http://localhost:8090/sample/nucleotideMutations?minProportion=0.001&dataFormat=JSON&downloadAsFile=false' \
  -H 'accept: application/json'

Then the same with minProportion=0.002, no error. Still LAPIS responded, however the memory usage increased to approximately 229M.

Then with minProportion=0.0005 I got an error: "title":"Internal Server Error","status":500,"detail":"Handler dispatch failed: java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects", which also showed in the LAPIS logs.

lapisOpen-1  | 2024-06-13 12:18:13,993 DEBUG [http-nio-8080-exec-7] [d2b1cdf5-2c2f-4d31-88da-957009e145ab] org.springframework.web.filter.CommonsRequestLoggingFilter: Before request [GET /sample/nucleotideMutations]
lapisOpen-1  | 2024-06-13 12:18:14,026 INFO [http-nio-8080-exec-7] [d2b1cdf5-2c2f-4d31-88da-957009e145ab] org.genspectrum.lapis.silo.CachedSiloClient: Calling SILO: {"action":{"minProportion":5.0E-4,"randomize":false,"type":"Mutations"},"filterExpression":{"type":"True"}}
lapisOpen-1  | 2024-06-13 12:18:14,580 INFO [http-nio-8080-exec-7] [d2b1cdf5-2c2f-4d31-88da-957009e145ab] org.genspectrum.lapis.silo.CachedSiloClient: Response from SILO: 200
lapisOpen-1  | 
lapisOpen-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "http-nio-8080-Poller"
lapisOpen-1  | 2024-06-13 12:18:23,580 WARN [http-nio-8080-exec-7] [d2b1cdf5-2c2f-4d31-88da-957009e145ab] org.genspectrum.lapis.controller.ExceptionHandler: Caught unexpected exception: Handler dispatch failed: java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
lapisOpen-1  | jakarta.servlet.ServletException: Handler dispatch failed: java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
lapisOpen-1  |  at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1104)
lapisOpen-1  |  at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:979)
lapisOpen-1  |  at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1014)
lapisOpen-1  |  at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:903)
lapisOpen-1  |  at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:564)
lapisOpen-1  |  at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:885)
lapisOpen-1  |  at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:658)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:195)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:51)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.springframework.web.filter.AbstractRequestLoggingFilter.doFilterInternal(AbstractRequestLoggingFilter.java:289)
lapisOpen-1  |  at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.genspectrum.lapis.logging.RequestContextLogger.handleAndLogRequest(RequestContext.kt:56)
lapisOpen-1  |  at org.genspectrum.lapis.logging.RequestContextLoggerFilterAdapter.doFilterInternal(RequestContext.kt:36)
lapisOpen-1  |  at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.genspectrum.lapis.controller.DownloadAsFileFilter.doFilterInternal(DownloadAsFileFilter.kt:34)
lapisOpen-1  |  at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.genspectrum.lapis.controller.DataFormatParameterFilter.doFilterInternal(DataFormatParameterFilter.kt:43)
lapisOpen-1  |  at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.genspectrum.lapis.controller.CompressionFilter.doFilterInternal(CompressionFilter.kt:153)
lapisOpen-1  |  at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.genspectrum.lapis.auth.DataOpennessAuthorizationFilter.doFilterInternal(DataOpennessAuthorizationFilter.kt:58)
lapisOpen-1  |  at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.genspectrum.lapis.logging.RequestIdFilter.doFilterInternal(RequestId.kt:38)
lapisOpen-1  |  at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.springframework.web.filter.RequestContextFilter.doFilterInternal(RequestContextFilter.java:100)
lapisOpen-1  |  at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.springframework.web.filter.FormContentFilter.doFilterInternal(FormContentFilter.java:93)
lapisOpen-1  |  at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.springframework.web.filter.ServerHttpObservationFilter.doFilterInternal(ServerHttpObservationFilter.java:109)
lapisOpen-1  |  at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:201)
lapisOpen-1  |  at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:116)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:101)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:164)
lapisOpen-1  |  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:140)
lapisOpen-1  |  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:167)
lapisOpen-1  |  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:90)
lapisOpen-1  |  at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:482)
lapisOpen-1  |  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:115)
lapisOpen-1  |  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:93)
lapisOpen-1  |  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:74)
lapisOpen-1  |  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:344)
lapisOpen-1  |  at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:389)
lapisOpen-1  |  at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:63)
lapisOpen-1  |  at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:896)
lapisOpen-1  |  at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1741)
lapisOpen-1  |  at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:52)
lapisOpen-1  |  at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1190)
lapisOpen-1  |  at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
lapisOpen-1  |  at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:63)
lapisOpen-1  |  at java.base/java.lang.Thread.run(Thread.java:1583)
lapisOpen-1  | Caused by: java.lang.OutOfMemoryError: Java heap space: failed reallocation of scalar replaced objects
lapisOpen-1  | 2024-06-13 12:18:23,642 DEBUG [http-nio-8080-exec-7] [d2b1cdf5-2c2f-4d31-88da-957009e145ab] org.springframework.web.filter.CommonsRequestLoggingFilter: After request [GET /sample/nucleotideMutations]

However, LAPIS was still running and was responsive (tested by calling aggregated endpoint)

Finally, I tried minProportion=0.003. This now turned LAPIS in an unresponsive state. The call did not finish, and also other calls to LAPIS did not go through. In the LAPIS logs I found the following:

lapisOpen-1  | 2024-06-13 12:19:14,936 ERROR [Catalina-utility-2] [] org.apache.juli.logging.DirectJDKLog: Error processing async timeouts
lapisOpen-1  | java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
lapisOpen-1  |  at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
lapisOpen-1  |  at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
lapisOpen-1  |  at org.apache.coyote.AbstractProtocol.startAsyncTimeout(AbstractProtocol.java:651)
lapisOpen-1  |  at org.apache.coyote.AbstractProtocol.lambda$start$0(AbstractProtocol.java:637)
lapisOpen-1  |  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
lapisOpen-1  |  at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358)
lapisOpen-1  |  at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
lapisOpen-1  |  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
lapisOpen-1  |  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
lapisOpen-1  |  at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:63)
lapisOpen-1  |  at java.base/java.lang.Thread.run(Thread.java:1583)
lapisOpen-1  | Caused by: java.lang.OutOfMemoryError: Java heap space

The last step is not reproducible all the time. Sometimes no error message shows up, but LAPIS is still unresponsive. A look on the resource consumptions shows that the memory consumption has gone down to 220M.

JonasKellerer commented 2 months ago

After more digging I found the following:

The cache we are using takes up some of the heap memory. The more requests are cached, the closer we get to the heap memory limit. One can set the heap memory limit by hand using -Xmx100M to a fixed value or by -XX:+UseContainerSupport to the limits of the container. However, this memory limit is for the full application.

When we have enough requests stored in the cache, then a new request can surpass the heap memory limit. If the limit is reached in our code, then we get a corresponding error message, which is then returned to the user and our application continues to run. However, this limit can also be reached deep in the code, especially in the tomcat part (our http web server). This is then somehow an error, which can not be recovered. Our application continues to run, but our webserver is down.

I am now following up on two ideas: first, we can try to check periodically if tomcat is still working, and if not we kill our application, second, we can try to limit the cache. Currently, we use the number of requests as a limit. Caffeine our cache provider does not allow to set a limit to the cache size in MB, however there might be a way to calculate a "weight" of a request and then limit the sum of weights in Caffeine.

A simple solution would be to limit the number of stored requests by the maintainer in the application.properties spring.cache.caffeine.spec=maximumSize=50000. The maintainer would then need to choose a fitting size, depending on the available heap size.

pflanze commented 2 months ago

using -Xmx100M to a fixed value

When doing that, did you remove the container limit to make sure the JVM doesn't still bump into it? If you didn't remove it then the JVM could still run into it while carrying out its own limiting: the limit given in -Xmx may refer to memory as seen by the Java application or otherwise excluding GC overhead, so a container limit might have to be possibly 2x larger or even more to avoid that. Best to just not set any container limit anymore.

However, this limit can also be reached deep in the code, especially in the tomcat part (our http web server). This is then somehow an error, which can not be recovered. Our application continues to run, but our webserver is down.

A process (on Linux/Unix) ends when the main thread ends (and I don't expect the JVM to virtualize and change this). If the thread running Tomcat stops listening (is this the case?) or serving requests then it sounds like it terminated, but if the process doesn't end at that point that sounds like that wasn't the main thread; and apparently the main thread doesn't exit at that point either, and I wonder why: what is it doing? Did it start Tomcat in another thread then go to sleep (instead of say, waiting for the Tomcat thread to terminate by thread-join'ing it)?

(I had a quick look but can't find any mention of Tomcat in the LAPIS source code other than a URL in a comment.)

check periodically if tomcat is still working, and if not we kill our application

I think that if reaching the point where that is necessary, removing memory limits altogether and instead letting the Linux kernel kill the app when necessary via its OOM (out of memory) killer (or even better, use the earlyoom daemon or other such customizable killer) is the better option: it would act immediately, and let the machine resources be used optimally (Linux/earlyoom have the full picture and know better when there's no memory left). It's just important to configure OOM killer scores / earlyoom settings so that Linux will preferentially kill e.g. LAPIS and not SILO (due to the latter being slower to start) (although I argue that SILO should run on another machine for security reasons). Linux/earlyoom would let large cached responses be swapped to disk and the app will continue to run, although swap is less performant than a separate file per cached response--and that would be another idea, cache the responses to filesystem files (possibly on tmpfs) instead of process RAM, then they don't count towards the -Xmx memory limit and the cache cleanup algorithm will work as is.

JonasKellerer commented 2 months ago

When doing that, did you remove the container limit to make sure the JVM doesn't still bump into it?

I tried it in two ways: First, I increased the container limit to a value, way above the limit. Still the same issue. Second, I tried it without putting LAPIS into a container and only ran it locally on my pc, without the container limit. The same issue.

(I had a quick look but can't find any mention of Tomcat in the LAPIS source code other than a URL in a comment.)

Spring boot uses tomcat under the hood. Its nothing we include by hand. I found that one can change the default web server, but I have no clue if this could help.

Personally, I don't like the idea of killing LAPIS because the size of the cache is too large. This is something we should have under our control. But when it comes to this we can have your suggestion in mind.

JonasKellerer commented 1 month ago

Maybe we can also use "softReferences" from caffeine directly. This is an easy option to toggle, where the garbage collector should remove these entries, when the memory is full.

pflanze commented 1 month ago

This is a feature unique to the JVM that I have never seen elsewhere. It probably means that if you use up the memory for other things (preparing a very large response in memory before sending it, or having many clients e.g. due to a DoS) then the cache will be all gone (which can have adverse effects in the DoS case); thus it will be even more important to have limits on those other resources (limit the number of http connections, the size of generated pages in memory, etc.). The issue of having to consciously limit things to avoid attacks doesn't go away--softReferences just allows you to make use of as much spare RAM as you can. Edit: i.e. when using softReferences, caches using them can be the only things that do not need limits put on them. But you still need limits for all other things (or you'll still run into the master limit (Docker memory limit) that you subject the JVM to, with the old problematic consequences).