b2ihealthcare / snow-owl

:owl: Snow Owl Terminology Server - a production-ready, scalable, FHIR Terminology Service compliant server that supports SNOMED CT International and Extensions, LOINC, RxNorm, UMLS, ICD-10, custom code systems and many others
https://docs.b2ihealthcare.com/snow-owl/
Apache License 2.0
258 stars 32 forks source link

Remote job cleanup exception kills the whole app/ API section #462

Closed ivelina-yordanova closed 4 years ago

ivelina-yordanova commented 4 years ago

While playing around with the FHIR API (using the GET endpoints only) I suddenly started getting timeout responses. This then turned into the whole application being unable to load. After restarting the pods the other endpoints were working, but all FHIR ones throw this exception, on each call:

Exception in thread "Remote job cleanup" com.b2international.index.IndexException: Failed to execute index write
[2019-12-24T14:06:16.043] ERROR qtp1714099025-77                                 com.b2international.snowowl.core.rest.ControllerExceptionMapper  Exception during request processing org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
    at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
    at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:79)
    at org.springframework.web.context.request.async.WebAsyncManager.lambda$startDeferredResultProcessing$5(WebAsyncManager.java:427)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
    at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:150)
    at org.eclipse.jetty.server.HttpChannelState$2.run(HttpChannelState.java:610)
    at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1436)
    at org.eclipse.jetty.server.HttpChannelState.runInContext(HttpChannelState.java:1135)
    at org.eclipse.jetty.server.HttpChannelState.onTimeout(HttpChannelState.java:631)
    at org.eclipse.jetty.server.AsyncContextEvent.lambda$run$0(AsyncContextEvent.java:164)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
    at java.base/java.lang.Thread.run(Thread.java:834)

    at com.b2international.index.DefaultIndex.write(DefaultIndex.java:51)
    at com.b2international.snowowl.datastore.remotejobs.RemoteJobTracker$CleanUpTask.run(RemoteJobTracker.java:73)
    at java.base/java.util.TimerThread.mainLoop(Timer.java:556)
    at java.base/java.util.TimerThread.run(Timer.java:506)
Caused by: java.io.IOException: listener timeout after waiting for [30000] ms
    at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:912)
    at org.elasticsearch.client.RestClient.performRequest(RestClient.java:233)
    at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1764)
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1734)
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1696)
    at org.elasticsearch.client.RestHighLevelClientExt.clearScroll(RestHighLevelClientExt.java:61)
    at com.b2international.index.es.client.http.EsHttpClient.clearScroll(EsHttpClient.java:153)
    at com.b2international.index.es.EsDocumentSearcher.search(EsDocumentSearcher.java:216)
    at com.b2international.snowowl.datastore.remotejobs.RemoteJobTracker$CleanUpTask.lambda$0(RemoteJobTracker.java:74)
    at com.b2international.index.DefaultIndex.write(DefaultIndex.java:47)
    ... 3 more

Now, I have no way of fixing that because it seems like the restart even doesn't do it.

I am not sure if this is something intentional but I wouldn't expect a cleanup/maintenance job should kill the whole app.

Any advice how to recover from this? Should deleting the index, somehow, fix it or would that make things worst?

ivelina-yordanova commented 4 years ago

So it ended up being a bigger problem after the restart and few Get tries, it started throwing a LOT of exceptions ( Review cleanup, classification...) and ended up dying with OutOfMemory.

cmark commented 4 years ago

Hi @ivelina-yordanova,

I don't think it is the maintenance/cleanup thread that kills the app. Instead, from the symptoms you have described, I think the app is dying because it does not have enough memory and constantly tries to run the GC. There should be a JVM GC log somewhere in the home directory of the application which can be helpful to resolve memory/GC issues. Could you please send us that file for further investigation?

Thanks, Mark

ivelina-yordanova commented 4 years ago

Hi @cmark ,

It did end up dying with OOM exception but I'd think it would work fine at least in the beginning after a pod restart and it didn't, only the other API enpoints were working after a restart and the FHIR ones weren't.

What helped "fix" the FHIR endpoints, was flushing the jobs index and clearing the cache when the pod died again and restarting it again. This is what makes me believe the index issue is what had a role in this but to be fair I had not played too much with the endpoints since.

I'll try to dig out the logs but it's been almost 10 days, not sure I will be able to get something useful out. I have the logs from one of the pods in one of the fails but it's full with the same exceptions above again and again. Will try to get some info about the JVM though.

Thanks, Ivelina

ivelina-yordanova commented 4 years ago

Hi @cmark,

I could not find anything really. We only keep the logs for 7 days back and the fhir endpoints are behaving as before now (after the index flush and cache clean). Unfortunately, we are also not tracking the metrics at the moment (created a ticket for that now). Is there anything you can think of that I might try to reproduce this (I've got no idea what triggered it)? Also, if it really has to do with the available memory, are there any guidelines as to how much the app would need, does it grow proportionally to something ?

Thanks,

cmark commented 4 years ago

Hi @ivelina-yordanova,

Well, assume that the OOM is indeed related to the jobs cleanup process, then you need to create lots of jobs that produce large result sets (branch compare for example with large changesets) and mark them for deletion. The next time the cleanup task runs it will try to load the entries, but fails to do so due to the lack of free memory. I don't think it is possible to flood the system this way, but in any case, I made a change to reduce the amount of memory required when detecting job entries during the cleanup process (see https://github.com/b2ihealthcare/snow-owl/commit/b43d65810887a7c1cb9a5ac40970c4c8df2c0efe).

How much RAM do you assign to your Snow Owl instance? The recommended setting is mostly based on the use case. Performing large ECL queries, comparisons of branches with large changesets, classifying SNOMED CT and in general, any operation that retrieves large amounts of data at once require lot of memory. The default memory setting in the 7.2.0 is 6 GB. According to our observations and experience, this is enough for general use cases. Known Snow Owl production environments are using around 12-16GB avg of heap.

Let me know if you have any further questions. Cheers, Mark

ivelina-yordanova commented 4 years ago

Hi @cmark,

I have tried bumping up the memory to 16GB and did the following:

and it crashed even with this amount of memory with the exact same exceptions in the beginning and then eventually ending up with OOME.

It's worrying because we do not have all the data we would like to have in there so would the app memory increase proportionally to the dataset amount.

Maybe it would make sense to have some sort of a strategy for such cases so that it does not affect the whole app?

Also, would it be possible to make the metrics endpoint more configurable - the standard is "/metrics" and besides enabling and disabling it we cannot do much so we'd have to do some ugly hack to redirect that.

Thanks,

cmark commented 4 years ago

Hi @ivelina-yordanova,

Could you please share the FHIR requests you have fired with us? (eg. were there any ValueSet expansion or something and also how big is the requested data) By cancelling them you mean the HTTP requests, right?

Maybe it would make sense to have some sort of a strategy for such cases so that it does not affect the whole app?

Good idea, but first, it would be great to figure out what's causing the memory issue exactly. Currently, there is no restriction in terms of the requested dataset size. If the requested amount of data cannot fit into memory before writing it out to the HTTP response, then it will cause OOM eventually.

Also, would it be possible to make the metrics endpoint more configurable - the standard is "/metrics" and besides enabling and disabling it we cannot do much so we'd have to do some ugly hack to redirect that.

What customizations would you like to apply here? And may I ask why do you need to redirect it?

Cheers, Mark

ivelina-yordanova commented 4 years ago

Hi @cmark,

By cancelling them you mean the HTTP requests, right?

Yes

Could you please share the FHIR requests you have fired with us?

The request are the ones I quoted above: 1) GET /fhir/CodeSystem 2) GET /fhir/ValueSet 3) GET /fhir/ConceptMap - which always returns 500 and appears to have a lot to do with this issue

So, I've tried few diff things:

So, my guess it that some exception gets thrown and swallowed somewhere in the /fhir/ConceptMap endpoint handling and when that happens in combination with the other requests which have bigger results sets, it causes all the other problems.

What customizations would you like to apply here? And may I ask why do you need to redirect it?

We've got a prometheus for the whole company that expects all services to expose the metrics at /metrics and I would like to track the snowowl ones, but have to use a hack or something ugly to do that at the moment.

Thanks

cmark commented 4 years ago

Hi @ivelina-yordanova,

GET /fhir/ConceptMap - which always returns 500 and appears to have a lot to do with this issue Could you please report this issue along with the stack trace from the logs in a separate GitHub ticket?

if I only fire a the requests for 1) and 2) normally, without cancelling - they always return 200 and the result sets are fairly big so I am highly suspicions that alone is the root cause.

How big exactly? 100Mb JSON? May I ask, which CodeSystem and which ValueSet?

Anything else happens on the system or just these three requests?

Thanks, Mark

ivelina-yordanova commented 4 years ago

Hi @cmark,

I'll create a separate issue for the /fhir/ConceptMap exceptions but I think it's not entirely unrelated to this issue.

Also, here's more info on the response sizes of the other 2 endpoints:

-rw-r--r--` 1 ivelina.yordanova ** 1051189 8 Jan 09:09 codeSystemsResponse

-rw-r--r-- 1 ivelina.yordanova ** 29383 8 Jan 09:12 valueSetsResponse

Certainly not huge, but the biggest we could get atm. Is it possible for this to have something to do with other config/setup?

How big is the data you have been testing with?

Oh, and nothing else happing in the system, besides maybe an occasional request, but no one is really yet using this besides the team so I'd hope this doesn't cause this.

Thanks,

cmark commented 4 years ago

Hi @ivelina-yordanova,

We are testing mainly with the entire UK dataset internally. It looks like the implementation loads all members of each mapping reference set from the current SNOMED data and that will definitely cause OOMEs eventually. I've changed the logic (https://github.com/b2ihealthcare/snow-owl/commit/0ed6b1c31d218b841514c613fee4fa4f410ce8c8) to exclude the members of a SNOMED CT ConceptMap by default for now. Not sure if there is a use case that requires all members to be loaded at once (especially when you have a million member like in one UK refset) and the specification does not specify any kind of paging for that particular property in the official docs.

We will consider adding a custom solution to paging in Snow Owl FHIR implementation, probably implementating it during the refactoring of how URIs work currently (see branch https://github.com/b2ihealthcare/snow-owl/tree/improvement/codesystem-uris-property).

Could you please try the fix (the 7.x image will be updated in about 30 minutes) in your environment to see whether it fixes the memory issue or not?

Thanks, Mark

ivelina-yordanova commented 4 years ago

Hi @cmark,

I just deployed and tried the 7.x version.

Now, the /fhir/ConceptMap endpoint returns (is not timing out) but it's still 500, this time the Exception is a NPE:

[2020-01-08T10:35:42.752] ERROR qtp208141938-46                                  c.b2international.snowowl.fhir.rest.BaseFhirResourceRestService  Exception during processing of a request java.lang.NullPointerException: null
    at com.b2international.snowowl.snomed.fhir.SnomedConceptMapApiProvider.buildGroup(SnomedConceptMapApiProvider.java:404)
    at com.b2international.snowowl.snomed.fhir.SnomedConceptMapApiProvider.buildConceptMap(SnomedConceptMapApiProvider.java:375)
    at com.b2international.snowowl.snomed.fhir.SnomedConceptMapApiProvider.lambda$1(SnomedConceptMapApiProvider.java:110)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at com.b2international.snowowl.snomed.fhir.SnomedConceptMapApiProvider.lambda$0(SnomedConceptMapApiProvider.java:112)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at com.b2international.snowowl.snomed.fhir.SnomedConceptMapApiProvider.getConceptMaps(SnomedConceptMapApiProvider.java:115)
    at com.b2international.snowowl.fhir.rest.FhirConceptMapRestService.getConceptMaps(FhirConceptMapRestService.java:98)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:190)
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:138)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:104)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:892)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:797)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1039)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:942)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1005)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:897)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:882)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:320)
    at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:127)
    at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:91)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:119)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.session.SessionManagementFilter.doFilter(SessionManagementFilter.java:137)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(AnonymousAuthenticationFilter.java:111)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.servletapi.SecurityContextHolderAwareRequestFilter.doFilter(SecurityContextHolderAwareRequestFilter.java:170)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.savedrequest.RequestCacheAwareFilter.doFilter(RequestCacheAwareFilter.java:63)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.authentication.logout.LogoutFilter.doFilter(LogoutFilter.java:116)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.header.HeaderWriterFilter.doFilterInternal(HeaderWriterFilter.java:74)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:118)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:105)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.context.request.async.WebAsyncManagerIntegrationFilter.doFilterInternal(WebAsyncManagerIntegrationFilter.java:56)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:118)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.FilterChainProxy.doFilterInternal(FilterChainProxy.java:215)
    at org.springframework.security.web.FilterChainProxy.doFilter(FilterChainProxy.java:178)
    at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:357)
    at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:270)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:200)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:118)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1700)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1667)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:505)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
    at java.base/java.lang.Thread.run(Thread.java:834)

This is without messing up with the request and trying cancelling it, it's now really quick so I can't even cancel it.

I have not yet tried to reproduce the OOM but the last time it involved cancelling this endpoint and now it will be impossible

Thanks,

cmark commented 4 years ago

Hi @ivelina-yordanova,

Yeah, that's my fault, I wanted to quickly solve the issue. 😅 Please try again in 30 minutes.

Cheers, Mark

ivelina-yordanova commented 4 years ago

Hi @cmark,

It now works! Well, sort of, it's returning :

{ "resourceType": "OperationOutcome", "issue": [ { "severity": "error", "code": "invalid", "diagnostics": "1 validation error", "details": { "text": "Parameter 'elements' content is invalid [[]]. Violation: may not be empty.", "coding": [ { "code": "msg_param_invalid", "system": "http://hl7.org/fhir/operation-outcome", "display": "Parameter 'elements' content is invalid [[]]. Violation: may not be empty." } ] }, "location": [ "Group.elements" ], "expression": [] } ] } which might be because you're now not expanding the members?

cmark commented 4 years ago

Hi @ivelina-yordanova,

Yup, that was picked up by our build system as well, fixing it right now. Try again in 15 minutes or so.

Cheers, Mark

ivelina-yordanova commented 4 years ago

Hi @cmark,

Now, it worked once, I saw a result, but then started timing out and there are other exceptions :

o.s.w.s.mvc.method.annotation.ExceptionHandlerExceptionResolver  Failure in @ExceptionHandler public com.b2international.snowowl.fhir.core.model.OperationOutcome com.b2international.snowowl.fhir.rest.BaseFhirResourceRestService.handle(java.lang.Exception) org.eclipse.jetty.io.EofException: Closed
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:491)
    at org.springframework.security.web.util.OnCommittedResponseWrapper$SaveContextServletOutputStream.write(OnCommittedResponseWrapper.java:639)
    at com.fasterxml.jackson.core.json.UTF8JsonGenerator._flushBuffer(UTF8JsonGenerator.java:2085)
    at com.fasterxml.jackson.core.json.UTF8JsonGenerator.flush(UTF8JsonGenerator.java:1097)
    at com.fasterxml.jackson.databind.ObjectWriter.writeValue(ObjectWriter.java:915)
    at org.springframework.http.converter.json.AbstractJackson2HttpMessageConverter.writeInternal(AbstractJackson2HttpMessageConverter.java:287)
    at org.springframework.http.converter.AbstractGenericHttpMessageConverter.write(AbstractGenericHttpMessageConverter.java:103)
    at org.springframework.web.servlet.mvc.method.annotation.AbstractMessageConverterMethodProcessor.writeWithMessageConverters(AbstractMessageConverterMethodProcessor.java:290)
    at org.springframework.web.servlet.mvc.method.annotation.RequestResponseBodyMethodProcessor.handleReturnValue(RequestResponseBodyMethodProcessor.java:180)
    at org.springframework.web.method.support.HandlerMethodReturnValueHandlerComposite.handleReturnValue(HandlerMethodReturnValueHandlerComposite.java:82)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:122)
    at org.springframework.web.servlet.mvc.method.annotation.ExceptionHandlerExceptionResolver.doResolveHandlerMethodException(ExceptionHandlerExceptionResolver.java:408)
    at org.springframework.web.servlet.handler.AbstractHandlerMethodExceptionResolver.doResolveException(AbstractHandlerMethodExceptionResolver.java:61)
    at org.springframework.web.servlet.handler.AbstractHandlerExceptionResolver.resolveException(AbstractHandlerExceptionResolver.java:140)
    at org.springframework.web.servlet.handler.HandlerExceptionResolverComposite.resolveException(HandlerExceptionResolverComposite.java:79)
    at org.springframework.web.servlet.DispatcherServlet.processHandlerException(DispatcherServlet.java:1298)
    at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1110)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1056)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:942)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:1005)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:897)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:882)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:320)
    at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.invoke(FilterSecurityInterceptor.java:127)
    at org.springframework.security.web.access.intercept.FilterSecurityInterceptor.doFilter(FilterSecurityInterceptor.java:91)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.access.ExceptionTranslationFilter.doFilter(ExceptionTranslationFilter.java:119)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.session.SessionManagementFilter.doFilter(SessionManagementFilter.java:137)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.authentication.AnonymousAuthenticationFilter.doFilter(AnonymousAuthenticationFilter.java:111)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.servletapi.SecurityContextHolderAwareRequestFilter.doFilter(SecurityContextHolderAwareRequestFilter.java:170)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.savedrequest.RequestCacheAwareFilter.doFilter(RequestCacheAwareFilter.java:63)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.authentication.logout.LogoutFilter.doFilter(LogoutFilter.java:116)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.header.HeaderWriterFilter.doFilterInternal(HeaderWriterFilter.java:74)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:118)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.context.SecurityContextPersistenceFilter.doFilter(SecurityContextPersistenceFilter.java:105)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.context.request.async.WebAsyncManagerIntegrationFilter.doFilterInternal(WebAsyncManagerIntegrationFilter.java:56)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:118)
    at org.springframework.security.web.FilterChainProxy$VirtualFilterChain.doFilter(FilterChainProxy.java:334)
    at org.springframework.security.web.FilterChainProxy.doFilterInternal(FilterChainProxy.java:215)
    at org.springframework.security.web.FilterChainProxy.doFilter(FilterChainProxy.java:178)
    at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:357)
    at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:270)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
    at org.springframework.web.filter.CharacterEncodingFilter.doFilterInternal(CharacterEncodingFilter.java:200)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:118)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1700)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1667)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:505)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
    at java.base/java.lang.Thread.run(Thread.java:834)
cmark commented 4 years ago

Hi @ivelina-yordanova,

After the fixes, it works fine at my end using the full INT + UK data. The request returns 19 mapping reference sets as FHIR ConceptMaps in about 50ms. Executed in 100 times via a script, but could not find any issue that is related to the error above. It looks like the HTTP connection closes before the write completes, probably becaue you cancel the HTTP request. I will try to come up with something that removes the unnecessary Broken pipe errors from the log output.

Cheers, Mark

ivelina-yordanova commented 4 years ago

Hi @cmark ,

So those exceptions are happening randomly, not with every request. Sometimes it will throw the exception and return 504, sometimes exception + 500. Most of the times it would just time out with without an exception. Other small percentage of the times it will actually return 200, but very very very slow, the requests take avg 59.11sec. Our data has 177 ConceptMaps so maybe the amount of data is what trips it.

Regards, Ivelina

cmark commented 4 years ago

Hi @ivelina-yordanova,

Even with the latest changes on 7.x branch?

Cheers, Mark

ivelina-yordanova commented 4 years ago

Think so, the version is 7.3.0.202001081255, the one after the NPE one

cmark commented 4 years ago

Could you please try again with the latest version (7.3.0.202001081522)? That version has some fixes regarding logging and error reporting (on top of the OOME fix).

cmark commented 4 years ago

Also, could you please share how do you execute those HTTP requests? By hand, using the Fetch API or some other JS library (eg. axios)?

ivelina-yordanova commented 4 years ago

I just updated and the behaviour appears to be same on first sight, I mean times out then returns OK, then times out...etc very random. I have not yet seen Exception though.

I started by using swagger but it's very slow and freezes so eventually, the whole day today and yesterday I've been using postman.

cmark commented 4 years ago

Hmm, that almost 60 seconds avg response time might be indicating the root cause of the randomness issue. The default async timeout for HTTP requests I think set to 60s, and if the response makes it to the HTTP layer before the timeout happens then it succeeds and writes the response back otherwise fails with an async timeout exception. You might want to increase the jetty.http.idleTimeout property in the jetty-http.xml file under the configuration directory to more than the current 60000 value.

Regards, Mark

ivelina-yordanova commented 4 years ago

Hi @cmark,

It is slow but don't think we want to increase the idleTimeout for now.

Thanks for fixing the other issues so quickly though. It's now at least safe to use the endpoint, even if its performance is far from ideal.

It sort of got lost in-between the other messages but is there anything you could do about the metrics endpoint thing?

Regards, Ivelina

cmark commented 4 years ago

Hi @ivelina-yordanova,

What would you like to configure exactly in the metrics endpoint? That endpoint provides metrics for a Prometheus DB via the Micrometer API.

Cheers, Mark

ivelina-yordanova commented 4 years ago

Hi @cmark,

The question was whether it's possible to cleanly change the endpoint from admin/stats to /metrics in a way provided by the server and not having to hack a solution.

Thanks

cmark commented 4 years ago

Hi @ivelina-yordanova,

Hmm, at the moment it would be easier to solve this problem in a reverse proxy than in Snow Owl itself. Currently, each API group should have its own path prefix (/admin, /snomed-ct/v3, etc.) otherwise the subpaths will conflict in the underlying Spring handler mapping and cause issues runtime. I recommend redirecting to /admin/stats from a reverse proxy /metrics entry. That solves your problem without any code changes.

Cheers, Mark

ivelina-yordanova commented 4 years ago

Hi @cmark,

Yeah, will do something like that, just wanted to check if you have an easy way of doing it so it's a config change.

Thanks again