Hello @LukasSmrcka, did you also notice this behaviour in older versions of the SDK?
I'll try to reproduce and report back.
@LukasSmrcka I tried reproducing this but wasn't able to do so yet.
After hitting an API endpoint returning a ClassPathResource a few million times over a few minutes, I don't see the number of Scope objects increase in the heap dumps. I'm also not able to reproduce with FileSystemResource. Can you please share more details on how to reproduce this? Ideally a controller method that helps reproduce the issue.
I understand. I never managed to reproduce it in a controlled way on our test environments using tools that flood the server with requests. It just always happens in production.
One more clue I can add: when we don't use the input stream but convert the file stream to a byte array just before returning it from the REST controller, there is no leak, it just works. But when streams are used, it is somehow leaking.
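To make the two cases concrete, here is a rough sketch of what the controller does (a simplified illustration with placeholder names and paths, not our actual code). The streaming variant is where we see the leak; the byte-array variant is not:

import org.springframework.core.io.InputStreamResource
import org.springframework.core.io.Resource
import org.springframework.http.ResponseEntity
import org.springframework.web.bind.annotation.GetMapping
import org.springframework.web.bind.annotation.PathVariable
import org.springframework.web.bind.annotation.RestController
import java.io.File

@RestController
class FileController {

    // leaking case: the InputStream is handed to WebFlux and consumed later
    @GetMapping("/files/{name}")
    fun streamed(@PathVariable name: String): ResponseEntity<Resource> {
        val file = File("/data/cache/$name")
        return ResponseEntity.ok(InputStreamResource(file.inputStream()))
    }

    // workaround with no leak: read the whole file into memory before returning
    @GetMapping("/files-bytes/{name}")
    fun buffered(@PathVariable name: String): ResponseEntity<ByteArray> =
        ResponseEntity.ok(File("/data/cache/$name").readBytes())
}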
Another thing worth pointing out is that for the first few hours the service is running, all looks good. The heap allocation graph has a "saw" look and GC is very stable. After a few hours something suddenly changes, the heap graph looks weird, and it slowly starts to leak.
I know this info is not of much use. I'm just mentioning these things because someone might get an idea of what the cause could be. I will try to bring some more info in the upcoming hours or days. I will try to dig deeper into the memory dumps; hopefully I will find something helpful.
@LukasSmrcka just to make sure, you are not using our File I/O instrumentation, right?
No, we don't use it. We use Spring Boot auto-configuration with breadcrumbs disabled in application.yml.
@romtsn By the way, the setting we used after the issues with breadcrumbs is:
sentry:
dsn: https://..........
traces-sample-rate: 1.0
enable-user-interaction-breadcrumbs: off
max-breadcrumbs: 1
But I still see the same count of breadcrumb instances on the heap as Sentry Scope instances. Is it possible to turn them off? It looks like only "max-breadcrumbs" is applied but not "enable-user-interaction-breadcrumbs: off". Or do I understand the config wrong?
enable-user-interaction-breadcrumbs: off should have no effect on the backend at all. What do you mean by "same count of breadcrumbs instances"? Do you mean CircularFifoQueue instances?
@LukasSmrcka you could try an older version of the Sentry SDK to see if that still has the problem. Other than that you could check if pushScope is only called in SentryWebFilter or also in other places. Maybe there's something else.
An alternative would be to wait for https://github.com/getsentry/sentry-java/issues/2546 but that is only for Spring Boot 3. Not sure you can upgrade easily.
Could the problem be caused by streams not being consumed completely or something similar?
Sorry .. the counts are slightly different, but almost the same. Those are counts of instances of each class in memory.
Ok, I will check your suggestions. We had the same issue with Sentry libs version 6.4.1. I upgraded to the latest because I noticed some leak fixes in the changelog.
I am currently running an instance without Sentry to be 100% sure the Sentry leak is not just a secondary effect of something else.
Streams not being consumed completely? It could actually be an option, I can check if there are such cases. Could you please elaborate more? How could this cause the issue?
It's just a thought. Maybe something isn't cleaned up until the stream is consumed or timeouts are too high. I'm not very familiar with WebFlux. Maybe there's some docs for this case. I don't know. May also have to do with backpressure.
Ok. Tomorrow morning I should have some conclusions from the latest version running on prod without Sentry. Then I will try to take a closer look at this.
The prod server memory does not show any signs of a leak after 20 hours. So Sentry is for sure somehow involved when streams are returned. One more time I want to point out that when the stream is converted to a byte array in the controller before it is returned to the calling client, there is also no leak. So I am going to take a closer look at your suggestions now.
I have also checked the heap dump for the stream instances. There is literally a zero count of the input streams we are using. So it does not look like anything is left open. Do you have any ideas what diagnostics I should add to get more info?
Can you filter your heap dump objects by io.sentry and check if there are as many Hubs as there are Scopes and Breadcrumbs? I suspect that somehow hubs are kept alive. If there are many, maybe you can find something that keeps a reference to the hubs and thus prevents them from being garbage collected.
If you could share some more details on what you're doing inside your controllers etc. I can play around locally some more. Like what classes are you using. Where are you reading from. Are you wrapping stuff with reactor Mono/Flux etc.?
The analyzed heap dump has 8 Hub instances, 1 HubAdapter, 1 NoOpHub. ..... It looks like hub.popScope(); is not called by the web filter, or maybe hub.pushScope(); is called multiple times. I am currently investigating what hot publisher behavior would do with the WebFilter when there are more subscribers. I believe there is such a case in our app, but the count of unreleased stack items does not match the cases where this scenario might happen, so I guess it is not the issue.
I guess there is no chance that popScope() would fail, right?
By the way, it is a Kotlin app, so requests are running in a coroutine scope. 99% of requests are just files from disk loaded as a stream, wrapped into an InputStreamResource and returned by the controller.
I guess there is no chance that popScope() would fail, right?
Never say never. I'll try some more stuff and see if i can reproduce.
I am trying to get more info about the "doFinally" block that is used in SentryWebFilter. I'm also running a new test on a beta environment. I am flooding the server with 10 download requests per second and every download is aborted in the middle by the client app. Based on some info, doFinally might not be called in situations like this.
I deployed to production a very simple WebFilter monitoring the count of simultaneous requests. After some time it will show whether the doFirst and doFinally behavior is correct for all cases. After this has been running for some time we should know if the problem is in Reactor or in the Sentry library.
override fun filter(serverWebExchange: ServerWebExchange, webFilterChain: WebFilterChain): Mono<Void> {
    return webFilterChain
        .filter(serverWebExchange)
        .doFinally(Consumer { requestCounter.decrementAndGet() })
        .doFirst(Runnable { requestCounter.incrementAndGet() })
}
@LukasSmrcka thanks. That's a great idea. Hopefully we can learn something from it.
So I have some results. These methods (doFirst, doFinally) look reliable. After 85K requests, there was always a doFinally for every doFirst. However, there are 38K io.sentry.Stack.StackItem instances in memory, and of course all the related instances like Scope, FiFoQueue and so on. So it really looks like there is a situation where the "pop" from the stack fails. The Spring Boot starter is configuring only the WebFilter and nothing else, right?
Is there a way to really disable those breadcrumbs? Currently I only have the limit set to "1" ... would 0 work?
Oh, and one more question. Is Sentry for WebFlux internally working with the Reactor Context?
Is there a way to really disable those breadcrumbs? Currently I only have the limit set to "1" ... would 0 work?
Setting it to 0 doesn't work in the current version.
This should allow you to disable breadcrumbs until we can find a fix:
@Bean
SentryOptions.BeforeBreadcrumbCallback beforeBreadcrumbCallback() {
    return new SentryOptions.BeforeBreadcrumbCallback() {
        @Override
        public @Nullable Breadcrumb execute(@NotNull Breadcrumb breadcrumb, @NotNull Hint hint) {
            return null;
        }
    };
}
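If you prefer Kotlin (since your app is Kotlin), roughly the same thing as a sketch, assuming a Kotlin @Configuration class of your own:

import io.sentry.SentryOptions
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class SentryBreadcrumbConfig {
    @Bean
    fun beforeBreadcrumbCallback(): SentryOptions.BeforeBreadcrumbCallback =
        // returning null drops every breadcrumb before it reaches the scope
        SentryOptions.BeforeBreadcrumbCallback { _, _ -> null }
}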
The Spring Boot starter is configuring only the WebFilter and nothing else, right?
There's more. There's also SentryScheduleHook, SentryRequestResolver and SentryWebExceptionHandler.
Is Sentry for WebFlux internally working with the Reactor Context?
In our integration for Spring Boot 2, no. We couldn't get it to work in a way that's easily usable for developers. We're about to merge something that'll use it for Spring Boot 3.0.3+. Do you have a suspicion there?
I'm still unable to reproduce. Will try a few more things.
Think I'm on to something. I can reproduce now even without coroutines etc. Seems like SentryWebFilter and SentryScheduleHook don't play well together. We call .pushScope() on one hub and .popScope() on a different one (that is a grandchild of the hub we call .pushScope() on). Working on a fix.
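Roughly, the imbalance looks like this (a simplified sketch to illustrate the idea, not the actual SentryWebFilter/SentryScheduleHook code):

import io.sentry.Sentry

fun sketchOfTheImbalance() {
    val requestHub = Sentry.getCurrentHub()  // hub bound to the request thread
    requestHub.pushScope()                   // StackItem pushed onto requestHub's Stack

    // SentryScheduleHook clones the hub for work scheduled onto another thread
    val schedulerHub = requestHub.clone()

    // the pop later happens on the thread where the clone is current,
    // so requestHub's StackItem is never removed
    schedulerHub.popScope()
}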
Using this code in a controller to reproduce:
@GetMapping("{id}")
Mono<Person> person(@PathVariable Long id) {
    // LOGGER.info("Loading person with id={}", id);
    // throw new IllegalArgumentException("Something went wrong [id=" + id + "]");
    return personService.create(new Person("fir", "la"))
        .flatMap(it -> Mono.error(new RuntimeException("thrown on purpose for webflux")));
}
And then blast it with some requests using wrk -t2 -c100 -d30s http://user:password@localhost:8080/person/91
Will update here once I know more.
Regarding the Reactor context: I have found a place in the code where, in my opinion, the coroutine context is not passed correctly, so I was just wondering if this could break your code somehow.
I am also going to check out the ScheduleHook, as there are some schedulers running very often in our app.
And I will try to run a new version without breadcrumbs with your suggested config
Are you already using https://github.com/getsentry/sentry-java/blob/29232bd93911d320cb71002752f96560ebde20a9/sentry-kotlin-extensions/src/main/java/io/sentry/kotlin/SentryContext.kt#L12 via the sentry-kotlin-extensions dependency?
No, we don't use this. I guess we should, judging from the documentation. But is it just for covering coroutine threads with Sentry, or do you think it might actually cause the issue?
I don't think it's responsible for the memory issue you're facing but I've got a PR in the works to improve the situation for WebFlux. Need to do some more testing before pushing it. The sentry-kotlin-extensions dependency should allow you to have correct hub propagation in coroutines, meaning e.g. breadcrumbs end up in the right error. Without it I guess all interactions inside coroutines would leak into other requests etc.
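For reference, a minimal sketch of wiring it into a coroutine (the function and scope names are illustrative, not from your code):

import io.sentry.kotlin.SentryContext
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.launch

fun launchWithSentry(scope: CoroutineScope = CoroutineScope(Dispatchers.Default)) {
    // SentryContext clones the current hub and propagates it into the coroutine,
    // so breadcrumbs and tags recorded inside stay attached to the right request
    scope.launch(SentryContext()) {
        // ... your work here ...
    }
}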
Yes I understand. It looks good. I am gonna use it.
I tried the config above to disable breadcrumbs and the behavior is the same. I think the bug you are solving might be the cause, as the scheduler is used really heavily in the app. Do you need anything from me now? Or can I just wait for the fix?
Hmm I would've expected it to at least reduce the amount of memory that is consumed. The amount of Stacks, StackItems and Scopes should remain the same. We just merged the fix and are planning a release soon.
The fix doesn't entirely get rid of the problem but should improve the situation by a lot. I'd like to further improve this in a later release by freeing hubs after we think they're no longer needed. At the moment the ThreadLocals keep a reference to the Hub and prevent it from being garbage collected. Currently there may still be one hub per thread including its Stack, StackItems and Scope etc. which stays alive until something else comes along and replaces the hub then allowing it to finally be GCd.
Well ... there are no longer breadcrumbs in memory, so the leak is smaller in terms of the amount of memory. There seems to be some leak reduction, but I cannot tell exactly. 28K StackItems after 17 hours and 145K requests. Looks better for sure; it seems to progress more slowly, but that could just be accidental, so nothing relevant for any conclusions.
Ok, I will try the fix when it is ready. Shouldn't ThreadLocals be avoided? Or are you using them somehow together with the Reactor context?
We're heavily relying on ThreadLocal to offer the static Sentry.capture... API. For Spring Boot 3.0.3 we'll be able to leverage ThreadLocalAccessor for hub propagation, which works better than what we had before (also included in the next release).
@LukasSmrcka 6.15.0 of the Java SDK has just been released. Can you please give it a try and let us know whether the memory issue has been improved?
Sure. On it
It is running now. We will see some results tomorrow
@adinauer After 18 hours of running and 155K requests there is no Sentry leak. Good job. I think you have fixed it.
That's great to hear. Apologies for the troubles.
Thanks for fixing that. Later I will also give breadcrumbs a try. For now they are disabled.
Played around with my idea for only referencing an ID instead of the whole hub but wasn't able to quickly make it work in all cases (e.g. ThreadLocalAccessor).
Closing this now as the issue seems resolved. Please feel free to reopen in case you still encounter problems.
@adinauer After half a million requests all is good. Now I have deployed a new version with breadcrumbs enabled. I will let you know if anything goes wrong.
Looks like all is good even with breadcrumbs. No signs of a leak. Thank you one more time.
Integration
sentry-spring-boot-starter
Java Version
11
Version
6.13.1
Steps to Reproduce
Hi, we are using a Kotlin-based application on Spring WebFlux and Kotlin coroutines. Versions: Spring Boot 2.7.0, Spring WebFlux 5.3.20.
The application is basically a cache. 99% of all HTTP requests hitting the app get a file from disk in the response. Everything is streamed. The application is currently handling 10 HTTP requests per second.
The first issue we had was with Sentry breadcrumbs, which we had to disable because after a couple of hours they were able to reach the 1GB heap limit we have. After disabling breadcrumbs the issue is still there. I am not sure if it is something different or we just cut the size of the leaking objects by disabling breadcrumbs. Anyway, I can observe in memory that the count of io.sentry.Scope instances and all the referenced objects (ConcurrentHashMaps, ...) is just rising and rising. After 24 hours there are more than 50K instances of io.sentry.Scope.
When the Sentry Spring Boot package is removed, the memory is stable. It looks like it is somehow related to those file streams, because there are no issues with the other WebFlux-based applications we have. Any idea what could be the issue? How could we help to fix it?
Best regards, Lukas
Expected Result
No memory leak when returning files on http requests
Actual Result
Those strings/byte arrays on top are part of the Scope aggregates