crossminer / scava

https://eclipse.org/scava/
Eclipse Public License 2.0

file/thread/socket descriptors leak #244

Open mhow2 opened 5 years ago

mhow2 commented 5 years ago

As already mentioned in Slack in the general channel:

I'm running the regular Docker stack with one slave and two analyses in parallel (year 2018, all metrics) on the projects https://github.com/INRIA/spoon/ and https://gitlab.ow2.org/clif/clif-legacy

Once started, I gather both MPs' PIDs, 10959 (w1, clif-legacy) and 10985 (w2, spoon), and run:

watch 'sudo lsof -p 10959 -p 10985 |wc ; sudo lsof -p 10959 -p 10985 |grep sock |wc ; ps -o thcount 10959 10985'

So I get 1) the number of file descriptors (of any kind), 2) the number of socket descriptors, and 3) the thread count for both.

After some minutes I observe that:

There are 2904 file descriptors open across both PIDs; 1490 of those are sockets (not bound, not in use). w1 has 13007 threads and w2 has 381.

If I wait longer, it eventually reaches 32000 threads, I get java.lang.OutOfMemoryError: unable to create new native thread in the MP's logs, and I lose control of the server, with messages in the active SSH session that the system cannot fork new processes.

blueoly commented 5 years ago

If my code was to blame for this issue, then aac36db should fix it. @mhow2 please check it.

mhow2 commented 5 years ago

Unfortunately, it doesn't solve the issue. I'm still hitting several thousand threads. Did you test it on your side?

On my side, I produced a stack trace; please have a look.

mhow2 commented 5 years ago

I just reproduced the issue with https://gitlab.ow2.org/clif/clif-legacy, so you can look at this project specifically when testing.

creat89 commented 5 years ago

I was able to reproduce the issue. Here is what I used to reproduce it:

flight_recording_180172TheJVMRunningMissionControl_1.zip

EDIT: This time I ran the task in normal mode rather than debug mode, and the day on which it breaks has changed; I'm now past 15/03/18.

creat89 commented 5 years ago

I was able to reproduce the issue again, but this time in normal mode instead of debug mode. Here is what I used to reproduce it:

flight_recording_180172TheJVMRunningMissionControl_2.zip


blueoly commented 5 years ago

I took a look at the recordings that Adrian sent. They do not show excessive memory use, so I could not come to a conclusion.

creat89 commented 5 years ago

I will test with other readers today, just to determine whether the issue comes from the platform or from the GitLab reader.

tdegueul commented 5 years ago

Hi everyone,

After closer inspection, I may have found the problem based on two observations:

This means that during Delta creation, the platform is creating thousands of thread pools that each contain a single thread. Diving a bit into the code, I think the following components (bug tracking system managers) are affected: BitbucketManager, GitLabManager, EclipseForumsManager.

For instance, have a look at https://github.com/crossminer/scava/blob/dev/metric-platform/platform-extensions/org.eclipse.scava.platform.bugtrackingsystem.gitlab/src/org/eclipse/scava/platform/bugtrackingsystem/gitlab/GitLabManager.java#L381-L382, which is invoked when creating the Deltas:

ScheduledExecutorService newClientService = Executors.newSingleThreadScheduledExecutor();
newClientService.scheduleAtFixedRate(runnable, 1, 1, TimeUnit.MINUTES);

This thread pool is never terminated. The code creates a single-thread pool whose task runs indefinitely, once every minute. I am not sure why this is needed in the first place (is this some kind of 'retry' mechanism? why does it only start after 1 minute and not right away?). So we end up stacking up pools of threads that are never closed.

This code is copy/pasted in the other bug tracking managers. @creat89, do you think this might be the issue? If so, could you have a look and implement a mechanism to shut down the pool, or just remove it and do the same thing without relying on threads?
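
For illustration, here is a minimal sketch of one way the pool could be shut down instead of leaked. The field and method names (clientService, startClientRefresh, shutdownClientRefresh) are hypothetical and not taken from the current code:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Inside the manager class: keep a reference to the executor so it can be stopped later.
private ScheduledExecutorService clientService;

private void startClientRefresh(Runnable runnable) {
    clientService = Executors.newSingleThreadScheduledExecutor();
    clientService.scheduleAtFixedRate(runnable, 1, 1, TimeUnit.MINUTES);
}

// Hypothetical hook, called once the manager is done with the project's deltas.
public void shutdownClientRefresh() {
    if (clientService != null) {
        clientService.shutdownNow(); // cancels the periodic task and frees its thread
        clientService = null;
    }
}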

The GitHub manager doesn't reuse the same code so it shouldn't be affected. @mhow2 is it? If so, there might be another issue to fix.

There are other strange things going on in the profiler, but one step at a time ;)

creat89 commented 5 years ago

Hello @tdegueul, I cannot tell exactly, as I'm not the developer of that code; @Danny2097 is. I agree that the thread should be terminated, and yes, the goal is to wait until we can make more API calls for the requests (@Danny2097?). The tests have been done using OW2's GitLab server, which does not set any limit on the number of requests, so this shouldn't matter (I guess). Regarding GitHub, according to @mhow2, this has also been observed in the GitHub reader, but I haven't managed to test GitHub enough yet (as I get an error, though from RESTMULE). I'll try again today.

This morning I tested with Bugzilla and Redmine, and nothing similar was found. However, I was thinking that maybe the projects were too small to reproduce the error. I tried to use the NNTP reader, but the channels that are large enough take too long even to create the first delta.

tdegueul commented 5 years ago

I'm not sure when the code is actually invoked.

The method public void setClient(GitLabTracker gitlabTracker) throws IOException { is kind of weird. It recursively calls itself (?) through threads (?) (i.e. it creates a Runnable that calls the method that creates the Runnable), and, if there is no "too many requests" error (open_id.contains("Too Many Requests") == false), runs this Runnable periodically.

I have a very hard time understanding what it does, and more tests are indeed needed to confirm that this is really what's happening. But I'm pretty sure something is wrong here. Under certain circumstances (no idea which; maybe when a "too many requests" response occurs), you may end up with thousands of threads triggered at a given moment, which in turn create thousands of threads since they call themselves, and so on.
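
To make the suspected pattern concrete, here is a heavily simplified sketch of the shape described above (hypothetical, not the actual GitLabManager code, same imports as the earlier sketch): a method that creates a Runnable which re-enters the method, with every invocation allocating its own single-thread scheduler that is never shut down:

// Heavily simplified sketch of the suspected pattern, not the actual code.
public void setClient() {
    Runnable refresh = () -> {
        // ... check the rate limit and rebuild the client ...
        setClient(); // re-enters the method, which allocates yet another executor
    };
    // Every invocation creates a new single-thread pool that is never shut down,
    // so the number of live scheduler threads keeps growing.
    ScheduledExecutorService service = Executors.newSingleThreadScheduledExecutor();
    service.scheduleAtFixedRate(refresh, 1, 1, TimeUnit.MINUTES);
}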

Danny2097 commented 5 years ago

The idea behind the scheduled executor is to provide the readers with a mechanism that, when the call count reaches zero or an OAuth token expires, pauses until either a new token is generated or the call counter resets.

I am open to other suggestions for implementing this kind of functionality in the readers.
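
One possible direction, sketched here with purely illustrative names (waitUntilCallsAvailable, remainingCalls, resetTimeMillis are not existing reader APIs): block the reader's own thread until the counter resets, so no extra executor has to stay alive:

// Illustrative only: pause the reader's own thread until the API quota resets.
private void waitUntilCallsAvailable(int remainingCalls, long resetTimeMillis)
        throws InterruptedException {
    if (remainingCalls > 0) {
        return; // quota left, nothing to wait for
    }
    long waitMillis = resetTimeMillis - System.currentTimeMillis();
    if (waitMillis > 0) {
        // No executor or extra thread is created, so nothing can leak
        // once the reader finishes its work.
        Thread.sleep(waitMillis);
    }
}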

Danny2097 commented 5 years ago

Following discussions with @tdegueul and @creat89, I have removed the problematic code from the GitLab and Bitbucket readers in commits eb506921c55cddb592ee1933d470a006d6297731, 8b232557ccb488be17554eb9dcfca25b7c9927bb, and d68be58097cbeb634b670ed6defda0d08fb72d07. We are currently working on a fix for Eclipse Forums, as that reader supports OAuth authentication.

tdegueul commented 5 years ago

It seems this was the issue and clif-legacy isn't running wild on my machine anymore... Please let us know @mhow2!

mhow2 commented 5 years ago

Today I deployed the new build and tried twice on the same project, clif-legacy, for the whole of 2018, with all metrics.

The first time, I got a crazy load average (around 100), so I stopped the platform.

On the second try, which is running now, it is not "wild", but I can tell for sure there is still an issue somewhere. Currently, at 63% of the analysis, there are about 4000 threads running on the server and around 1700 sockets opened by the related Java process.

creat89 commented 5 years ago

@mhow2 Which metrics are you running? Are you using all the metrics?

mhow2 commented 5 years ago

All metrics. I got the clif-legacy analysis to finish! ... Well, it's finished, but I still see 2700 sockets open and about 5000 threads.

creat89 commented 5 years ago

Hello @mhow2, I have pushed a commit that might also be related to this issue: https://github.com/crossminer/scava/commit/feecc8d1c4437d2e3668dde94ebea5e6707f013b

This is because the indexing metrics were being executed at the same time as the transient metrics, instead of at the end. Let me know if you see any changes.

blueoly commented 5 years ago

Hello. I just faced the following error while running all the metrics for a period of around a month on this repo: https://github.com/ruby/ruby

https://gist.github.com/blueoly/5d8d8d77ffe9b8a7e216a045dc74456c

creat89 commented 5 years ago

@blueoly, I had that issue too, but over at https://github.com/crossminer/scava-deployment/issues/63. I'm not sure it's related to this one. I thought it was related to using a locally generated version of the metric platform.

mhow2 commented 5 years ago

I'm just posting these two pics to illustrate that thread/memory allocation keeps growing even after the task (on clif-legacy) is reported as finished in the admin UI. threads-pinpoint=1559283364,1559553796 memory-pinpoint=1559285956,1559546884

blueoly commented 5 years ago

Hi @creat89 ,

I saw your issue, but in my case I noticed a memory-related warning before the error stack trace, which is why I posted it here.

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000678f00000, 665321472, 0) failed; error='Cannot allocate memory' (errno=12)

I do not know if this is the case for you too.

tdegueul commented 5 years ago

@mhow2 @blueoly If you're using the latest version of scava-deployment, port 8192 should now be exposed by oss-app to enable remote debugging and profiling. Do you have a public Scava instance where the problem occurs and port 8192 is exposed that I could connect to?

blueoly commented 5 years ago

It was not a public instance. I will ask my admin to open that port and I will come back to you.

tdegueul commented 5 years ago

I forgot: first, you need to change -Djava.rmi.server.hostname=localhost in https://github.com/crossminer/scava-deployment/blob/dev/metric-platform/Dockerfile to your own public hostname.

creat89 commented 5 years ago

Yes, in my case it works. I guess I used the wrong port on Saturday

mhow2 commented 5 years ago

So, I deployed again this morning and I'm running the analysis again on https://gitlab.ow2.org/clif/clif-legacy (all metrics, year 2018).

The analysis hangs at 7% on a historic metric. JMX is available at scava-dev.ow2.org:8192.

mhow2 commented 5 years ago

This observation doesn't seem related to the current leak issue. Please ignore it.