mhow2 opened this issue 5 years ago
If my code was to blame for this issue, then aac36db should fix it. @mhow2 please check it.
Unfortunately, it doesn't solve the issue. I'm still hitting several thousand threads. Did you test it on your side?
On my side, I did produce a stack trace; please have a look.
I just reproduced the issue with https://gitlab.ow2.org/clif/clif-legacy; you can have a look at this project specifically when testing.
I was able to reproduce the issue. Here are the things that I used to reproduce it:
org.eclipse.scava.metricprovider.trans.plaintextprocessing.PlainTextProcessingTransMetricProvider
as Martin indicated to us that the metrics were not the issue.
org.eclipse.jdt.internal.debug.core.model 1,858 77.611%
flight_recording_180172TheJVMRunningMissionControl_1.zip
EDIT: I'm running the task this time not in debug mode, and the day where it breaks has changed. I'm now past 15/03/18.
I was able to reproduce the issue again, but this time in normal mode instead of debug mode. Here are the things that I used to reproduce it:
I took a look at the recordings that Adrian sent. The recordings do not show that there is an excessive use of memory. I could not come to a conclusion.
I will test today with other readers, just to know whether the issue comes from the platform or from the GitLab reader.
Hi everyone,
After closer inspection, I may have found the problem based on two observations:
In the profiler, I can see thousands of threads named pool-X-thread-1, with X reaching the tens of thousands. This means that during Delta creation, the platform is creating thousands of thread pools that all contain a single thread. Diving a bit into the code, I think the following components (bug tracking system managers) are affected: BitbucketManager, GitLabManager, EclipseForumsManager.
For instance, have a look at https://github.com/crossminer/scava/blob/dev/metric-platform/platform-extensions/org.eclipse.scava.platform.bugtrackingsystem.gitlab/src/org/eclipse/scava/platform/bugtrackingsystem/gitlab/GitLabManager.java#L381-L382, which is invoked when creating the Deltas:
ScheduledExecutorService newClientService = Executors.newSingleThreadScheduledExecutor();
newClientService.scheduleAtFixedRate(runnable, 1, 1, TimeUnit.MINUTES);
This thread pool never terminates: it creates a single-thread pool that runs indefinitely, once every minute. I am not sure why this is needed in the first place (is this some kind of 'retry' mechanism? why does it only start after 1 minute and not right away?). So we end up stacking pools of threads that are never closed.
This code is copy/pasted in other bug tracking managers. @creat89, do you think this might be the issue? If so, could you have a look and implement a mechanism to shut down the pool, or just remove it and do the same thing without relying on threads?
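If it helps, here is a minimal sketch of what I have in mind (the class, field and method names below are made up for illustration, not the actual manager code): keep a single executor per manager and shut it down once the delta creation is finished, instead of creating a new pool on every call:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ClientRefreshSketch {
    // One executor per manager instance, reused across calls.
    private ScheduledExecutorService clientRefresher;

    public void startClientRefresh(Runnable refreshTask) {
        // Only create the pool once; subsequent calls reuse it.
        if (clientRefresher == null || clientRefresher.isShutdown()) {
            clientRefresher = Executors.newSingleThreadScheduledExecutor();
            clientRefresher.scheduleAtFixedRate(refreshTask, 1, 1, TimeUnit.MINUTES);
        }
    }

    // To be called when the manager is done (e.g. after the delta is created),
    // so the thread does not outlive the task.
    public void stopClientRefresh() {
        if (clientRefresher != null) {
            clientRefresher.shutdownNow();
        }
    }
}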
The GitHub manager doesn't reuse the same code so it shouldn't be affected. @mhow2 is it? If so, there might be another issue to fix.
There are other strange things going on in the profiler, but one step at a time ;)
Hello @tdegueul, I cannot tell exactly as I'm not the developer of that code; that's @Danny2097. I agree that the thread should be terminated, and yes, the goal is to wait until we can make more calls to do the requests. (@Danny2097?) The tests have been done using OW2's GitLab server, which does not set any type of limit on the number of requests, so this shouldn't be affected (I guess). With regards to GitHub, according to @mhow2, this has been observed in the GitHub reader as well. But I haven't managed to test GitHub enough (as I get an error, but from RESTMULE). I'll try again today.
This morning I tested with Bugzilla and Redmine, and nothing similar was found. However, I was thinking that maybe the projects were too small to reproduce the error. I tried to use the NNTP reader, but the channels that are large enough take too long even to create the first delta.
I'm not sure when the code is actually invoked.
The method public void setClient(GitLabTracker gitlabTracker) throws IOException is kinda weird. The method recursively calls itself (?) through the use of threads (?) (i.e., it creates a Runnable that calls the method that creates the Runnable), and, if there is no "too many requests" (open_id.contains("Too Many Requests") == false), runs this Runnable periodically.
I have a very hard time understanding what it does, and indeed more tests are needed to understand if this is really what's happening. But I'm pretty sure there is something wrong here. Under certain circumstances (no idea which -- maybe when there's a "too many requests" happening), you may end up having thousands of threads triggered at a given moment, which will create thousands of threads in turn since they call themselves, etc.
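To make the suspicion concrete, here is a stripped-down illustration of the pattern I think I'm seeing (hypothetical names, not the actual GitLabManager code): every invocation creates a fresh single-thread pool that periodically re-invokes the same method, so the number of pools, and therefore threads, keeps multiplying:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RecursiveSchedulingSketch {
    public void setClient() {
        // ... build the client, check the rate limit, etc. ...

        // The scheduled task calls the very method that scheduled it.
        Runnable retry = this::setClient;

        // A brand-new pool on every call, and the reference is dropped,
        // so it can never be shut down: after N invocations there are
        // N pools, each holding one live "pool-X-thread-1" thread.
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        pool.scheduleAtFixedRate(retry, 1, 1, TimeUnit.MINUTES);
    }
}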
The idea behind the scheduled executor is to provide the readers with a mechanism that, when the call count reaches zero or an OAuth token expires, pauses until either a new token is generated or the call counter resets.
I am open to other suggestions on how to implement this kind of functionality in the readers.
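For what it's worth, one thread-free alternative could be to make the reader itself block until the rate-limit window resets, instead of scheduling a recurring task on a new executor. A rough sketch (remainingCalls and resetTime are hypothetical parameters, not the actual reader API):

import java.time.Duration;
import java.time.Instant;

public class RateLimitWaitSketch {
    // Block the calling thread until the rate-limit window resets.
    public static void waitForReset(int remainingCalls, Instant resetTime)
            throws InterruptedException {
        if (remainingCalls > 0) {
            return; // still allowed to call the API
        }
        Duration wait = Duration.between(Instant.now(), resetTime);
        if (!wait.isNegative()) {
            Thread.sleep(wait.toMillis() + 1000); // small safety margin
        }
    }
}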
Following discussions with @tdegueul and @creat89, I have removed the problematic code from the gitlab and bitbucket readers in commits eb506921c55cddb592ee1933d470a006d6297731, 8b232557ccb488be17554eb9dcfca25b7c9927bb, and d68be58097cbeb634b670ed6defda0d08fb72d07. We are currently working on a fix for the eclipse forums reader, as it supports OAuth authentication.
It seems this was the issue and clif-legacy isn't running wild on my machine anymore... Please let us know @mhow2!
Today I deployed the new build and tried twice on the same project, clif-legacy, whole of 2018, all metrics.
The first time I got a crazy load average (around 100), so I stopped the platform.
On this second try it is not "wild", but I can tell for sure there is still an issue somewhere. Currently at 63% of the analysis, there are about 4000 threads running on the server and around 1700 sockets opened by the related java process.
@mhow2 Which metrics are you running? Are you using all the metrics?
All metrics. I got the clif-legacy analysis to finish! ... Well, it's finished, but I still see 2700 sockets open and about 5000 threads.
Hello @mhow2, I have pushed a commit that might also be related to this issue: https://github.com/crossminer/scava/commit/feecc8d1c4437d2e3668dde94ebea5e6707f013b
The indexing metrics were being executed at the same time as the transient metrics, instead of at the end. Let me know if you see any changes.
Hello. I just faced the following error while running all the metrics for a period of around a month on this repo: https://github.com/ruby/ruby.
https://gist.github.com/blueoly/5d8d8d77ffe9b8a7e216a045dc74456c
@blueoly, I had that issue, but over here: https://github.com/crossminer/scava-deployment/issues/63. I'm not sure it's related to this issue. I thought it was related to using a locally generated version of the metric platform.
I'm just posting these two pics to illustrate the fact that the thread/memory allocation keeps growing even after the task (on clif-legacy) is reported as finished in the admin UI.
Hi @creat89,
I saw your issue, but in my case I noticed a memory-related warning before the error stack trace, which is the reason I put it here.
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000678f00000, 665321472, 0) failed; error='Cannot allocate memory' (errno=12)
I do not know if this is the case for you too.
@mhow2 @blueoly If you're using the latest version of scava-deployment, port 8192 should now be exposed by oss-app to enable remote debugging and profiling. Do you have public instances of Scava where the problem occurs and port 8192 is exposed that I could connect to?
It was not a public instance. I will ask my admin to open that port and I will come back to you.
I forgot: first, you need to update -Djava.rmi.server.hostname=localhost in https://github.com/crossminer/scava-deployment/blob/dev/metric-platform/Dockerfile with your own public hostname.
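In case it is useful, here is a small Java sketch for checking the numbers remotely over that JMX port (assuming the endpoint is reachable without authentication; the host below is just an example):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ThreadCountProbe {
    public static void main(String[] args) throws Exception {
        // JMX/RMI URL of the remote JVM exposing port 8192.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://scava-dev.ow2.org:8192/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection connection = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
            System.out.println("Live threads:  " + threads.getThreadCount());
            System.out.println("Peak threads:  " + threads.getPeakThreadCount());
            System.out.println("Total started: " + threads.getTotalStartedThreadCount());
        }
    }
}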
Yes, in my case it works. I guess I used the wrong port on Saturday.
So, I deployed again this morning and I'm running the analysis again on https://gitlab.ow2.org/clif/clif-legacy, all metrics, year 2018. The analysis hangs at 7% on the historic metric. JMX is available at scava-dev.ow2.org:8192.
This observation doesn't seem related to the current leak issue. Please ignore it.
As already mentioned in Slack in the general channel:
I'm running the regular docker stack with one slave: two analyses in parallel, year 2018, all metrics, on projects https://github.com/INRIA/spoon/ and https://gitlab.ow2.org/clif/clif-legacy
Once started, I gather both MPs' PIDs, 10959 (w1, clif-legacy) and 10985 (w2, spoon), and run:
watch 'sudo lsof -p 10959 -p 10985 |wc ; sudo lsof -p 10959 -p 10985 |grep sock |wc ; ps -o thcount 10959 10985'
So I get 1) the number of file descriptors (of any kind), 2) the number of socket descriptors, and 3) the thread count for both.
After some minutes I observe that:
There are 2904 file descriptors open across both PIDs, 1490 of which are sockets (not bound, not used). w1 has 13007 threads and w2 has 381.
If I wait longer, it will eventually reach 32000 threads, I get java.lang.OutOfMemoryError: unable to create new native thread in the MP's logs, and I lose all control of the server, getting messages in the active SSH session that the system cannot fork new processes.