eclipse-researchlabs / scava

Eclipse Public License 2.0

stuck tasks : need people to test reproducing #31

Open davidediruscio opened 4 years ago

davidediruscio commented 4 years ago

It seems I'm lucky enough with the platform to get tasks that hang at some point. The issue happens very randomly, and what I see from the platform with the steps below is that several git pull processes are spawned. It looks weird, at least to me; it feels like the process isn't thread safe. But it could also be a matter of the server itself...

To try to reproduce the thing I'm observing with https://gitlab.ow2.org/clif/clif-legacy, you need:

0) Add a gitlabToken property (see the pinned message in #general on Slack)
1) Import https://gitlab.ow2.org/clif/clif-legacy
2) Create a task from 30/01/2018 to 31/01/2018 with all metrics
3) Run the task and run ps axf within the container (or host), looking for git commands
4) You might see an orphan git pull and the task freeze

If it doesn't freeze, repeat the test:
1) Delete the task
2) Run rm /root/scava/cliflegacy -rf within the container
3) Restart from step 1) of the reproduction steps above

===

I'm quoting @tdegueul's initial finding, which helped us reproduce this bug more reliably:

The blocking thread is:

"pool-38-thread-5" - Thread t@385
  java.lang.Thread.State: WAITING
       at java.lang.Object.wait(Native Method)
       - waiting on <757cba92> (a java.lang.UNIXProcess)
       at java.lang.Object.wait(Object.java:502)
       at java.lang.UNIXProcess.waitFor(UNIXProcess.java:395)
       at org.eclipse.scava.platform.vcs.workingcopy.git.GitWorkingCopyManager.checkout(GitWorkingCopyManager.java:44)
       at org.eclipse.scava.platform.vcs.workingcopy.manager.WorkingCopyFactory.checkout(WorkingCopyFactory.java:91)
       at org.eclipse.scava.metricprovider.rascal.RascalMetricProvider.computeFolders(RascalMetricProvider.java:399)
       at org.eclipse.scava.metricprovider.rascal.RascalMetricProvider.compute(RascalMetricProvider.java:251)
       - locked <21de2b87> (a java.lang.Class)
       at org.eclipse.scava.metricprovider.rascal.RascalMetricProvider.measure(RascalMetricProvider.java:213)
       at org.eclipse.scava.metricprovider.rascal.RascalMetricProvider.measure(RascalMetricProvider.java:1)
       at org.eclipse.scava.platform.osgi.analysis.MetricListExecutor.run(MetricListExecutor.java:103)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)

The checkout() waits on a UNIX process. The related code is:

Process p = Runtime.getRuntime().exec(new String[] { "git", "pull" }, new String[] { }, workingDirectory);
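A common way for such a waitFor() to block forever is when nothing reads the child's stdout/stderr: once the pipe buffer fills up, git blocks on write and never terminates. Below is a minimal sketch of a safer variant (this is not the actual GitWorkingCopyManager code; the class name and timeout value are illustrative), draining the output and bounding the wait:

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.concurrent.TimeUnit;

public class SafeGitPull {
    // Runs `git pull` in workingDirectory, draining its output so the child can never
    // block on a full pipe, and bounding the wait so a hung pull cannot freeze the task.
    static String pull(File workingDirectory) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("git", "pull");
        pb.directory(workingDirectory);
        pb.redirectErrorStream(true); // merge stderr into stdout so one reader is enough
        Process p = pb.start();

        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }

        if (!p.waitFor(10, TimeUnit.MINUTES)) { // illustrative timeout
            p.destroyForcibly();
            throw new IOException("git pull timed out:\n" + output);
        }
        return output.toString();
    }
}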

davidediruscio commented 4 years ago

Please see whether 213db8a1623436fe708cf9f250b2dd5752188651 has any influence.

davidediruscio commented 4 years ago

I detected the same problem; below are the steps I did:

The task doesn't seem to be blocked according to the logs, but the analysis view seems to say the opposite (see the figure below).

oss-app_1         | Jun 06, 2019 6:46:19 AM org.restlet.engine.log.LogFilter afterHandle
oss-app_1         | INFO: 2019-06-06    06:46:19    172.19.0.9    -    172.19.0.6    8182    GET    /analysis/tasks/project/cliflegacy    -    200    27198    0    54    http://oss-app:8182    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36    http://localhost:5601/
dashb-importer_1  | 2019-06-06 06:46:20,959 500 items uploaded to ES (https://admin:admin@elasticsearch:9200/scava-metrics/items/_bulk)
oss-app_1         | Jun 06, 2019 6:46:21 AM org.restlet.engine.log.LogFilter afterHandle

(Screenshot: Captura de pantalla de 2019-06-05 17-51-03)

davidediruscio commented 4 years ago

> Please see whether 213db8a has any influence.

Doesn't seem to on my side.

davidediruscio commented 4 years ago

I've been running a task covering 11 months (2018), all metrics, with the project https://github.com/INRIA/spoon and haven't encountered the issue yet. So maybe it's more related to GitLab projects?

davidediruscio commented 4 years ago

Please, can someone have a look? @ambpro?

davidediruscio commented 4 years ago

@davidediruscio I'll try to reproduce the issue and help, but I'm not in charge of the git readers, which seem to be the cause of the issue. EHU uses the REST API to access the issues from GitLab.

davidediruscio commented 4 years ago

I've been trying to reproduce the issue with other GitLab repositories but I don't see it, and I couldn't find any part of the workingcopy code where the issue could be.

davidediruscio commented 4 years ago

Thanks @creat89. Note that the procedure above doesn't make the issue happen on every try; you may have to run it several times. @MarcioMateus, @borisbaldassari, could you please try? It would only take about 5 minutes of your time. Thanks.

davidediruscio commented 4 years ago

@blueoly, could you help find what the issue is?

davidediruscio commented 4 years ago

@blueoly, @tdegueul, I have some questions.

davidediruscio commented 4 years ago

@mhow2 Martin, I will try to reproduce it tonight and come back with any updates.

@creat89 There is no chance of requesting many commits from the same day. My metrics, and I am pretty sure @tdegueul's metrics as well, use only the last commit of each day.

I am not sure, but I believe that requesting the same commit multiple times is not a problem, because the commit will already be checked out from the first request, and a new checkout will not do anything except inform the repo that the commit is ready.

In the same fashion, I think that the checks you suggest are "heavier" than a plain git pull. If the commit is already pulled, nothing is going to happen (not exactly nothing; a check will be made to see whether the repo is up to date). If the commit is not there, a pull will be executed, but only once per commit. I am not a git expert, but I don't feel this creates considerable overhead or a reason for the platform to get stuck. And if this were related to git pull, it should affect not only GitLab projects but also GitHub.
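For illustration only (the exact checks discussed above are not quoted in this thread), one cheap form such a check could take is asking git whether a commit object already exists locally before deciding to pull; the class and method names below are hypothetical:

import java.io.File;

public class CommitPresenceCheck {
    // Returns true if the given commit is already present in the local clone,
    // using `git cat-file -e <sha>^{commit}` (exit code 0 means the object exists).
    static boolean hasCommit(File workingDirectory, String sha) throws Exception {
        Process p = new ProcessBuilder("git", "cat-file", "-e", sha + "^{commit}")
                .directory(workingDirectory)
                .start();
        return p.waitFor() == 0;
    }
}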

davidediruscio commented 4 years ago

We could also add the --verbose flag to the git pull command to see when and if it gets stuck. Its output will be visible through the platform logging, right?
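Note that git only writes the --verbose output to the child process's stdout/stderr, so it shows up in the platform logs only if those streams are read and re-logged. A minimal sketch of that (the class and logger names are hypothetical, not the platform's actual logging API):

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.logging.Logger;

public class VerboseGitPull {
    private static final Logger LOG = Logger.getLogger("GitWorkingCopy"); // hypothetical logger name

    static void pullVerbose(File workingDirectory) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("git", "pull", "--verbose")
                .directory(workingDirectory)
                .redirectErrorStream(true) // merge stderr (where git reports progress) into stdout
                .start();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                LOG.info("git pull: " + line); // forward each line to the platform log
            }
        }
        p.waitFor();
    }
}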

davidediruscio commented 4 years ago

I did some testing on this case. I saw behaviour similar to the one that @valeriocos reported. However, in one run the task crashed (at -68% progress).

I saw the following exception on oss-app log:

Exception in thread "pool-9-thread-5" java.lang.StackOverflowError
    at java.lang.Class.searchMethods(Class.java:3003)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at org.apache.commons.logging.LogFactory.directGetContextClassLoader(LogFactory.java:847)
    at org.apache.commons.logging.LogFactory$1.run(LogFactory.java:817)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:814)
    at org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:421)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:657)
    at org.apache.http.conn.ssl.DefaultHostnameVerifier.<init>(DefaultHostnameVerifier.java:72)
    at org.apache.http.impl.nio.client.HttpAsyncClientBuilder.build(HttpAsyncClientBuilder.java:661)
    at org.elasticsearch.client.RestClientBuilder$2.run(RestClientBuilder.java:216)
    at org.elasticsearch.client.RestClientBuilder$2.run(RestClientBuilder.java:213)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.elasticsearch.client.RestClientBuilder.createHttpClient(RestClientBuilder.java:213)
    at org.elasticsearch.client.RestClientBuilder.access$000(RestClientBuilder.java:42)
    at org.elasticsearch.client.RestClientBuilder$1.run(RestClientBuilder.java:186)
    at org.elasticsearch.client.RestClientBuilder$1.run(RestClientBuilder.java:183)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.elasticsearch.client.RestClientBuilder.build(RestClientBuilder.java:183)
    at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:181)
    at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:202)
    at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:202)

After this I saw multiple repetitions of the last line.

Does this ring any bell? I am not sure whether this behaviour is related to the initial problem reported by @mhow2 or is a different one.
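The repeated createHighLevelClientLocal frames at the bottom of the trace suggest the indexer retries the client creation by calling itself when the connection fails, which eventually overflows the stack. A sketch of that pattern and a bounded alternative (hypothetical code, not the actual IndexerSingleton source):

public class IndexerClientRetry {
    // Pattern implied by the repeated stack frames: retrying by recursion has no bound,
    // so a persistently refused connection ends in StackOverflowError.
    Object createClientRecursive() {
        try {
            return buildClient();
        } catch (Exception e) {
            return createClientRecursive(); // unbounded self-call on failure
        }
    }

    // Bounded alternative: loop with a retry limit and a simple back-off.
    Object createClientBounded() throws InterruptedException {
        for (int attempt = 1; attempt <= 5; attempt++) {
            try {
                return buildClient();
            } catch (Exception e) {
                Thread.sleep(1000L * attempt); // wait a bit longer before each retry
            }
        }
        throw new IllegalStateException("Could not connect to Elasticsearch after 5 attempts");
    }

    // Placeholder standing in for the RestClientBuilder.build(...) call seen in the trace.
    private Object buildClient() throws Exception {
        throw new UnsupportedOperationException("connect to Elasticsearch here");
    }
}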

davidediruscio commented 4 years ago

Hi @MarcioMateus, @creat89 and I do not think that https://github.com/crossminer/scava/issues/276#issuecomment-502269094 is related to this issue (#276). However, please provide us with some further information about the stack trace and we can look into it further.

davidediruscio commented 4 years ago

I did a new test. The task froze for some minutes at -100% (minus!), then resumed, and then crashed at -22%.

You can see the stack traces of the metrics below; several metrics presented some kind of error: stack-trace.json.txt

I still see a similar error on oss-app:

oss-app_1         | Exception in thread "pool-9-thread-3" INFO  [MetricListExecutor (cliflegacy, 20180130)] (15:13:48): Starting execution AnalysisTask 'cliflegacy:task4' with MetricExecution 'trans.rascal.OO.java.A-Java'
oss-app_1         | INFO  [MetricListExecutor (cliflegacy, 20180130)] (15:13:48): Adding dependencies to metricProvider 'trans.rascal.OO.java.A-Java' into project 'cliflegacy'
oss-app_1         | INFO  [MetricListExecutor (cliflegacy, 20180130)] (15:13:48): Added dependencies to metricProvider 'trans.rascal.OO.java.A-Java' into project 'cliflegacy'
oss-app_1         | INFO  [AnalysisSchedulingService] (15:13:48): Starting MetricExecution 'trans.rascal.OO.java.A-Java'
oss-app_1         | INFO  [AnalysisSchedulingService] (15:13:48): Update the worker 'w1' heartBeat Mon Jun 17 15:13:48 UTC 2019'
oss-app_1         | INFO  [AnalysisSchedulingService] (15:13:48): Starting MetricExecution 'trans.rascal.OO.java.A-Java' is done.
oss-app_1         | INFO  [AnalysisSchedulingService] (15:13:48): Starting find out MetricExecution 'trans.rascal.OO.java.A-Java' on Project 'cliflegacy'
oss-app_1         | INFO  [MetricListExecutor (cliflegacy, 20180130)] (15:13:48): Starting Metric Execution (A_Java).
oss-app_1         | java.lang.StackOverflowError
oss-app_1         |     at java.lang.Class.searchMethods(Class.java:3003)
oss-app_1         |     at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
oss-app_1         |     at java.lang.Class.getMethod0(Class.java:3018)
oss-app_1         |     at java.lang.Class.getMethod(Class.java:1784)
oss-app_1         |     at org.apache.commons.logging.LogFactory.directGetContextClassLoader(LogFactory.java:847)
oss-app_1         |     at org.apache.commons.logging.LogFactory$1.run(LogFactory.java:817)
oss-app_1         |     at java.security.AccessController.doPrivileged(Native Method)
oss-app_1         |     at org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:814)
oss-app_1         |     at org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:421)
oss-app_1         |     at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:657)
oss-app_1         |     at org.apache.http.impl.nio.client.MainClientExec.<init>(MainClientExec.java:82)
oss-app_1         |     at org.apache.http.impl.nio.client.HttpAsyncClientBuilder.build(HttpAsyncClientBuilder.java:847)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder$2.run(RestClientBuilder.java:216)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder$2.run(RestClientBuilder.java:213)
oss-app_1         |     at java.security.AccessController.doPrivileged(Native Method)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder.createHttpClient(RestClientBuilder.java:213)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder.access$000(RestClientBuilder.java:42)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder$1.run(RestClientBuilder.java:186)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder$1.run(RestClientBuilder.java:183)
oss-app_1         |     at java.security.AccessController.doPrivileged(Native Method)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder.build(RestClientBuilder.java:183)
oss-app_1         |     at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:181)
oss-app_1         |     at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:202)
oss-app_1         |     at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:202)

But before this error, I find this message multiple times:

oss-app_1         | INFO  [Indexer Singleton] (15:13:47): Client Connection Failure
oss-app_1         | INFO  [Indexer Singleton] (15:13:47): Client Connection Failure
oss-app_1         | INFO  [Indexer Singleton] (15:13:47): java.io.IOException: Connection refused

I don't know which client is having problems...

Let me know if you want me to do any specific test.

davidediruscio commented 4 years ago

Hi @MarcioMateus, I don't think the issue you are reporting is the same, based on the observations @valeriocos and I made (stuck git pull processes). I think you could open another issue report for yours. However, if you can try to reproduce the steps we have described, that would be helpful.

davidediruscio commented 4 years ago

@MarcioMateus Open another issue relating to the client and I will look into it.

davidediruscio commented 4 years ago

This is a profiler snapshot from VisualVM, taken at the time the issue happens:

stucktask-cliflegacy.nps.gz

davidediruscio commented 4 years ago

The same happened to me on the project https://github.com/RocketChat/Rocket.Chat, so it's not specifically related to GitLab.

davidediruscio commented 4 years ago

I guess that, in order to improve the debugging of this issue, we need more verbose logging of when the pull commands are called in each project. Apart from the method that downloads a local copy of a repository and calls pull on every checkout, is there any other metric doing pulls? This is in order to focus only on the readers and metrics that could trigger this issue, instead of running all the metrics.

davidediruscio commented 4 years ago

+1 @creat89

BTW I bumped into the issue with project https://github.com/RocketChat/Rocket.Chat (all metrics enabled). It's stuck on:

    13 ?        Sl     2:29 /usr/bin/java -Xmx8g -Dcom.sun.management.jmxremote.rmi.port=8192 -Dcom.sun.management.jmxremote=tr
  470 ?        S      0:00  \_ git pull
  539 ?        S      0:00      \_ git merge FETCH_HEAD

The last line in the metric-platform logs is:

INFO [MetricListExecutor (RocketChat, 20180102)] (09:34:27): Ending execution AnalysisTask 'RocketChat:task' with MetricExecution 'style.spreadOfStyleViolations.historic' done in 9 ms

davidediruscio commented 4 years ago

What I can say is that the metric providers in the original description of #312 don't seem to be involved.

davidediruscio commented 4 years ago

After running the project https://github.com/chaoss/augur with all metrics, to address https://github.com/crossminer/scava/issues/328, I ran ps axf within the oss-app container and got the following output:

root@dfe65d594f34:/ossmeter# ps axf
  PID TTY      STAT   TIME COMMAND
19304 pts/2    Ss     0:00 bash
19314 pts/2    R+     0:00  \_ ps axf
18405 pts/1    Ss+    0:00 bash
20434 pts/0    Ss+    0:00 bash
    1 ?        Ss     0:00 ./eclipse -master -apiServer -worker w1 -config prop.properties
   13 ?        Sl   117:44 /usr/local/openjdk-8/bin/java -Xmx8g -Dcom.sun.management.jmxremote.rmi.port=8192 -Dcom.sun.management.jmxremote=true -Dcom.su
  491 ?        Z      0:00 [git-remote-http] <defunct>
  494 ?        Z      0:00 [git-remote-http] <defunct>
  499 ?        Z      0:00 [git-remote-http] <defunct>
  511 ?        Z      0:00 [git-remote-http] <defunct>
  512 ?        Z      0:00 [git-remote-http] <defunct>
  513 ?        Z      0:00 [git-remote-http] <defunct>
  520 ?        Z      0:00 [git-remote-http] <defunct>
 6081 ?        Z      0:00 [git-remote-http] <defunct>
 6082 ?        Z      0:00 [git-remote-http] <defunct>
 6083 ?        Z      0:00 [git-remote-http] <defunct>
 6086 ?        Z      0:00 [git-remote-http] <defunct>
 6087 ?        Z      0:00 [git-remote-http] <defunct>
14893 ?        Z      0:00 [git-remote-http] <defunct>
14894 ?        Z      0:00 [git-remote-http] <defunct>
14897 ?        Z      0:00 [git-remote-http] <defunct>
14898 ?        Z      0:00 [git-remote-http] <defunct>
14899 ?        Z      0:00 [git-remote-http] <defunct>
14900 ?        Z      0:00 [git-remote-http] <defunct>
14910 ?        Z      0:00 [git-remote-http] <defunct>

It seems we are still not properly closing some of the git connections.

davidediruscio commented 4 years ago

Hello everybody,

I found this bug in the Eclipse issue tracker:

https://bugs.eclipse.org/bugs/show_bug.cgi?id=545936

In summary, it says that JGit (the library behind the git pull in CROSSMINER) has issues when it is called multiple times if the Java version is 11. @MarcioMateus and @mhow2, which version of Java are you using?

davidediruscio commented 4 years ago

Hi @creat89,

We are running the metric-platform in a Docker container based on openjdk:8-jdk, so JDK 8.

davidediruscio commented 4 years ago

I have an interesting case today:

root@75a55e4531a2:/ossmeter# ps aux

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  25136   104 ?        Ss   Sep19   0:00 ./eclipse -worker w2 -config prop.properties
root        94 24.0  6.7 15073780 2194452 ?    Sl   Sep19 347:06 /usr/local/openjdk-8/bin/java -Xmx8g -Dcom.sun.management.jmx
root     28050  0.0  0.0 258936  2264 ?        S    Sep19   0:00 git checkout 7850b4a2dd3d2fbd84f5b08c8077c045bf7a3265
root     28130  1.0  0.0  19948  3532 pts/0    Ss   09:54   0:00 bash
root     28135  0.0  0.0  38380  2956 pts/0    R+   09:54   0:00 ps aux

As you can see, the git checkout process has been stuck since yesterday. Below is the related worker:

(worker screenshot attached)

And the container logs: oss-slave-container.log.gz