stuck tasks : need people to test reproducing

davidediruscio commented 4 years ago

It seems I'm lucky enough with the platform to have task that hangs at some point. The issue happen very random, and what I see from the platform with the steps below is several git pull are spawned. It looks weird, at least to me, it feels like the process isn't thread safe. But it can also be a matter with the server itself....

To try to reproduce the thing I'm observing with https://gitlab.ow2.org/clif/clif-legacy, you need:

0) add a gitlabToken property (see pinned message in #general in slack) 1) import https://gitlab.ow2.org/clif/clif-legacy 2) task 30/01/2018 to 31/01/2018 all metrics 3) run the task and ps axf within the container (or host) seeking for git commands 4) you might see an orphan git pull and the task to freeze

If it doesn't freeze, loop over the testing again : 1) delete the task 2) rm /root/scava/cliflegacy -rf within the container 3) restart from 1) at the reproducing steps

===

I'm quoting @tdegueul with his initial finding that led to better reproduce this bug

The blocking thread is :

"pool-38-thread-5" - Thread t@385
  java.lang.Thread.State: WAITING
       at java.lang.Object.wait(Native Method)
       - waiting on <757cba92> (a java.lang.UNIXProcess)
       at java.lang.Object.wait(Object.java:502)
       at java.lang.UNIXProcess.waitFor(UNIXProcess.java:395)
       at org.eclipse.scava.platform.vcs.workingcopy.git.GitWorkingCopyManager.checkout(GitWorkingCopyManager.java:44)
       at org.eclipse.scava.platform.vcs.workingcopy.manager.WorkingCopyFactory.checkout(WorkingCopyFactory.java:91)
       at org.eclipse.scava.metricprovider.rascal.RascalMetricProvider.computeFolders(RascalMetricProvider.java:399)
       at org.eclipse.scava.metricprovider.rascal.RascalMetricProvider.compute(RascalMetricProvider.java:251)
       - locked <21de2b87> (a java.lang.Class)
       at org.eclipse.scava.metricprovider.rascal.RascalMetricProvider.measure(RascalMetricProvider.java:213)
       at org.eclipse.scava.metricprovider.rascal.RascalMetricProvider.measure(RascalMetricProvider.java:1)
       at org.eclipse.scava.platform.osgi.analysis.MetricListExecutor.run(MetricListExecutor.java:103)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)

the checkout() waits for a UNIX process the related code is : Process p = Runtime.getRuntime().exec(new String[] { "git", "pull"}, new String[] { }, workingDirectory);

davidediruscio commented 4 years ago

Please see whether 213db8a1623436fe708cf9f250b2dd5752188651 has any influence.

davidediruscio commented 4 years ago

I detect the same problem, below the steps I did:

docker system prune -a
docker-compose -f docker-compose-build.yml build --no-cache
docker-compose -f docker-compose-build.yml up
add gitlab token
add project https://gitlab.ow2.org/clif/clif-legacy
fill task 30/01/2018 to 31/01/2018 all metrics
start task

enter in oss-app and execute ps axf

slimbook@slimbook-KATANA:~/Escritorio/sources/scava-deployment$ docker exec -it 53c009843d63 bash
root@53c009843d63:/ossmeter# ls
artifacts.xml  configuration  eclipse  eclipse.ini  features  git  p2  plugins    prop.properties  wait-for-it.sh  workspace
root@53c009843d63:/ossmeter# ps axf
PID TTY      STAT   TIME COMMAND
251 pts/0    Ss     0:00 bash
259 pts/0    R+     0:00  \_ ps axf
1 ?        Ss     0:00 ./eclipse -master -apiServer -worker w1 -config prop.properties
128 ?        Sl     3:42 /usr/local/openjdk-8/bin/java -Xmx8g -Dcom.sun.management.jmxremote.rmi.port=8192 -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=8192 -Dcom.sun.manageme
217 ?        S      0:00  \_ git clone https://gitlab.ow2.org/clif/clif-legacy.git /root/scava/cliflegacy/workingCopies/https_3a__2f__2f_gitlab_2e_ow2_2e_org_2f_clif_2f_clif_2d_legacy_2e_git
219 ?        S      0:02  |   \_ git-remote-https origin https://gitlab.ow2.org/clif/clif-legacy.git
238 ?        Sl     0:00  |       \_ git fetch-pack --stateless-rpc --stdin --lock-pack --thin --check-self-contained-and-connected --cloning --no-progress https://gitlab.ow2.org/clif/clif-legacy.git/
244 ?        R      0:06  |           \_ git index-pack --stdin --fix-thin --keep=fetch-pack 238 on 53c009843d63 --check-self-contained-and-connected --pack_header=2,38080
221 ?        S      0:00  \_ git pull
223 ?        S      0:00  |   \_ git fetch --update-head-ok
228 ?        S      0:02  |       \_ git-remote-https origin https://gitlab.ow2.org/clif/clif-legacy.git
237 ?        Sl     0:00  |           \_ git fetch-pack --stateless-rpc --stdin --lock-pack --thin --no-progress https://gitlab.ow2.org/clif/clif-legacy.git/
246 ?        S      0:05  |               \_ git index-pack --stdin --fix-thin --keep=fetch-pack 237 on 53c009843d63 --pack_header=2,38058
222 ?        S      0:00  \_ git pull
225 ?        S      0:00  |   \_ git fetch --update-head-ok
227 ?        S      0:01  |       \_ git-remote-https origin https://gitlab.ow2.org/clif/clif-legacy.git
240 ?        Sl     0:00  |           \_ git fetch-pack --stateless-rpc --stdin --lock-pack --thin --no-progress https://gitlab.ow2.org/clif/clif-legacy.git/
248 ?        S      0:04  |               \_ git index-pack --stdin --fix-thin --keep=fetch-pack 240 on 53c009843d63 --pack_header=2,38058
231 ?        S      0:00  \_ git pull
233 ?        S      0:00      \_ git fetch --update-head-ok
234 ?        S      0:02          \_ git-remote-https origin https://gitlab.ow2.org/clif/clif-legacy.git
242 ?        Sl     0:00              \_ git fetch-pack --stateless-rpc --stdin --lock-pack --thin --no-progress https://gitlab.ow2.org/clif/clif-legacy.git/
250 ?        R      0:05                  \_ git index-pack --stdin --fix-thin --keep=fetch-pack 242 on 53c009843d63 --pack_header=2,38058

execute again ps axf

root@53c009843d63:/ossmeter# ps axf
PID TTY      STAT   TIME COMMAND
251 pts/0    Ss     0:00 bash
478 pts/0    R+     0:00  \_ ps axf
1 ?        Ss     0:00 ./eclipse -master -apiServer -worker w1 -config prop.properties
128 ?        Sl     3:59 /usr/local/openjdk-8/bin/java -Xmx8g -Dcom.sun.management.jmxremote.rmi.port=8192 -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=8192 -Dcom.sun.manageme
231 ?        S      0:00  \_ git pull

the task doesn't seem to be blocked from the logs, but the analysis seems to say the opposite (see figure below)

oss-app_1         | Jun 06, 2019 6:46:19 AM org.restlet.engine.log.LogFilter afterHandle
oss-app_1         | INFO: 2019-06-06    06:46:19    172.19.0.9    -    172.19.0.6    8182    GET    /analysis/tasks/project/cliflegacy    -    200    27198    0    54    http://oss-app:8182    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36    http://localhost:5601/
dashb-importer_1  | 2019-06-06 06:46:20,959 500 items uploaded to ES (https://admin:admin@elasticsearch:9200/scava-metrics/items/_bulk)
oss-app_1         | Jun 06, 2019 6:46:21 AM org.restlet.engine.log.LogFilter afterHandle

Captura de pantalla de 2019-06-05 17-51-03

davidediruscio commented 4 years ago

Please see whether 213db8a has any influence.

Doesn't seem to on my side.

davidediruscio commented 4 years ago

I've been running a task over 11 months (2018), all metrics, with project https://github.com/INRIA/spoon and haven't encountered the issue yet. So maybe it's more related to gitlab projects ?

davidediruscio commented 4 years ago

Please, is someone can have a look ? @ambpro ?

davidediruscio commented 4 years ago

@davidediruscio I'll try to reproduce the issue and help, but I'm not charged of the git readers, which seems to be the cause of the issue. EHU uses the REST API to access the issues from Gitlab.

davidediruscio commented 4 years ago

I've been trying to reproduce the issue with other Gitlab repositories but I don't see the issue, and I couldn't see any part of the workingcopy code where could it be an issue.

davidediruscio commented 4 years ago

Thanks @creat89 . Notice that the procedure above doesn't make the issue to happen for every try. You may have to run it several times. Please @MarcioMateus , @borisbaldassari , could you try ? It would just take about 5min of your time. Thanks.

davidediruscio commented 4 years ago

@blueoly , could you help to find what is the issue ?

davidediruscio commented 4 years ago

@blueoly, @tdegueul, I have some question.

Is it possible that multiple versions of the same commit (or commits from the same day) are being requested at (more or less) the same time and causing collisions? In the sense that different metrics based on files, which do not have common dependencies, might be asking for multiple local versions at the same time for the same project?
I see that every time that we ask for a checkout of the repository, there are two possibles task to do:
- Do a clone if the repo doesn't exist
- Do a pull if the repo exists I think to pull every time that we request for a checkout is quite extreme, but at the same time I understand the reason to use it: We cannot be sure whether the current pull will cover the commit that we are looking for. Especially, when we are analyzing a day that happens after the starting day of the analysis. What happens if, we include an extra condition before requesting a pull that checks whether the current pull contains the commit? git branch --contains $COMMIT_ID or git merge-base --is-ancestor $COMMIT_ID HEAD If the branch does not contain the commit, then do a pull.
As well, @mhow2 was suggesting maybe to consolidate the clone or update of a repository, so it is done just once. I think it could be possible to do at the GitManager level.

davidediruscio commented 4 years ago

@mhow2 Martin, I will try to reproduce it tonight and come back with any updates.

@creat89 There is no chance to request many commits from the same day. My metrics and I am pretty sure that @tdegueul metrics use only the last commit of each day.

I am not sure but I believe that requesting the same commit multiple times is not a problem because this commit will be already checked out form the first request and a new checkout will not do anything but inform the repo that the commit is ready.

In the same fashion, I think that the checks that you suggest are more "heavy" than a plain git pull. If the commit is already pulled, nothing is going to happen (not exactly nothing, a check will be made to check if the repo is up to date). If the commit is not there, a pull will be executed but it will be executed once for each commit. I am not a git expert but I feel that this is not something that will create a considerable overhead or reasons to stuck the platform. And if this was related with git pull, it should affect not only gitlab projects but also github.

davidediruscio commented 4 years ago

We also add the --verbose flag in the git pull command to see when and if it gets stuck. Its output will be visible through the platform logging, right?

davidediruscio commented 4 years ago

I did some testing on this case. I saw a behaviour similar to the that one that @valeriocos reported. However, in one run the task crashed (at -68% progress)

I saw the following exception on oss-app log:

Exception in thread "pool-9-thread-5" java.lang.StackOverflowError
    at java.lang.Class.searchMethods(Class.java:3003)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at org.apache.commons.logging.LogFactory.directGetContextClassLoader(LogFactory.java:847)
    at org.apache.commons.logging.LogFactory$1.run(LogFactory.java:817)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:814)
    at org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:421)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:657)
    at org.apache.http.conn.ssl.DefaultHostnameVerifier.<init>(DefaultHostnameVerifier.java:72)
    at org.apache.http.impl.nio.client.HttpAsyncClientBuilder.build(HttpAsyncClientBuilder.java:661)
    at org.elasticsearch.client.RestClientBuilder$2.run(RestClientBuilder.java:216)
    at org.elasticsearch.client.RestClientBuilder$2.run(RestClientBuilder.java:213)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.elasticsearch.client.RestClientBuilder.createHttpClient(RestClientBuilder.java:213)
    at org.elasticsearch.client.RestClientBuilder.access$000(RestClientBuilder.java:42)
    at org.elasticsearch.client.RestClientBuilder$1.run(RestClientBuilder.java:186)
    at org.elasticsearch.client.RestClientBuilder$1.run(RestClientBuilder.java:183)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.elasticsearch.client.RestClientBuilder.build(RestClientBuilder.java:183)
    at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:181)
    at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:202)
    at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:202)

After this I saw multiple repetitions of the last line.

This this ring any bell? I am not sure if this behaviour is related with the problem inicial problem reported by @mhow2 or is a different one.

davidediruscio commented 4 years ago

Hi @MarcioMateus, myself and @creat89 do not think that https://github.com/crossminer/scava/issues/276#issuecomment-502269094 is related to this issue (#276). However, please provide us with some further information surrounding the stack trace and we can look into it further.

davidediruscio commented 4 years ago

I did a new test. The task froze for some minutes at -100% (minus!) but started and then crashed on at -22%.

You can see below the stack trace of the metrics. Several metrics presented some kind of error. stack-trace.json.txt

I still see similar error on oss-app:

oss-app_1         | Exception in thread "pool-9-thread-3" INFO  [MetricListExecutor (cliflegacy, 20180130)] (15:13:48): Starting execution AnalysisTask 'cliflegacy:task4' with MetricExecution 'trans.rascal.OO.java.A-Java'
oss-app_1         | INFO  [MetricListExecutor (cliflegacy, 20180130)] (15:13:48): Adding dependencies to metricProvider 'trans.rascal.OO.java.A-Java' into project 'cliflegacy'
oss-app_1         | INFO  [MetricListExecutor (cliflegacy, 20180130)] (15:13:48): Added dependencies to metricProvider 'trans.rascal.OO.java.A-Java' into project 'cliflegacy'
oss-app_1         | INFO  [AnalysisSchedulingService] (15:13:48): Starting MetricExecution 'trans.rascal.OO.java.A-Java'
oss-app_1         | INFO  [AnalysisSchedulingService] (15:13:48): Update the worker 'w1' heartBeat Mon Jun 17 15:13:48 UTC 2019'
oss-app_1         | INFO  [AnalysisSchedulingService] (15:13:48): Starting MetricExecution 'trans.rascal.OO.java.A-Java' is done.
oss-app_1         | INFO  [AnalysisSchedulingService] (15:13:48): Starting find out MetricExecution 'trans.rascal.OO.java.A-Java' on Project 'cliflegacy'
oss-app_1         | INFO  [MetricListExecutor (cliflegacy, 20180130)] (15:13:48): Starting Metric Execution (A_Java).
oss-app_1         | java.lang.StackOverflowError
oss-app_1         |     at java.lang.Class.searchMethods(Class.java:3003)
oss-app_1         |     at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
oss-app_1         |     at java.lang.Class.getMethod0(Class.java:3018)
oss-app_1         |     at java.lang.Class.getMethod(Class.java:1784)
oss-app_1         |     at org.apache.commons.logging.LogFactory.directGetContextClassLoader(LogFactory.java:847)
oss-app_1         |     at org.apache.commons.logging.LogFactory$1.run(LogFactory.java:817)
oss-app_1         |     at java.security.AccessController.doPrivileged(Native Method)
oss-app_1         |     at org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:814)
oss-app_1         |     at org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:421)
oss-app_1         |     at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:657)
oss-app_1         |     at org.apache.http.impl.nio.client.MainClientExec.<init>(MainClientExec.java:82)
oss-app_1         |     at org.apache.http.impl.nio.client.HttpAsyncClientBuilder.build(HttpAsyncClientBuilder.java:847)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder$2.run(RestClientBuilder.java:216)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder$2.run(RestClientBuilder.java:213)
oss-app_1         |     at java.security.AccessController.doPrivileged(Native Method)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder.createHttpClient(RestClientBuilder.java:213)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder.access$000(RestClientBuilder.java:42)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder$1.run(RestClientBuilder.java:186)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder$1.run(RestClientBuilder.java:183)
oss-app_1         |     at java.security.AccessController.doPrivileged(Native Method)
oss-app_1         |     at org.elasticsearch.client.RestClientBuilder.build(RestClientBuilder.java:183)
oss-app_1         |     at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:181)
oss-app_1         |     at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:202)
oss-app_1         |     at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:202)

But before this error, if find multiple times this message:

oss-app_1         | INFO  [Indexer Singleton] (15:13:47): Client Connection Failure
oss-app_1         | INFO  [Indexer Singleton] (15:13:47): Client Connection Failure
oss-app_1         | INFO  [Indexer Singleton] (15:13:47): java.io.IOException: Connection refused

I don't know which client is having problems...

Let me know if you want me to do any specific test.

davidediruscio commented 4 years ago

hi @MarcioMateus I don't think the issue you are reporting is the same, believing the observations @valeriocos and myself made (stuck git pull processes). I think you could open another issue report for yours. However, if you can try to reproduce the steps we have described, that would be helpful.

davidediruscio commented 4 years ago

@MarcioMateus Open another issue for relating to the client and I will look into it.

davidediruscio commented 4 years ago

This is a profiler snapshot for VisualVM, shot at the time the issue happens

stucktask-cliflegacy.nps.gz

davidediruscio commented 4 years ago

The same happened to me on project https://github.com/RocketChat/Rocket.Chat so it's not especially related to gitlab

davidediruscio commented 4 years ago

I guess that in order to improve the debugging of this issue, it is necessary to make more verbose when it is being called the pull commands in each project. Apart from the method that downloads a local copy of a repository and calls pull every checkout, is there any other metric doing pulls? This is in order to focus only on the readers and metrics that could trigger this issue, instead of running all the metrics.

davidediruscio commented 4 years ago

+1 @creat89

BTW I bumped into the issue with project https://github.com/RocketChat/Rocket.Chat (all metrics enabled). It's stuck on:

    13 ?        Sl     2:29 /usr/bin/java -Xmx8g -Dcom.sun.management.jmxremote.rmi.port=8192 -Dcom.sun.management.jmxremote=tr
  470 ?        S      0:00  \_ git pull
  539 ?        S      0:00      \_ git merge FETCH_HEAD

The last line in metric-platform logs is:

INFO [MetricListExecutor (RocketChat, 20180102)] (09:34:27): Ending execution AnalysisTask 'RocketChat:task' with MetricExecution 'style.spreadOfStyleViolations.historic' done in 9 ms

davidediruscio commented 4 years ago

What I can say is the metric providers in the original description of #312 don't look concerned.

davidediruscio commented 4 years ago

After running project https://github.com/chaoss/augur with all metrics, to address https://github.com/crossminer/scava/issues/328, I runned ps axf within the oss-app container and got the following output

root@dfe65d594f34:/ossmeter# ps axf
  PID TTY      STAT   TIME COMMAND
19304 pts/2    Ss     0:00 bash
19314 pts/2    R+     0:00  \_ ps axf
18405 pts/1    Ss+    0:00 bash
20434 pts/0    Ss+    0:00 bash
    1 ?        Ss     0:00 ./eclipse -master -apiServer -worker w1 -config prop.properties
   13 ?        Sl   117:44 /usr/local/openjdk-8/bin/java -Xmx8g -Dcom.sun.management.jmxremote.rmi.port=8192 -Dcom.sun.management.jmxremote=true -Dcom.su
  491 ?        Z      0:00 [git-remote-http] <defunct>
  494 ?        Z      0:00 [git-remote-http] <defunct>
  499 ?        Z      0:00 [git-remote-http] <defunct>
  511 ?        Z      0:00 [git-remote-http] <defunct>
  512 ?        Z      0:00 [git-remote-http] <defunct>
  513 ?        Z      0:00 [git-remote-http] <defunct>
  520 ?        Z      0:00 [git-remote-http] <defunct>
 6081 ?        Z      0:00 [git-remote-http] <defunct>
 6082 ?        Z      0:00 [git-remote-http] <defunct>
 6083 ?        Z      0:00 [git-remote-http] <defunct>
 6086 ?        Z      0:00 [git-remote-http] <defunct>
 6087 ?        Z      0:00 [git-remote-http] <defunct>
14893 ?        Z      0:00 [git-remote-http] <defunct>
14894 ?        Z      0:00 [git-remote-http] <defunct>
14897 ?        Z      0:00 [git-remote-http] <defunct>
14898 ?        Z      0:00 [git-remote-http] <defunct>
14899 ?        Z      0:00 [git-remote-http] <defunct>
14900 ?        Z      0:00 [git-remote-http] <defunct>
14910 ?        Z      0:00 [git-remote-http] <defunct>

It seems we are still not properly closing some of the git connections.

davidediruscio commented 4 years ago

Hello everybody,

I found this bug in Eclipse Issue tracker:

https://bugs.eclipse.org/bugs/show_bug.cgi?id=545936

In summary, it says that jgit (the library behind doing the git pull in CROSSMINER) has issues when it is called multiple times if the Java version is the 11. @MarcioMateus and @mhow2, which version of Java are you using?

davidediruscio commented 4 years ago

HI @creat89 ,

We are running the the metric-platform on a docker container based on openjdk:8-jdk, so JDK 8.

davidediruscio commented 4 years ago

I have an intersting case today: root@75a55e4531a2:/ossmeter# ps aux

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  25136   104 ?        Ss   Sep19   0:00 ./eclipse -worker w2 -config prop.properties
root        94 24.0  6.7 15073780 2194452 ?    Sl   Sep19 347:06 /usr/local/openjdk-8/bin/java -Xmx8g -Dcom.sun.management.jmx
root     28050  0.0  0.0 258936  2264 ?        S    Sep19   0:00 git checkout 7850b4a2dd3d2fbd84f5b08c8077c045bf7a3265
root     28130  1.0  0.0  19948  3532 pts/0    Ss   09:54   0:00 bash
root     28135  0.0  0.0  38380  2956 pts/0    R+   09:54   0:00 ps aux

As you see the process git checkout is stuck since yesterday. Below is the related worker:

And the container logs: oss-slave-container.log.gz

eclipse-researchlabs / scava

stuck tasks : need people to test reproducing #31