mhow2 opened 5 years ago
Please see whether 213db8a1623436fe708cf9f250b2dd5752188651 has any influence.
I see the same problem; below are the steps I took:
docker system prune -a
docker-compose -f docker-compose-build.yml build --no-cache
docker-compose -f docker-compose-build.yml up
https://gitlab.ow2.org/clif/clif-legacy
30/01/2018 to 31/01/2018
all metrics

slimbook@slimbook-KATANA:~/Escritorio/sources/scava-deployment$ docker exec -it 53c009843d63 bash
root@53c009843d63:/ossmeter# ls
artifacts.xml configuration eclipse eclipse.ini features git p2 plugins prop.properties wait-for-it.sh workspace
root@53c009843d63:/ossmeter# ps axf
PID TTY STAT TIME COMMAND
251 pts/0 Ss 0:00 bash
259 pts/0 R+ 0:00 \_ ps axf
1 ? Ss 0:00 ./eclipse -master -apiServer -worker w1 -config prop.properties
128 ? Sl 3:42 /usr/local/openjdk-8/bin/java -Xmx8g -Dcom.sun.management.jmxremote.rmi.port=8192 -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=8192 -Dcom.sun.manageme
217 ? S 0:00 \_ git clone https://gitlab.ow2.org/clif/clif-legacy.git /root/scava/cliflegacy/workingCopies/https_3a__2f__2f_gitlab_2e_ow2_2e_org_2f_clif_2f_clif_2d_legacy_2e_git
219 ? S 0:02 | \_ git-remote-https origin https://gitlab.ow2.org/clif/clif-legacy.git
238 ? Sl 0:00 | \_ git fetch-pack --stateless-rpc --stdin --lock-pack --thin --check-self-contained-and-connected --cloning --no-progress https://gitlab.ow2.org/clif/clif-legacy.git/
244 ? R 0:06 | \_ git index-pack --stdin --fix-thin --keep=fetch-pack 238 on 53c009843d63 --check-self-contained-and-connected --pack_header=2,38080
221 ? S 0:00 \_ git pull
223 ? S 0:00 | \_ git fetch --update-head-ok
228 ? S 0:02 | \_ git-remote-https origin https://gitlab.ow2.org/clif/clif-legacy.git
237 ? Sl 0:00 | \_ git fetch-pack --stateless-rpc --stdin --lock-pack --thin --no-progress https://gitlab.ow2.org/clif/clif-legacy.git/
246 ? S 0:05 | \_ git index-pack --stdin --fix-thin --keep=fetch-pack 237 on 53c009843d63 --pack_header=2,38058
222 ? S 0:00 \_ git pull
225 ? S 0:00 | \_ git fetch --update-head-ok
227 ? S 0:01 | \_ git-remote-https origin https://gitlab.ow2.org/clif/clif-legacy.git
240 ? Sl 0:00 | \_ git fetch-pack --stateless-rpc --stdin --lock-pack --thin --no-progress https://gitlab.ow2.org/clif/clif-legacy.git/
248 ? S 0:04 | \_ git index-pack --stdin --fix-thin --keep=fetch-pack 240 on 53c009843d63 --pack_header=2,38058
231 ? S 0:00 \_ git pull
233 ? S 0:00 \_ git fetch --update-head-ok
234 ? S 0:02 \_ git-remote-https origin https://gitlab.ow2.org/clif/clif-legacy.git
242 ? Sl 0:00 \_ git fetch-pack --stateless-rpc --stdin --lock-pack --thin --no-progress https://gitlab.ow2.org/clif/clif-legacy.git/
250 ? R 0:05 \_ git index-pack --stdin --fix-thin --keep=fetch-pack 242 on 53c009843d63 --pack_header=2,38058
root@53c009843d63:/ossmeter# ps axf
PID TTY STAT TIME COMMAND
251 pts/0 Ss 0:00 bash
478 pts/0 R+ 0:00 \_ ps axf
1 ? Ss 0:00 ./eclipse -master -apiServer -worker w1 -config prop.properties
128 ? Sl 3:59 /usr/local/openjdk-8/bin/java -Xmx8g -Dcom.sun.management.jmxremote.rmi.port=8192 -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=8192 -Dcom.sun.manageme
231 ? S 0:00 \_ git pull
From the logs, the task doesn't seem to be blocked, but the analysis seems to say the opposite (see figure below)
oss-app_1 | Jun 06, 2019 6:46:19 AM org.restlet.engine.log.LogFilter afterHandle
oss-app_1 | INFO: 2019-06-06 06:46:19 172.19.0.9 - 172.19.0.6 8182 GET /analysis/tasks/project/cliflegacy - 200 27198 0 54 http://oss-app:8182 Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36 http://localhost:5601/
dashb-importer_1 | 2019-06-06 06:46:20,959 500 items uploaded to ES (https://admin:admin@elasticsearch:9200/scava-metrics/items/_bulk)
oss-app_1 | Jun 06, 2019 6:46:21 AM org.restlet.engine.log.LogFilter afterHandle
Please see whether 213db8a has any influence.
Doesn't seem to on my side.
I've been running a task over 11 months (2018), all metrics, with project https://github.com/INRIA/spoon and haven't encountered the issue yet. So maybe it's more related to GitLab projects?
Please, can someone have a look? @ambpro?
@davidediruscio I'll try to reproduce the issue and help, but I'm not in charge of the git readers, which seem to be the cause of the issue. EHU uses the REST API to access the issues from GitLab.
I've been trying to reproduce the issue with other GitLab repositories, but I don't see the issue, and I couldn't find any part of the working-copy code where the issue could be.
Thanks @creat89. Note that the procedure above doesn't make the issue happen on every try; you may have to run it several times. Please @MarcioMateus, @borisbaldassari, could you try? It would only take about 5 minutes of your time. Thanks.
@blueoly, could you help find what the issue is?
@blueoly, @tdegueul, I have a question. Could we first check whether the commit is already present, with
git branch --contains $COMMIT_ID
or
git merge-base --is-ancestor $COMMIT_ID HEAD
and only do a pull if the branch does not contain the commit?
@mhow2 Martin, I will try to reproduce it tonight and come back with any updates.
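A minimal sketch of that check, assuming plain git is available. A throwaway repository stands in for the project working copy here, and $COMMIT_ID for the commit under analysis:

```shell
# Create a throwaway repo to demonstrate the proposed "check before pull".
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "first"
COMMIT_ID=$(git -C "$repo" rev-parse HEAD)

# merge-base --is-ancestor exits 0 when the commit is reachable from HEAD,
# so a pull would only be needed in the else branch.
if git -C "$repo" merge-base --is-ancestor "$COMMIT_ID" HEAD; then
    msg="commit already present; skip the pull"
else
    msg="commit missing; run git pull"
fi
echo "$msg"
rm -rf "$repo"
```

In the platform the same test would run against the real working copy before each scheduled pull.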
@creat89 There is no chance of requesting many commits from the same day. My metrics, and I am pretty sure @tdegueul's metrics too, use only the last commit of each day.
I am not sure, but I believe that requesting the same commit multiple times is not a problem, because the commit will already be checked out from the first request, and a new checkout will not do anything except inform the repo that the commit is ready.
In the same fashion, I think that the checks you suggest are heavier than a plain git pull. If the commit has already been pulled, nothing happens (well, not exactly nothing: a check is made to see whether the repo is up to date). If the commit is not there, a pull is executed, but only once per commit. I am not a git expert, but I don't feel this is something that creates considerable overhead or a reason for the platform to get stuck. And if this were related to git pull, it should affect not only GitLab projects but GitHub ones too.
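The "nothing happens" case is easy to verify locally; a small demonstration with throwaway repositories (no network involved) showing that pulling an already up-to-date clone is essentially a no-op:

```shell
# Build a tiny source repo and clone it.
src=$(mktemp -d)
git -C "$src" init -q
git -C "$src" -c user.email=ci@example.com -c user.name=ci \
    commit -q --allow-empty -m "first"
work=$(mktemp -d)/wc
git clone -q "$src" "$work"

# A second pull with nothing new to fetch just reports the repo is current.
out=$(git -C "$work" pull 2>/dev/null)
echo "$out"
rm -rf "$src" "$(dirname "$work")"
```

The pull prints "Already up to date." (older git versions hyphenate it), which supports the point that repeated pulls on an unchanged repo are cheap.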
We could also add the --verbose flag to the git pull command to see when and whether it gets stuck. Its output would be visible through the platform logging, right?
I did some testing on this case. I saw a behaviour similar to the one @valeriocos reported. However, in one run the task crashed (at -68% progress).
I saw the following exception in the oss-app log:
Exception in thread "pool-9-thread-5" java.lang.StackOverflowError
at java.lang.Class.searchMethods(Class.java:3003)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at org.apache.commons.logging.LogFactory.directGetContextClassLoader(LogFactory.java:847)
at org.apache.commons.logging.LogFactory$1.run(LogFactory.java:817)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:814)
at org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:421)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:657)
at org.apache.http.conn.ssl.DefaultHostnameVerifier.<init>(DefaultHostnameVerifier.java:72)
at org.apache.http.impl.nio.client.HttpAsyncClientBuilder.build(HttpAsyncClientBuilder.java:661)
at org.elasticsearch.client.RestClientBuilder$2.run(RestClientBuilder.java:216)
at org.elasticsearch.client.RestClientBuilder$2.run(RestClientBuilder.java:213)
at java.security.AccessController.doPrivileged(Native Method)
at org.elasticsearch.client.RestClientBuilder.createHttpClient(RestClientBuilder.java:213)
at org.elasticsearch.client.RestClientBuilder.access$000(RestClientBuilder.java:42)
at org.elasticsearch.client.RestClientBuilder$1.run(RestClientBuilder.java:186)
at org.elasticsearch.client.RestClientBuilder$1.run(RestClientBuilder.java:183)
at java.security.AccessController.doPrivileged(Native Method)
at org.elasticsearch.client.RestClientBuilder.build(RestClientBuilder.java:183)
at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:181)
at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:202)
at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:202)
After this I saw multiple repetitions of the last line.
Does this ring any bell? I am not sure whether this behaviour is related to the initial problem reported by @mhow2 or is a different one.
Hi @MarcioMateus, @creat89 and I do not think that https://github.com/crossminer/scava/issues/276#issuecomment-502269094 is related to this issue (#276). However, please provide us with some further information about the stack trace and we can look into it further.
I did a new test. The task froze for some minutes at -100% (minus!) but then started again and crashed at -22%.
Below you can see the stack traces of the metrics; several metrics presented some kind of error: stack-trace.json.txt
I still see a similar error on oss-app:
oss-app_1 | Exception in thread "pool-9-thread-3" INFO [MetricListExecutor (cliflegacy, 20180130)] (15:13:48): Starting execution AnalysisTask 'cliflegacy:task4' with MetricExecution 'trans.rascal.OO.java.A-Java'
oss-app_1 | INFO [MetricListExecutor (cliflegacy, 20180130)] (15:13:48): Adding dependencies to metricProvider 'trans.rascal.OO.java.A-Java' into project 'cliflegacy'
oss-app_1 | INFO [MetricListExecutor (cliflegacy, 20180130)] (15:13:48): Added dependencies to metricProvider 'trans.rascal.OO.java.A-Java' into project 'cliflegacy'
oss-app_1 | INFO [AnalysisSchedulingService] (15:13:48): Starting MetricExecution 'trans.rascal.OO.java.A-Java'
oss-app_1 | INFO [AnalysisSchedulingService] (15:13:48): Update the worker 'w1' heartBeat Mon Jun 17 15:13:48 UTC 2019'
oss-app_1 | INFO [AnalysisSchedulingService] (15:13:48): Starting MetricExecution 'trans.rascal.OO.java.A-Java' is done.
oss-app_1 | INFO [AnalysisSchedulingService] (15:13:48): Starting find out MetricExecution 'trans.rascal.OO.java.A-Java' on Project 'cliflegacy'
oss-app_1 | INFO [MetricListExecutor (cliflegacy, 20180130)] (15:13:48): Starting Metric Execution (A_Java).
oss-app_1 | java.lang.StackOverflowError
oss-app_1 | at java.lang.Class.searchMethods(Class.java:3003)
oss-app_1 | at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
oss-app_1 | at java.lang.Class.getMethod0(Class.java:3018)
oss-app_1 | at java.lang.Class.getMethod(Class.java:1784)
oss-app_1 | at org.apache.commons.logging.LogFactory.directGetContextClassLoader(LogFactory.java:847)
oss-app_1 | at org.apache.commons.logging.LogFactory$1.run(LogFactory.java:817)
oss-app_1 | at java.security.AccessController.doPrivileged(Native Method)
oss-app_1 | at org.apache.commons.logging.LogFactory.getContextClassLoaderInternal(LogFactory.java:814)
oss-app_1 | at org.apache.commons.logging.LogFactory.getFactory(LogFactory.java:421)
oss-app_1 | at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:657)
oss-app_1 | at org.apache.http.impl.nio.client.MainClientExec.<init>(MainClientExec.java:82)
oss-app_1 | at org.apache.http.impl.nio.client.HttpAsyncClientBuilder.build(HttpAsyncClientBuilder.java:847)
oss-app_1 | at org.elasticsearch.client.RestClientBuilder$2.run(RestClientBuilder.java:216)
oss-app_1 | at org.elasticsearch.client.RestClientBuilder$2.run(RestClientBuilder.java:213)
oss-app_1 | at java.security.AccessController.doPrivileged(Native Method)
oss-app_1 | at org.elasticsearch.client.RestClientBuilder.createHttpClient(RestClientBuilder.java:213)
oss-app_1 | at org.elasticsearch.client.RestClientBuilder.access$000(RestClientBuilder.java:42)
oss-app_1 | at org.elasticsearch.client.RestClientBuilder$1.run(RestClientBuilder.java:186)
oss-app_1 | at org.elasticsearch.client.RestClientBuilder$1.run(RestClientBuilder.java:183)
oss-app_1 | at java.security.AccessController.doPrivileged(Native Method)
oss-app_1 | at org.elasticsearch.client.RestClientBuilder.build(RestClientBuilder.java:183)
oss-app_1 | at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:181)
oss-app_1 | at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:202)
oss-app_1 | at org.eclipse.scava.index.indexer.IndexerSingleton.createHighLevelClientLocal(IndexerSingleton.java:202)
But before this error, I find this message multiple times:
oss-app_1 | INFO [Indexer Singleton] (15:13:47): Client Connection Failure
oss-app_1 | INFO [Indexer Singleton] (15:13:47): Client Connection Failure
oss-app_1 | INFO [Indexer Singleton] (15:13:47): java.io.IOException: Connection refused
I don't know which client is having problems...
Let me know if you want me to do any specific test.
Hi @MarcioMateus,
I don't think the issue you are reporting is the same, based on the observations @valeriocos and I made (stuck git pull processes). I think you could open another issue report for yours.
However, if you can try to reproduce the steps we have described, that would be helpful.
@MarcioMateus Open another issue relating to the client and I will look into it.
This is a profiler snapshot from VisualVM, taken at the time the issue happens.
The same happened to me on project https://github.com/RocketChat/Rocket.Chat, so it's not specifically related to GitLab.
I guess that in order to improve the debugging of this issue, we need to log more verbosely when the pull commands are called for each project. Apart from the method that downloads a local copy of a repository and calls pull on every checkout, is there any other metric doing pulls? This would let us focus only on the readers and metrics that could trigger this issue, instead of running all the metrics.
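Beyond verbose logging, one option would be to make a hung pull fail loudly instead of blocking the worker, by wrapping the spawned command in coreutils timeout. A sketch (the wrapper name and the time limit are illustrative, and sleep stands in here for a hanging git pull):

```shell
# Run a command under a time limit; `timeout` kills it and returns
# exit status 124 when the limit is hit.
run_with_timeout() {
    limit=$1; shift
    timeout "$limit" "$@"
}

# In the platform this would wrap `git pull --verbose`; we demonstrate the
# behaviour with a command that is guaranteed to outlive the limit.
if run_with_timeout 1 sleep 5; then
    status=ok
else
    status=timed_out
fi
echo "$status"
```

A timed-out pull would then surface in the platform logs as a failed process rather than a silently frozen task.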
+1 @creat89
BTW I bumped into the issue with project https://github.com/RocketChat/Rocket.Chat (all metrics enabled). It's stuck on:
13 ? Sl 2:29 /usr/bin/java -Xmx8g -Dcom.sun.management.jmxremote.rmi.port=8192 -Dcom.sun.management.jmxremote=tr
470 ? S 0:00 \_ git pull
539 ? S 0:00 \_ git merge FETCH_HEAD
The last line in metric-platform logs is:
INFO [MetricListExecutor (RocketChat, 20180102)] (09:34:27): Ending execution AnalysisTask 'RocketChat:task' with MetricExecution 'style.spreadOfStyleViolations.historic' done in 9 ms
What I can say is that the metric providers in the original description of #312 don't seem to be involved.
After running project https://github.com/chaoss/augur with all metrics, to address https://github.com/crossminer/scava/issues/328, I ran ps axf within the oss-app container and got the following output:
root@dfe65d594f34:/ossmeter# ps axf
PID TTY STAT TIME COMMAND
19304 pts/2 Ss 0:00 bash
19314 pts/2 R+ 0:00 \_ ps axf
18405 pts/1 Ss+ 0:00 bash
20434 pts/0 Ss+ 0:00 bash
1 ? Ss 0:00 ./eclipse -master -apiServer -worker w1 -config prop.properties
13 ? Sl 117:44 /usr/local/openjdk-8/bin/java -Xmx8g -Dcom.sun.management.jmxremote.rmi.port=8192 -Dcom.sun.management.jmxremote=true -Dcom.su
491 ? Z 0:00 [git-remote-http] <defunct>
494 ? Z 0:00 [git-remote-http] <defunct>
499 ? Z 0:00 [git-remote-http] <defunct>
511 ? Z 0:00 [git-remote-http] <defunct>
512 ? Z 0:00 [git-remote-http] <defunct>
513 ? Z 0:00 [git-remote-http] <defunct>
520 ? Z 0:00 [git-remote-http] <defunct>
6081 ? Z 0:00 [git-remote-http] <defunct>
6082 ? Z 0:00 [git-remote-http] <defunct>
6083 ? Z 0:00 [git-remote-http] <defunct>
6086 ? Z 0:00 [git-remote-http] <defunct>
6087 ? Z 0:00 [git-remote-http] <defunct>
14893 ? Z 0:00 [git-remote-http] <defunct>
14894 ? Z 0:00 [git-remote-http] <defunct>
14897 ? Z 0:00 [git-remote-http] <defunct>
14898 ? Z 0:00 [git-remote-http] <defunct>
14899 ? Z 0:00 [git-remote-http] <defunct>
14900 ? Z 0:00 [git-remote-http] <defunct>
14910 ? Z 0:00 [git-remote-http] <defunct>
It seems we are still not properly closing some of the git connections.
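A quick way to watch for this from a shell is to count defunct entries like the git-remote-http ones above; a steadily growing count suggests the parent process never reaps (waits on) its finished children:

```shell
# Count zombie (Z-state) processes visible to ps. On the oss-app container
# this would be run via `docker exec`; here it runs against the local host.
zcount=$(ps axo stat=,comm= | awk '$1 ~ /^Z/' | wc -l)
echo "defunct processes: $zcount"
```
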
Hello everybody,
I found this bug in Eclipse Issue tracker:
https://bugs.eclipse.org/bugs/show_bug.cgi?id=545936
In summary, it says that jgit (the library behind the git pull in CROSSMINER) has issues when it is called multiple times if the Java version is 11. @MarcioMateus and @mhow2, which version of Java are you using?
Hi @creat89,
We are running the metric-platform in a Docker container based on openjdk:8-jdk, so JDK 8.
I have an interesting case today:

root@75a55e4531a2:/ossmeter# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 25136 104 ? Ss Sep19 0:00 ./eclipse -worker w2 -config prop.properties
root 94 24.0 6.7 15073780 2194452 ? Sl Sep19 347:06 /usr/local/openjdk-8/bin/java -Xmx8g -Dcom.sun.management.jmx
root 28050 0.0 0.0 258936 2264 ? S Sep19 0:00 git checkout 7850b4a2dd3d2fbd84f5b08c8077c045bf7a3265
root 28130 1.0 0.0 19948 3532 pts/0 Ss 09:54 0:00 bash
root 28135 0.0 0.0 38380 2956 pts/0 R+ 09:54 0:00 ps aux
As you can see, the git checkout process has been stuck since yesterday.
Below is the related worker:
And the container logs: oss-slave-container.log.gz
It seems I'm lucky enough with the platform to have a task that hangs at some point. The issue happens very randomly, and what I see from the platform with the steps below is that several git pull processes are spawned. It looks weird, at least to me; it feels like the process isn't thread safe. But it could also be a matter with the server itself...

To try to reproduce what I'm observing with https://gitlab.ow2.org/clif/clif-legacy, you need to:
0) add a gitlabToken property (see the pinned message in #general in Slack)
1) import https://gitlab.ow2.org/clif/clif-legacy
2) create a task from 30/01/2018 to 31/01/2018 with all metrics
3) run the task and ps axf within the container (or host), looking for git commands
4) you might see an orphan git pull and the task freeze

If it doesn't freeze, loop over the testing again:
1) delete the task
2) rm /root/scava/cliflegacy -rf within the container
3) restart from 1) of the reproduction steps

===

I'm quoting @tdegueul with his initial finding that made it easier to reproduce this bug.
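Step 3 of the reproduction (looking for git commands in the process tree) can be done with a grep that excludes itself from the match; a small helper, with an illustrative pattern:

```shell
# The [g]it trick keeps the grep process itself out of the results;
# fall back to a message when no matching git subprocess is running.
out=$(ps ax | grep -E '[g]it (pull|clone|fetch|checkout)' || true)
[ -n "$out" ] || out="no git processes found"
echo "$out"
```
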