adoptium / aqa-tests

Home of test infrastructure for Adoptium builds
https://adoptium.net/aqavit
Apache License 2.0

jdk_container left files owned by root #5358

Open llxia opened 1 month ago

llxia commented 1 month ago

jdk_container leaves files on the host machine that are owned by root. The Jenkins job cannot clean up these files, which causes the job to fail.

12:05:52  ERROR: Cannot delete workspace :Unable to delete '/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17113857837219/jdk_container_0/work/scratch/2/jdk-sharedtmp/.com_ibm_tools_attach/_controller'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
[Pipeline] echo
12:05:52  Exception: hudson.AbortException: Cannot delete workspace: Unable to delete '/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17113857837219/jdk_container_0/work/scratch/2/jdk-sharedtmp/.com_ibm_tools_attach/_controller'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
[Pipeline] sh
12:05:53  + rm -rf /home/jenkins/workspace/Grinder/aqa-tests/TKG
12:05:53  rm: cannot remove '/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17113857837219/jdk_container_0/work/scratch/2/jdk-sharedtmp/.com_ibm_tools_attach/_controller': Operation not permitted
12:05:53  rm: cannot remove '/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17113857837219/jdk_container_0/work/scratch/2/jdk-sharedtmp/.com_ibm_tools_attach/_notifier': Operation not permitted

@sophia-guo @smlambert do you also see a similar issue at Adoptium Jenkins? Is there a better way to resolve this?
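One way to confirm which leftover files are the problem is to scan the workspace for entries that the current user does not own. The following is a minimal sketch, not part of aqa-tests or TKG; the function name `find_foreign_files` is hypothetical, and it assumes a POSIX host where `os.getuid()` is available.

```python
import os
from pathlib import Path


def find_foreign_files(root, expected_uid=None):
    """Return paths under root whose owner uid differs from expected_uid.

    The root directory itself is not checked, only its descendants.
    """
    if expected_uid is None:
        expected_uid = os.getuid()  # POSIX only
    foreign = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            try:
                # lstat checks a symlink itself rather than its target
                if os.lstat(path).st_uid != expected_uid:
                    foreign.append(str(path))
            except OSError:
                # Entries that cannot be inspected are reported as well,
                # since they are likely to block cleanup too
                foreign.append(str(path))
    return sorted(foreign)
```

Run as the Jenkins user against the TKG output directory, a non-empty result would list the entries (such as the `_controller` and `_notifier` files above) that `rm -rf` cannot remove.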

sophia-guo commented 1 month ago

These tests were added around one and a half years ago. As they are at dev level, they may not run frequently. I had not noticed this issue. I checked recent jdk21 runs, and they do not seem to have it.

https://ci.adoptium.net/view/Test_openjdk/job/Test_openjdk21_hs_dev.openjdk_x86-64_linux/

llxia commented 3 weeks ago

We should mark the node offline automatically when the error `Cannot delete workspace: Unable to delete ...` occurs.
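Detecting that error in a console log could look roughly like the sketch below. It only covers the detection step; actually taking a node offline would have to go through Jenkins and is not shown. The names `undeletable_paths` and `should_take_node_offline` are hypothetical, and the pattern is based solely on the two log lines quoted in this issue.

```python
import re

# Tolerates both spacings seen in the quoted log:
# "workspace :Unable" and "workspace: Unable".
WORKSPACE_DELETE_ERROR = re.compile(
    r"Cannot delete workspace\s*:\s*Unable to delete '([^']+)'"
)


def undeletable_paths(console_log):
    """Return the distinct paths Jenkins could not delete, in first-seen order."""
    seen = []
    for match in WORKSPACE_DELETE_ERROR.finditer(console_log):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen


def should_take_node_offline(console_log):
    """True if the log contains at least one workspace deletion failure."""
    return bool(undeletable_paths(console_log))
```

Returning the paths, rather than only a boolean, keeps the evidence needed to decide whether the leftover files come from a specific test.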

AswathySK commented 2 weeks ago

Is there any other way to clean up the crash files in the test code itself, instead of marking the node offline, @llxia?

smlambert commented 2 weeks ago

> Is there any other way to clean up the crash files in the test code itself, instead of marking the node offline, @llxia?

llxia is on vacation

Related: https://stackoverflow.com/questions/42423999/cant-delete-file-created-via-docker

sophia-guo commented 2 weeks ago

I think we also need to know why this happens. Does it only happen when impl=openj9|ibm? No issue has been reported with jdk_container running against impl=hotspot.

Normally this permission issue happens when a process runs as root inside the container while writing to a volume mapped from the host. The jdk_container tests' volume options look like `--volume /home/jenkins/workspace/jenkinsjobname/aqa-tests/TKG/output***/jdk_container_0/work/classes/2/....`, which is not the workdir, so they should not have this issue. Is something specific to openj9|ibm causing this?
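To illustrate the general mechanism: a common way to avoid root-owned files on a mapped volume is to pass `--user` to `docker run`, so the container process runs with the host user's uid and gid. The sketch below only builds such an argument list. How the jdk_container tests actually launch docker is not shown in this thread, and whether they can run as a non-root user is an open question; the function name and all arguments are placeholders.

```python
import os


def docker_run_args(image, host_dir, container_dir, command):
    """Build a `docker run` argument list that runs as the invoking host user."""
    # Files written to the mapped volume are then owned by this uid:gid
    # on the host instead of by root (POSIX only).
    user = f"{os.getuid()}:{os.getgid()}"
    return [
        "docker", "run", "--rm",
        "--user", user,
        "--volume", f"{host_dir}:{container_dir}",
        image,
        *command,
    ]
```

The resulting list could be passed to `subprocess.run`. Note that some images assume root inside the container, so this approach may need changes to the image as well.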

AswathySK commented 2 days ago

Are there any updates on this issue? Is @llxia back from vacation?

smlambert commented 2 days ago

> I think we also need to know why this happens. Does it only happen when impl=openj9|ibm? No issue has been reported with jdk_container running against impl=hotspot.

If I had to guess, it happens when a testcase fails and doesn't clean up after itself, so the workspace cannot be deleted. So @AswathySK, perhaps check whether that is the case and exclude the failing testcases.

Lan is not back from vacation and no one is pursuing this issue further at this time. I suggest you dig in to answer some of the questions in this issue if you are interested in a different approach than taking the machine offline.

sophia-guo commented 1 day ago

Just a note that the PR that takes the node offline has also been reverted, which might help your investigation, @AswathySK.

AswathySK commented 21 hours ago

@smlambert, when a test case fails, it is not able to clean up after itself because the files created when it crashes are owned by the root user. And yes, I will investigate further to find out which test cases show this issue.

smlambert commented 16 hours ago

So my point is, the reason we do not have a cleanup problem for Temurin is that there is not a failing/crashing testcase.

So your first task would be to see which testcase is crashing/failing, triage it by gathering any extra data you can, report the issue in the openj9 repo if it doesn't already exist, and exclude the test in the ProblemList files while the issue is being investigated and fixed by the openj9 team.