It looked as if it was constantly doing something to /var/lib/jenkins/workspace/dge-graph-hub_kg-covid-19_master/gitrepo/merged-kg.jnl, but it did not change the file size. It could have just been very aggressively bit flipping.
There was definitely a fair amount of writing going on, but I'm not exactly sure what was being written:
28125 be/4 jenkins 0.00 B/s 17.39 M/s 0.00 % 93.13 % java -Xmx128G -cp /var/lib/jenkins/workspace/dge~graph-hub_kg-covid-19_master/gi [com.bigdata.rws]
I would also note that at the docker level, it looked like:
bbop@stove:~$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
26234271d5bc justaddcoffee/ubuntu20-python-3-8-5-dev:4 "cat" 3 days ago Up 3 days upbeat_knuth
which would be consistent with an issue like https://stackoverflow.com/questions/54585747/jenkins-docker-container-simply-hangs-and-never-executes-steps
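For context, a container whose COMMAND is just cat is what the Jenkins Docker Pipeline plugin produces: it starts the image with cat as the main process to keep the container alive, then runs each build step inside it via docker exec. A minimal sketch of that pattern (the sh step body here is a hypothetical placeholder):

```groovy
// The Docker Pipeline plugin keeps the build container alive by running
// "cat" as its main process, then launches each step inside it with
// "docker exec". If an exec'd step hangs, "docker ps" shows exactly what
// we saw: a days-old container whose COMMAND is just "cat".
docker.image('justaddcoffee/ubuntu20-python-3-8-5-dev:4').inside {
    sh './run_merge.sh'  // hypothetical placeholder for the real pipeline steps
}
```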
Whatever happened prevented docker from being able to safely close out the running container with stop and kill, which also prevented us from re-invading the container to see what was going on (possibly the processes needed to invade the container had already been offed).
In the end, the machine was power cycled. Due to other restrictions on the machine, we won't be doing that lightly again.
I would try running that a second time first.
"... in case this was something that rebooting might fix" The only things a reboot would fix would be cruft in temp filesystems, otherwise it would/could occur again in the future under some circumstance. Also noting a manual purging of the filesystem that would not have been touched by a reboot.
Interestingly, it failed the second time, but this time because the journal file from the previous run seems to still exist, which is puzzling:
+ pigz ../merged-kg.jnl
17:14:52 pigz: abort: write error on ../merged-kg.jnl.gz (Inappropriate ioctl for device)
^^ this is pigz unhelpfully saying that it wants to ask whether it should overwrite the existing file, but it can't because this isn't an interactive terminal
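For what it's worth, pigz (like gzip) has a -f/--force flag that overwrites an existing output file instead of trying to prompt, which would sidestep this particular failure. A minimal sketch, assuming the compression runs as an sh step in the Jenkinsfile:

```groovy
// Overwrite any leftover merged-kg.jnl.gz from a previous run instead of
// prompting, which fails ("Inappropriate ioctl for device") when there is
// no interactive terminal attached.
sh 'pigz -f ../merged-kg.jnl'
```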
I had a feeling something like that might happen. Wanna guess where my current theory is heading ;) I think once a workspace has been dirtied in certain ways, things get "weird". We can try and track down these as specific cases until a pattern comes up.
Haha before we go blaming my beloved jenkinsuser, I am going to try cleanWs() and see if it removes files from previous runs...
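For reference, a minimal sketch of one place cleanWs() (from the Workspace Cleanup plugin) can be wired into a declarative Jenkinsfile; the stage contents are hypothetical placeholders:

```groovy
pipeline {
    agent any
    stages {
        stage('Merge') {
            steps {
                sh 'echo placeholder for the real merge/blazegraph steps'
            }
        }
    }
    post {
        always {
            // Delete the workspace after every run, pass or fail, so files
            // like merged-kg.jnl from a previous build can't leak into the
            // next one.
            cleanWs()
        }
    }
}
```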
It's evident that previous runs in a given workspace are causing some issues in this Jenkins run, although it's not clear whether this is the cause of the blazegraph step hanging.
#432 may have fixed this. The next Jenkins run here may help confirm it.
Confirming that this is fixed. I am assuming (without absolute proof) that my failure to remove stuff from previous runs was causing this problem.
Describe the bug
When running the Jenkins pipeline, the process hangs for >3 days on the blazegraph journal stage (see here for Jenkins logs). Here's the chatter before/during the hang:
To Reproduce
Run pipeline on https://github.com/Knowledge-Graph-Hub/kg-covid-19/commit/157cd65ed00e6cf5a3bc19d4dfc697593ae92132
Expected behavior
Should run to completion
Version
https://github.com/Knowledge-Graph-Hub/kg-covid-19/commit/157cd65ed00e6cf5a3bc19d4dfc697593ae92132
Additional context
Discussed with @kltm and had a look at the Docker container while it was hanging. It did not seem to actually be writing anything to the blazegraph journal file. Not clear why.