Closed jhpoelen closed 5 years ago
All the jobs tagged with "spark" run on idb-jupyter1 and that host's agent service is stopped. I believe normal activity would be for that host to end up with most of the work.
Also, the only reason we have 2 job slots per host is that it's the default, and we didn't do the resource usage math to understand how many we can really run. We can increase that. Looking at this now.
Restarting the Jenkins master restored the connections to the slaves; trying to restart/start the slaves didn't work. The iDigBio test mini job completed.
I increased the number of concurrent build jobs (executors) on the master and idb-jupyter1 to 4. I think this is OK since non-spark jobs should be low-memory and single-threaded.
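For reference, the resource usage math for job slots could look roughly like the sketch below. All of the numbers and names here are hypothetical placeholders, not measurements from these hosts: the idea is only that the slot count is capped by whichever runs out first, memory or cores.

```python
def max_job_slots(total_mem_gb, cores, mem_per_job_gb, threads_per_job,
                  reserved_mem_gb=0):
    """Estimate how many jobs a host can run at once.

    Every argument is an assumption supplied by the caller; none of these
    values come from the actual hosts discussed in this thread.
    """
    if mem_per_job_gb <= 0 or threads_per_job <= 0:
        raise ValueError("per-job memory and threads must be positive")
    # Memory left over after reserving some for the OS and other services.
    usable_mem_gb = max(total_mem_gb - reserved_mem_gb, 0)
    slots_by_memory = int(usable_mem_gb // mem_per_job_gb)
    slots_by_cpu = cores // threads_per_job
    # The scarcer resource sets the limit.
    return min(slots_by_memory, slots_by_cpu)


# Hypothetical host: 16 GB RAM, 8 cores, 4 GB reserved,
# jobs that need about 2 GB and 1 thread each.
print(max_job_slots(16, 8, 2, 1, reserved_mem_gb=4))
```

With single-threaded, low-memory jobs the limit is usually memory rather than cores, so the per-job memory estimate is the number worth measuring first.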
So root cause:
Because of increased usage and longer-running jobs, the wait time for jobs to run on archive.guoda.bio is increasing. This leads to the perception that things are paused or no longer working.
Attached a screenshot of the admin interface showing pending jobs and offline agents (e.g., idb-jupyter1.acis.ufl.edu, moose.acis.ufl.edu).
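Offline agents like these can also be spotted without the admin page. A minimal sketch, assuming the Jenkins JSON API is reachable at `<base_url>/computer/api/json` and returns the usual `computer` list with `displayName` and `offline` fields; the `base_url` argument and the function names are placeholders for illustration, and authentication is left out:

```python
import json
from urllib.request import urlopen


def offline_agents(payload):
    """Return the display names of agents that Jenkins reports as offline.

    `payload` is the parsed JSON from <base_url>/computer/api/json.
    """
    return [
        node.get("displayName", "<unnamed>")
        for node in payload.get("computer", [])
        if node.get("offline")
    ]


def fetch_offline_agents(base_url):
    """Fetch the node list from a Jenkins instance and list offline agents.

    `base_url` is a placeholder; a real instance will likely need credentials.
    """
    with urlopen(base_url.rstrip("/") + "/computer/api/json") as resp:
        return offline_agents(json.load(resp))


# Example with a hand-written payload instead of a live server.
sample = {
    "computer": [
        {"displayName": "master", "offline": False},
        {"displayName": "idb-jupyter1.acis.ufl.edu", "offline": True},
        {"displayName": "moose.acis.ufl.edu", "offline": True},
    ]
}
print(offline_agents(sample))
```

Running something like this on a schedule would flag a stopped agent service before the job queue starts to back up.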