archive.guoda.bio has a long wait time

bio-guoda / guoda-services

Services provided by GUODA, currently a container for tickets and wikis.

MIT License

archive.guoda.bio has a long wait time #59

Closed: jhpoelen closed this issue 5 years ago

jhpoelen commented 5 years ago

Because of increased usage and longer-running jobs, the wait time for jobs to run on archive.guoda.bio is increasing. This leads to the perception that things are paused or no longer working.

Attached is a screenshot of the admin interface showing pending jobs and offline agents (e.g., idb-jupyter1.acis.ufl.edu, moose.acis.ufl.edu).

screenshot from 2018-11-05 08-49-26

mjcollin commented 5 years ago

All the jobs tagged with "spark" run on idb-jupyter1 and that host's agent service is stopped. I believe normal activity would be for that host to end up with most of the work.

Also, the only reason we have 2 job slots per host is that it's the default; we didn't do the resource usage math to work out how many we can really run. We can increase that. Looking at this now.
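For reference, the resource usage math could look roughly like the sketch below. This is only an illustration: the function name `max_job_slots` and all the numbers (16 GB RAM, 8 cores, 2 GB per job, 4 GB reserved) are made-up assumptions, not measurements from idb-jupyter1 or any other host.

```python
def max_job_slots(total_mem_gb, cores, mem_per_job_gb,
                  threads_per_job=1, reserved_mem_gb=0.0):
    """Estimate how many concurrent job slots a host can carry.

    The estimate is the smaller of the memory-bound and CPU-bound limits.
    All inputs are assumptions the operator has to supply.
    """
    if mem_per_job_gb <= 0 or threads_per_job <= 0:
        raise ValueError("per-job memory and thread counts must be positive")
    # Memory left for jobs after reserving some for the OS and other services.
    usable_mem_gb = max(total_mem_gb - reserved_mem_gb, 0)
    by_memory = int(usable_mem_gb // mem_per_job_gb)
    by_cpu = cores // threads_per_job
    return min(by_memory, by_cpu)


# Hypothetical host: 16 GB RAM, 8 cores, 2 GB single-threaded jobs, 4 GB reserved.
print(max_job_slots(16, 8, 2, reserved_mem_gb=4))  # prints 6
```

In this made-up case memory is the limit (6 slots) rather than CPU (8 slots), so the slot count should follow the tighter of the two.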

mjcollin commented 5 years ago

Restarting the Jenkins master restored the connections to the slaves; trying to restart/start the slaves didn't work. The iDigBio test mini job completed.

I increased the number of build executors (job slots) on the master and idb-jupyter1 to 4. I think this is OK since non-spark jobs should be low-memory and single-threaded.

So root causes and fixes:

No slaves running -> restarted master
Slaves have too few slots -> increased; monitor system usage.
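To catch the first cause earlier next time, something like the sketch below could poll Jenkins for offline agents. It assumes the standard Jenkins JSON endpoint `/computer/api/json` and its `computer`, `displayName`, and `offline` fields; the base URL, any authentication the instance needs, and the sample payload are placeholders, not values taken from archive.guoda.bio.

```python
import json
from urllib.request import urlopen


def fetch_computers(base_url):
    """Fetch the node list from a Jenkins instance.

    Assumes anonymous read access; an instance that requires login
    would need credentials added to the request.
    """
    with urlopen(f"{base_url.rstrip('/')}/computer/api/json") as resp:
        return json.load(resp)


def offline_agents(payload):
    """Return display names of nodes that Jenkins reports as offline."""
    return [c["displayName"]
            for c in payload.get("computer", [])
            if c.get("offline")]


# Made-up payload in the shape of /computer/api/json, for illustration only.
sample = {
    "computer": [
        {"displayName": "master", "offline": False, "numExecutors": 4},
        {"displayName": "idb-jupyter1", "offline": True, "numExecutors": 4},
        {"displayName": "moose", "offline": True, "numExecutors": 2},
    ]
}
print(offline_agents(sample))  # prints ['idb-jupyter1', 'moose']
```

Running `offline_agents(fetch_computers(...))` from cron and alerting on a non-empty list would be one way to notice stopped agents before the queue backs up.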