LSSTDESC / Twinkles

10 years. 6 filters. 1 tiny patch of sky. Thousands of time-variable cosmological distance probes.
MIT License

Resuscitate ability of workflow engine to submit jobs to NERSC #85

Closed tony-johnson closed 8 years ago

tony-johnson commented 8 years ago

The workflow engine needs to be upgraded to work with the new batch system (SLURM) at NERSC.

brianv0 commented 8 years ago

A few issues came up:

  1. Slurm's squeue (tested on Edison) behaves differently from PBS's qstat command. squeue only shows the status of jobs that are waiting to run and, I believe, jobs that are running. Previously, qstat would keep showing a job's completion status for something like 20 minutes after the job finished.
  2. NEWT must use squeue internally as well, since it shows the same behavior.
  3. NEWT doesn't have native support for sacct, which can show the status of jobs that have already finished.
  4. sendmail on worker nodes doesn't appear to actually send out email, but sending email from a login node worked fine when I tested it.
  5. The NEWT-based job control daemon doesn't do session management, so either the service needs to be restarted or the daemon needs a mechanism to replace its login cookies with fresh ones before the 24 hours are up.
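
To illustrate points 1 and 3, here is a minimal sketch of how a status check could fall back from squeue to sacct once a job has left the queue. It is not the workflow engine's actual code: the names `run_command` and `job_state` are hypothetical, and it assumes the check runs somewhere that can call the Slurm command-line tools directly (for example a login node) rather than going through NEWT.

```python
import subprocess


def run_command(args):
    """Run a command and return its stdout as text ('' if it fails)."""
    try:
        result = subprocess.run(args, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        # The Slurm tools are not installed on this machine.
        return ""
    return result.stdout if result.returncode == 0 else ""


def job_state(job_id, run=run_command):
    """Return a Slurm state string such as 'PENDING' or 'COMPLETED', or None.

    squeue is asked first; if the job is no longer in the queue,
    sacct (the accounting database) is used as a fallback.
    """
    # -h: no header, -j: select the job, -o %T: print only the long state name
    out = run(["squeue", "-h", "-j", str(job_id), "-o", "%T"]).strip()
    if out:
        return out.splitlines()[0].strip()

    # -n: no header, -X: allocation only (skip job steps), -P: parsable output
    out = run(["sacct", "-n", "-X", "-P", "-j", str(job_id), "-o", "State"]).strip()
    if out:
        # States can look like "CANCELLED by 1234"; keep only the first word.
        return out.splitlines()[0].split()[0]
    return None
```

The `run` parameter is only there so the lookup order can be exercised without a Slurm installation; in practice `job_state(job_id)` would be called with the default.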

brianv0 commented 8 years ago

All that being said, I was able to submit jobs, but they failed immediately, probably because no wall clock time was specified, although I'm not sure about that.
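
If the missing wall clock limit does turn out to be the cause, one way to rule it out would be to always pass an explicit `--time` to sbatch. The sketch below only builds the command line; the helper name `sbatch_command`, the default of 30 minutes, and the script name `run_twinkles.sh` are all made up for illustration and are not taken from the workflow engine.

```python
import shlex


def sbatch_command(script, walltime="00:30:00", partition=None, job_name=None):
    """Build an sbatch argument list that always carries a wall-clock limit."""
    # --time takes a limit such as HH:MM:SS; the default here is an assumption.
    args = ["sbatch", "--time=" + walltime]
    if partition:
        args.append("--partition=" + partition)
    if job_name:
        args.append("--job-name=" + job_name)
    # The batch script goes last, after all sbatch options.
    args.append(script)
    return args


# Show the command that would be submitted (hypothetical script name).
print(shlex.join(sbatch_command("run_twinkles.sh", walltime="01:00:00")))
```

The returned list can be handed to `subprocess.run` as is, which avoids any shell quoting issues.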

tony-johnson commented 8 years ago

@brianv0 now has the workflow engine running at NERSC, and we have been able to run a small sample of TwinklesDM jobs using it.

The current implementation does not use NEWT, but instead requires that the daemon is run directly on a login node at NERSC. This seems to work fine, although currently it has to be manually started.

One remaining bug is that rollback does not currently work, but Brian will hopefully get that fixed today. Once that is done the workflow is in principle ready to be used for running Run2 at NERSC.

tony-johnson commented 8 years ago

Rollback is fixed. Closing this issue since basic functionality is now complete. Will open other issues for any additional work.

drphilmarshall commented 8 years ago

Hooray! Nice work, you two. So cool to have gained a supercomputing center :-)

On Thu, Jun 9, 2016 at 9:00 AM, Tony Johnson notifications@github.com wrote:

Closed #85 https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/85.
