Test Workflow at NERSC - Githubissues

drphilmarshall commented 8 years ago

An epic task indeed! Not sure when the best time to do this is, but Run 1.1 could be a good opportunity, at least after it has been executed successfully at SLAC. If we are hoping to do Run 2 at NERSC, this could be the way to go.

Things that need doing are:

[x] Make a plan #116
[ ] Request shared service account #86
[x] Figure out how to submit jobs #87
[x] Get Workflow Engine working at NERSC #85
[x] Submit jobs! #87 Bypassing NEWT...
[ ] Create Dockerized images of Twinkles apps? #88

There are probably others, that we can add as we go. Assigning oversight of this to Tony!

tony-johnson commented 8 years ago

Now that Run 1.1 is complete, we should focus on getting things running at NERSC, with a goal to have things working by the end of May. @djbard should we organize a follow-up (phone) meeting to discuss some of the issues that were left from the meeting we had in December, such as running jobs under group account? @brianv0 are there other issues that we need to discuss with NERSC to get job submission to NERSC running.

djbard commented 8 years ago

That's a good idea. I can't meet in the "usual" Twinkles slot tomorrow, but next week (perhaps Tuesday?) would work.

On Thu, Apr 28, 2016 at 8:58 AM, Tony Johnson notifications@github.com wrote:

Now that Run 1.1 is complete, we should focus on getting things running at NERSC, with a goal to have things working by the end of May. @djbard https://github.com/djbard should we organize a follow-up (phone) meeting to discuss some of the issues that were left from the meeting we had in December, such as running jobs under group account? @brianv0 https://github.com/brianv0 are there other issues that we need to discuss with NERSC to get job submission to NERSC running.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/201#issuecomment-215476449

tony-johnson commented 8 years ago

Would Tuesday afternoon work, after 2pm (say 3pm to be concrete?).

Possible topics:

Throughput on serial queues
Max job time (especially as it applies to phosim jobs)
Group account for running jobs at NERSC #86
Submitting jobs directly to NERSC bypassing NEWT #87
Getting twinkles appropriate installs of phosim and DMStack at NERSC
(add more topics here)

djbard commented 8 years ago

3pm will be great. Do you have a BlueJeans number to use?

On Mon, May 2, 2016 at 9:20 AM, Tony Johnson notifications@github.com wrote:

Would Tuesday afternoon work, after 2pm (say 3pm to be concrete?).

Possible topics:

Throughput on serial queues

Max job time (especially as it applies to phosim jobs)

Group account for running jobs at NERSC #86 https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/86

Submitting jobs directly to NERSC bypassing NEWT #87 https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/87

Getting twinkles appropriate installs of phosim and DMStack at NERSC

(add more topics here)

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/201#issuecomment-216282317

djbard commented 8 years ago

Hello: I talked to Shreyas today, who will probably be a useful person to have involved in this discussion since he's our NEWT expert. He's not available tomorrow however - he could meet on Friday before 12am, or in the 2.30 Twinkles slot. Perhaps we could have a half-meeting tomorrow to discuss the areas I can help with, and postpone the other half of the meeting until Friday? -Debbie

On Mon, May 2, 2016 at 10:42 AM, Deborah Bard djbard@lbl.gov wrote:

3pm will be great. Do you have a BlueJeans number to use?

On Mon, May 2, 2016 at 9:20 AM, Tony Johnson notifications@github.com wrote:

Would Tuesday afternoon work, after 2pm (say 3pm to be concrete?).

Possible topics:

Throughput on serial queues

Max job time (especially as it applies to phosim jobs)

Group account for running jobs at NERSC #86 https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/86

Submitting jobs directly to NERSC bypassing NEWT #87 https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/87

Getting twinkles appropriate installs of phosim and DMStack at NERSC

(add more topics here)

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/201#issuecomment-216282317

richardxdubois commented 8 years ago

If there is a meeting today (Tuesday) - I realized I have something on at 3, but 4 would work... if that doesn't work for everyone else, just carry on.

Richard

tony-johnson commented 8 years ago

OK, following Debbie's suggestion, we should meet this afternoon to discuss at least a subset of the topics listed above. Either 3pm or 4pm would work for me -- others should chime in if one time works better than the other. We can use https://bluejeans.com/256850342/. Everyone is welcome but I would suggest that useful attendees would include @brianv0 @TomGlanzman @djbard @jchiang87 @richardxdubois and perhaps @heather999 (although this meeting will be a little late on the east coast).

We can have another meeting on Friday with Shreyas, however we need to check Brian's availability. Brian will be away for 2 week following that, so getting this running by the end of May might be difficult.

djbard commented 8 years ago

I can only do 3pm this afternoon, I have other meetings at 4 (and in fact the rest of today). Let me know if I need to book a slot for me and Shreyas on Friday AM, @brianv0 https://github.com/brianv0.

On Tue, May 3, 2016 at 8:51 AM, Tony Johnson notifications@github.com wrote:

OK, following Debbie's suggestion, we should meet this afternoon to discuss at least a subset of the topics listed above. Either 3pm or 4pm would work for me -- others should chime in if one time works better than the other. We can use https://bluejeans.com/256850342/. Everyone is welcome but I would suggest that useful attendees would include @brianv0 https://github.com/brianv0 @TomGlanzman https://github.com/TomGlanzman @djbard https://github.com/djbard @jchiang87 https://github.com/jchiang87 @richardxdubois https://github.com/richardxdubois and perhaps @heather999 https://github.com/heather999 (although this meeting will be a little late on the east coast).

We can have another meeting on Friday with Shreyas, however we need to check Brian's availability. Brian will be away for 2 week following that, so getting this running by the end of May might be difficult.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/201#issuecomment-216573644

heather999 commented 8 years ago

I was pondering attending and I think I can manage 3 PM Pacific. Thanks!

tony-johnson commented 8 years ago

Sounds like the consensus is to meet at 3pm today. We can discuss possible follow up meeting on Friday at today's meeting.

tony-johnson commented 8 years ago

Meeting starting in 5 minutes. Notes and agenda are here:

https://docs.google.com/document/d/1NnsW3XZTQcJNqVi-3RZQivCqSMlY753S2MZk9dsbMNg/edit?usp=sharing

heather999 commented 8 years ago

Are we still planning to meet at 2:30 Pacific today?

tony-johnson commented 8 years ago

Debbie had suggested we meet at 2:30 today to discuss NEWT specific issues with Shreyas Cholia. I have not heard a definitive reply as to whether he can make 2:30, I just sent him a reminder, will post here if the meeting is happening. We could also use the meeting to further discuss other NERSC issues if we have it.

tony-johnson commented 8 years ago

We will have a meeting at 2:30 today, using https://bluejeans.com/256850342/.

It will focus on use of NEWT for submitting jobs from the workflow engine to NERSC. Topics include:

Brian has successfully submitted jobs via NEWT, but the job completion emails don't seem to be getting through. Is there some issue with NEWT that would prevent this from working?
Jobs disappear quickly from the slurm database - need to use sacct to access the job stats which NEWT doesn't currently support. Would it be possible to add this functionality? (this was a ticket, but I couldn't find the ticket number).
is it possible to use group accounts with NEWT, both for logging in and for submitting jobs. The reason for this would be to avoid having to build individual user's username/passwords into our code, which is something we normally try to avoid. If that is not possible is it possible for us to get an account initially attached to a user but which could be transfered to a different user if necessary in future.
Is NEWT really the best way to do what we want to do? One alternative we had discussed in the past is the possibility of running our own "job daemon" in a docker container.

tony-johnson commented 8 years ago

HI @djbard, I have tried running some jobs at NERSC by hand using the stack installed by @heather999 described in #242.

The good news is that if I submit a large number (~1200) of short (~00:30:00) jobs in the shared queue on cori they all seem to start pretty fast, in fact I think they all ran in about 2 hours.

The less good news is that f I submit only 10 jobs at a time (so far I am just running processEimage jobs) they run fine, but if I submit 100 jobs or more at a time they just appear to all hang and then die with:

slurmstepd: * JOB 2245087 ON nid02297 CANCELLED AT 2016-05-18T13:24:30 DUE TO TIME LIMIT *

I am writing the output to the scratch disk, so maybe they are just running very slowly due to disk contention? Are there any recommended tools for investigating issues like this (at SLAC for instance we can use ganglia to monitor disk loads, and tools like lsrun to look at the processes running on the batch nodes). Or documentation I should be reading? Or should I just contact the support desk?

djbard commented 8 years ago

Hi Tony: your best bet is absolutely (and is always!) to submit a ticket, including your batch scripts so they can look for problems with your syntax (that's pretty common with job arrays). You'll get a much faster and more comprehensive response than I can give you - NERSC consultants are great. But good news that your large array ran quickly! And that it can run at all, great that the stack is working now.

On Wed, May 18, 2016 at 5:14 PM, Tony Johnson notifications@github.com wrote:

HI @djbard https://github.com/djbard, I have tried running some jobs at NERSC by hand using the stack installed by @heather999 https://github.com/heather999 described in #242 https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/242.

The good news is that if I submit a large number (~1200) of short (~00:30:00) jobs in the shared queue on cori they all seem to start pretty fast, in fact I think they all ran in about 2 hours.

The less good news is that f I submit only 10 jobs at a time (so far I am just running processEimage jobs) they run fine, but if I submit 100 jobs or more at a time they just appear to all hang and then die with:

slurmstepd: * JOB 2245087 ON nid02297 CANCELLED AT 2016-05-18T13:24:30 DUE TO TIME LIMIT *

I am writing the output to the scratch disk, so maybe they are just running very slowly due to disk contention? Are there any recommended tools for investigating issues like this (at SLAC for instance we can use ganglia to monitor disk loads, and tools like lsrun to look at the processes running on the batch nodes). Or documentation I should be reading? Or should I just contact the support desk?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/201#issuecomment-220194865

drphilmarshall commented 8 years ago

Thanks @djbard! And you Tony, for pushing on this. I guess submitting jobs via the workflow engine is the next step, while the NERSC consultants chew on your log files: is #87 the right issue to work on?

Tony, it could be that you can tick off some of your checkboxes by closing some of those issues. I feel like you deserve some issue closure.

tony-johnson commented 8 years ago

Before contacting NERSC consultants I decided to contact DM support to see if they have any suggestions first:

https://community.lsst.org/t/disk-contention-when-running-many-dm-jobs-advice-needed/791

djbard commented 8 years ago

I'd strongly recommend asking the consultants anyway @tony-johnson - that's what they're there for! I am quite sure you have an issue with your scripts, not with $SCRATCH contention. You can check the CSCRATCH performance here: https://my.nersc.gov/filesystems-cs.php

tony-johnson commented 8 years ago

Impossible -- I never have issues with my scripts! But OK I have submitted an issue, I don't know if you @djbard or anyone else can see it?

https://nersc.service-now.com/nav_to.do?uri=incident.do%3Fsys_id=54248d4d0fa752009ce491a6d1050e9d%26sysparm_stack=incident_list.do%3Fsysparm_query=active=true

djbard commented 8 years ago

ha! Yes, I can see that (and am "watching" the ticket) but noone else will be able to. At the least, by filing the ticket we're raising awareness of how this project will be running jobs.

tony-johnson commented 8 years ago

Status update:

After moving all scripts, input and output data, and log files to the scratch disk, have succeeded in running all of the 1.1 eimage jobs at NERSC, although still with very low efficiency (~4% CPU/elapsed). Not sure keeping everything on scratch is a long-term solution, but there are somme other suggestions from DM and NERSC consultants to try (time permitting).

For now I plan to use the eimage output to try re-running the Run 1.1 coadd jobs at NERSC.

tony-johnson commented 8 years ago

Cori 06/11/16 4:30-06/27/16 8:00 PDT Scheduled maintenance. Cori will be unavailable while preparations are made for the installation of Cori Phase II. Login nodes and job submissions will be unavailable for the duration of the maintenance. The entire two week window is reserved, although the system may be available sooner depending on the progress of the maintenance.

djbard commented 8 years ago

FYI: I've gotten a shared partition installed on Edison for the duration of the outage (and perhaps for the rest of the summer, until after the Cori P1/P2 integration). It has 1300MB/core (so less than the Cori shared partition) but it will be useful.

heather999 commented 8 years ago

That sounds very useful @djbard, would we need a fresh DM installation for Edison?

djbard commented 8 years ago

I think you would need an Edison-specific installation, @heather999. But it should be the same installation process. Another thing to note - the upcoming downtime on Cori is to install a new OS, so the DM stack (and indeed all code) may need to be recompiled after it. We've been told that everything should recompile just fine, no changes necessary...

heather999 commented 8 years ago

Thanks, @djbard, yes I suspect this should be very straightforward to re-install. @tony-johnson is it ok if I get to that tonight or tomorrow? Or are you hoping to hop onto Edison sooner?

tony-johnson commented 8 years ago

@heather999 , I am not entirely sure I have plans to hop onto edison at all, because next week I will be busy with the camera workshop at SLAC, and I am not sure that we are ready to run Twinkles run 2. I think this should be a topic for tomorrow's twinkles meeting.

drphilmarshall commented 8 years ago

Yes - I added an agenda item at the top of the agenda. @djbard we'll report back here afterwards - you're welcome to join us at 12noon PDT as well, of course! :-)

tony-johnson commented 8 years ago

Run 2 (== Run 1.1 but at NERSC) is now complete (at last). Will add details (and close issue) tomorrow.

tony-johnson commented 8 years ago

More details on Run 2. The stream is completed, the pipeline task can be found here:

http://srs.slac.stanford.edu/Pipeline-II/exp/SRS/si.jsp?stream=41843928

Note that the links to the log files require the dev pipeline web interface so they do not work from the above link. ( @brianv0 can we move those changes to prod). The output from the jobs can be found here:

/global/cscratch1/sd/tony_j/Twinkles/output/work/2/output

Maybe this should be moved to a more permanent location.

Although the jobs eventually completed there are a number of outstanding issues:

Many jobs failed due to either NODE_FAIL or CANCELLED status and had to be repeatedly rerun before they eventually succeeded. There is an outstanding ticket to understand this here: INC0086948
The coadd jobs had to be given 40GBytes to run (meaning we got charged for 22 cores). The same jobs at SLAC appear to have only used 16GGBytes. Would be good to understand the difference.
Long delays caused by waiting for jobs to start (and restart, and restart)
There were small numbers of random failures, which had to be rerun. It would be good to track down what causes these. One example:

CameraMapper: Loading registry registry from /global/cscratch1/sd/tony_j/Twinkles/output/work/2/output/_parent/registry.sqlite3
Traceback (most recent call last):
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/meas_base/2016_01.0-20-g803754d+1/bin/forcedPhotCcd.py", line 24, in <module>
    ForcedPhotCcdTask.parseAndRun()
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/pipe_base/12.0+3/python/lsst/pipe/base/cmdLineTask.py", line 452, in parseAndRun
    resultList = taskRunner.run(parsedCmd)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/pipe_base/12.0+3/python/lsst/pipe/base/cmdLineTask.py", line 194, in run
    if self.precall(parsedCmd):
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/pipe_base/12.0+3/python/lsst/pipe/base/cmdLineTask.py", line 284, in precall
    task.writeSchemas(parsedCmd.butler, clobber=self.clobberConfig, doBackup=self.doBackup)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/pipe_base/12.0+3/python/lsst/pipe/base/cmdLineTask.py", line 527, in writeSchemas
    butler.put(catalog, schemaDataset, doBackup=doBackup)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/daf_persistence/12.0+1/python/lsst/daf/persistence/butler.py", line 396, in put
    location.repository.write(location, obj)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/daf_persistence/12.0+1/python/lsst/daf/persistence/repository.py", line 266, in write
    return self._access.write(butlerLocation, obj)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/daf_persistence/12.0+1/python/lsst/daf/persistence/access.py", line 117, in write
    self.storage.write(butlerLocation, obj)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/daf_persistence/12.0+1/python/lsst/daf/persistence/posixStorage.py", line 207, in write
    self.persistence.persist(obj, storageList, additionalData)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/miniconda/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/daf_persistence/12.0+1/python/lsst/daf/persistence/safeFileIo.py", line 84, in SafeFilename
    setFileMode(name)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/daf_persistence/12.0+1/python/lsst/daf/persistence/safeFileIo.py", line 48, in setFileMode
    os.chmod(filename, (~umask & 0o666))
OSError: [Errno 2] No such file or directory: '/global/cscratch1/sd/tony_j/Twinkles/output/work/2/output/schema/forced.fits'

tony-johnson commented 8 years ago

Some plots showing throughput.

nersc-timeline-full nersc-timeline-forcedphotometry nersc-timeline-eimage

tony-johnson commented 8 years ago

Declaring victory and closing issue.

drphilmarshall commented 8 years ago

Excellent! Great if you can talk us through these results when we meet at 12, as well.

On Thu, Jul 7, 2016 at 9:22 AM, Tony Johnson notifications@github.com wrote:

Declaring victory and closing issue.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/201#issuecomment-231131323, or mute the thread https://github.com/notifications/unsubscribe/AArY9wLQ0y3PXBodB_bH4ku0C9IZh4UDks5qTSfegaJpZM4H8VRD .

djbard commented 8 years ago

I'd like to join in your noontime meeting to hear how things went. What are the connection deets?

On Thu, Jul 7, 2016 at 10:54 AM, Phil Marshall notifications@github.com wrote:

Excellent! Great if you can talk us through these results when we meet at 12, as well.

On Thu, Jul 7, 2016 at 9:22 AM, Tony Johnson notifications@github.com wrote:

Declaring victory and closing issue.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub < https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/201#issuecomment-231131323 , or mute the thread < https://github.com/notifications/unsubscribe/AArY9wLQ0y3PXBodB_bH4ku0C9IZh4UDks5qTSfegaJpZM4H8VRD

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/201#issuecomment-231156541, or mute the thread https://github.com/notifications/unsubscribe/ABtSX-IGNJhsFgoOdflpt8ZrFhpZmuAJks5qTT1DgaJpZM4H8VRD .

drphilmarshall commented 8 years ago

I'll mention you on the reminder - hang on a tick

LSSTDESC / Twinkles

Test Workflow at NERSC #201