Now that Run 1.1 is complete, we should focus on getting things running at NERSC, with a goal of having things working by the end of May. @djbard, should we organize a follow-up (phone) meeting to discuss some of the issues left over from the meeting we had in December, such as running jobs under a group account? @brianv0, are there other issues we need to discuss with NERSC to get job submission there working?
That's a good idea. I can't meet in the "usual" Twinkles slot tomorrow, but next week (perhaps Tuesday?) would work.
On Thu, Apr 28, 2016 at 8:58 AM, Tony Johnson notifications@github.com wrote:
Now that Run 1.1 is complete, we should focus on getting things running at NERSC, with a goal to have things working by the end of May. @djbard https://github.com/djbard should we organize a follow-up (phone) meeting to discuss some of the issues that were left from the meeting we had in December, such as running jobs under group account? @brianv0 https://github.com/brianv0 are there other issues that we need to discuss with NERSC to get job submission to NERSC running.
— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/201#issuecomment-215476449
Would Tuesday afternoon work, after 2pm (say 3pm, to be concrete)?
Possible topics:
- Throughput on serial queues
- Max job time (especially as it applies to phosim jobs)
- Group account for running jobs at NERSC (#86)
- Submitting jobs directly to NERSC, bypassing NEWT (#87)
- Getting Twinkles-appropriate installs of phosim and the DM Stack at NERSC
- (add more topics here)
3pm will be great. Do you have a BlueJeans number to use?
Hello: I talked to Shreyas today, who will probably be a useful person to have involved in this discussion since he's our NEWT expert. He's not available tomorrow, however; he could meet on Friday before noon, or in the 2:30 Twinkles slot. Perhaps we could have a half-meeting tomorrow to discuss the areas I can help with, and postpone the other half of the meeting until Friday? -Debbie
If there is a meeting today (Tuesday) - I realized I have something on at 3, but 4 would work... if that doesn't work for everyone else, just carry on.
Richard
OK, following Debbie's suggestion, we should meet this afternoon to discuss at least a subset of the topics listed above. Either 3pm or 4pm would work for me -- others should chime in if one time works better than the other. We can use https://bluejeans.com/256850342/. Everyone is welcome but I would suggest that useful attendees would include @brianv0 @TomGlanzman @djbard @jchiang87 @richardxdubois and perhaps @heather999 (although this meeting will be a little late on the east coast).
We can have another meeting on Friday with Shreyas, but we need to check Brian's availability. Brian will be away for two weeks after that, so getting this running by the end of May might be difficult.
I can only do 3pm this afternoon; I have other meetings at 4 (and in fact for the rest of today). Let me know if I need to book a slot for me and Shreyas on Friday AM, @brianv0.
I was pondering attending and I think I can manage 3 PM Pacific. Thanks!
Sounds like the consensus is to meet at 3pm today. We can discuss a possible follow-up meeting on Friday at today's meeting.
Meeting starting in 5 minutes. Notes and agenda are here:
https://docs.google.com/document/d/1NnsW3XZTQcJNqVi-3RZQivCqSMlY753S2MZk9dsbMNg/edit?usp=sharing
Are we still planning to meet at 2:30 Pacific today?
Debbie had suggested we meet at 2:30 today to discuss NEWT-specific issues with Shreyas Cholia. I have not heard a definitive reply as to whether he can make 2:30; I just sent him a reminder and will post here if the meeting is happening. If we do meet, we could also use the time to further discuss other NERSC issues.
We will have a meeting at 2:30 today, using https://bluejeans.com/256850342/.
It will focus on use of NEWT for submitting jobs from the workflow engine to NERSC. Topics include:
Hi @djbard, I have tried running some jobs at NERSC by hand using the stack installed by @heather999 described in #242.
The good news is that if I submit a large number (~1200) of short (~00:30:00) jobs in the shared queue on Cori, they all seem to start pretty fast; in fact, I think they all ran in about 2 hours.
The less good news is that if I submit only 10 jobs at a time (so far I am just running processEimage jobs) they run fine, but if I submit 100 or more jobs at a time they all appear to hang and then die with:
```
slurmstepd: *** JOB 2245087 ON nid02297 CANCELLED AT 2016-05-18T13:24:30 DUE TO TIME LIMIT ***
```
I am writing the output to the scratch disk, so maybe they are just running very slowly due to disk contention? Are there any recommended tools for investigating issues like this? (At SLAC, for instance, we can use ganglia to monitor disk loads, and tools like lsrun to look at the processes running on the batch nodes.) Or documentation I should be reading? Or should I just contact the support desk?
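For reference, here is a minimal sketch of the kind of shared-queue job array being described. The partition name and 30-minute limit come from the comment above; the paths, the visit list file, the stack setup, and the exact processEimage invocation are all assumptions for illustration, not the actual script.

```bash
#!/bin/bash
# Hypothetical Cori shared-queue job array (a sketch, not the real script).
#SBATCH --partition=shared
#SBATCH --time=00:30:00          # short jobs, per the comment above
#SBATCH --ntasks=1
#SBATCH --array=1-100            # one array element per visit
#SBATCH --output=logs/%A_%a.out  # %A = array job id, %a = array task id

# Set up the DM stack from #242; install root taken from the traceback
# later in this thread, but the loadLSST.bash location is an assumption.
STACK_DIR=/global/common/cori/contrib/lsst/lsstDM/w.2016.20
source "$STACK_DIR/loadLSST.bash"
setup pipe_tasks   # eups product name assumed

# Map this array element to a visit id; visit_list.txt (one visit id per
# line) is a hypothetical helper file.
VISIT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" visit_list.txt)

# One short processEimage run per array element, writing to scratch.
processEimage.py "$SCRATCH/Twinkles/input" \
  --id visit="$VISIT" \
  --output "$SCRATCH/Twinkles/output"
```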
Hi Tony: your best bet is absolutely (and always is!) to submit a ticket, including your batch scripts so they can look for problems with your syntax (that's pretty common with job arrays). You'll get a much faster and more comprehensive response than I can give you; the NERSC consultants are great. But good news that your large array ran quickly! And that it can run at all; great that the stack is working now.
Thanks @djbard! And you Tony, for pushing on this. I guess submitting jobs via the workflow engine is the next step, while the NERSC consultants chew on your log files: is #87 the right issue to work on?
Tony, it could be that you can tick off some of your checkboxes by closing some of those issues. I feel like you deserve some issue closure.
Before contacting the NERSC consultants, I decided to ask DM support first to see if they have any suggestions:
https://community.lsst.org/t/disk-contention-when-running-many-dm-jobs-advice-needed/791
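As an aside, a common first check for many-small-file workloads on Lustre scratch is the stripe layout. This is generic Lustre advice rather than anything prescribed in the thread, and the paths are assumptions based on the output location mentioned later:

```bash
# Inspect the current stripe layout of the output area.
lfs getstripe /global/cscratch1/sd/tony_j/Twinkles/output

# Small files usually do better with a single stripe; this affects
# files created after the change, not existing ones.
lfs setstripe -c 1 "$SCRATCH/Twinkles/output"
```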
I'd strongly recommend asking the consultants anyway, @tony-johnson - that's what they're there for! I am quite sure the issue is with your scripts, not with $SCRATCH contention. You can check CSCRATCH performance here: https://my.nersc.gov/filesystems-cs.php
Impossible -- I never have issues with my scripts! But OK, I have submitted a ticket; I don't know if you @djbard or anyone else can see it?
Ha! Yes, I can see that (and am "watching" the ticket), but no one else will be able to. At the least, by filing the ticket we're raising awareness of how this project will be running jobs.
Status update:
After moving all scripts, input and output data, and log files to the scratch disk, I have succeeded in running all of the Run 1.1 eimage jobs at NERSC, although still with very low efficiency (~4% CPU time / elapsed time). I'm not sure keeping everything on scratch is a long-term solution, but there are some other suggestions from the DM and NERSC consultants to try (time permitting).
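For a number like that ~4%, Slurm's accounting tools can report CPU vs. elapsed time per job directly. This is a generic Slurm sketch, not something prescribed in the thread; the job id is the one from the time-limit message above, reused purely as an example.

```bash
# Per-job CPU vs. wall-clock from the Slurm accounting database.
sacct -j 2245087 --format=JobID,Elapsed,TotalCPU,State

# Or summarize everything you ran since a given date.
sacct -u "$USER" -S 2016-05-18 --format=JobID,Elapsed,TotalCPU,MaxRSS,State
```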
For now I plan to use the eimage output to try re-running the Run 1.1 coadd jobs at NERSC.
Cori 06/11/16 4:30-06/27/16 8:00 PDT Scheduled maintenance. Cori will be unavailable while preparations are made for the installation of Cori Phase II. Login nodes and job submissions will be unavailable for the duration of the maintenance. The entire two week window is reserved, although the system may be available sooner depending on the progress of the maintenance.
FYI: I've gotten a shared partition installed on Edison for the duration of the outage (and perhaps for the rest of the summer, until after the Cori P1/P2 integration). It has 1300MB/core (so less than the Cori shared partition) but it will be useful.
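(Hedged aside: with only ~1300MB/core, array tasks would presumably want an explicit per-core memory request at submission time. The partition name, matching Cori's "shared", and the script name are assumptions.)

```bash
# Hypothetical submission against the Edison shared partition.
sbatch --partition=shared --mem-per-cpu=1300M process_eimage_array.sh
```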
That sounds very useful, @djbard; would we need a fresh DM installation for Edison?
I think you would need an Edison-specific installation, @heather999. But it should be the same installation process. Another thing to note - the upcoming downtime on Cori is to install a new OS, so the DM stack (and indeed all code) may need to be recompiled after it. We've been told that everything should recompile just fine, no changes necessary...
Thanks, @djbard, yes I suspect this should be very straightforward to re-install. @tony-johnson is it ok if I get to that tonight or tomorrow? Or are you hoping to hop onto Edison sooner?
@heather999, I am not entirely sure I have plans to hop onto Edison at all, because next week I will be busy with the camera workshop at SLAC, and I am not sure we are ready to run Twinkles Run 2. I think this should be a topic for tomorrow's Twinkles meeting.
Yes - I added an agenda item at the top of the agenda. @djbard we'll report back here afterwards - you're welcome to join us at 12noon PDT as well, of course! :-)
Run 2 (== Run 1.1 but at NERSC) is now complete (at last). Will add details (and close issue) tomorrow.
More details on Run 2. The stream is complete; the pipeline task can be found here:
http://srs.slac.stanford.edu/Pipeline-II/exp/SRS/si.jsp?stream=41843928
Note that the links to the log files require the dev pipeline web interface, so they do not work from the above link (@brianv0, can we move those changes to prod?). The output from the jobs can be found here:
/global/cscratch1/sd/tony_j/Twinkles/output/work/2/output
Maybe this should be moved to a more permanent location.
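If it does get moved, something along these lines would do it. The destination under the NERSC project file system is an assumption, including the project directory name, which is a guess for illustration:

```bash
# Copy the Run 2 output off purgeable scratch space to the project
# file system; "lsst" as the project directory name is hypothetical.
DEST=/global/project/projectdirs/lsst/Twinkles/run2
mkdir -p "$DEST"
cp -r /global/cscratch1/sd/tony_j/Twinkles/output/work/2/output "$DEST/"
```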
Although the jobs eventually completed, there are a number of outstanding issues:
```
CameraMapper: Loading registry registry from /global/cscratch1/sd/tony_j/Twinkles/output/work/2/output/_parent/registry.sqlite3
Traceback (most recent call last):
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/meas_base/2016_01.0-20-g803754d+1/bin/forcedPhotCcd.py", line 24, in <module>
    ForcedPhotCcdTask.parseAndRun()
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/pipe_base/12.0+3/python/lsst/pipe/base/cmdLineTask.py", line 452, in parseAndRun
    resultList = taskRunner.run(parsedCmd)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/pipe_base/12.0+3/python/lsst/pipe/base/cmdLineTask.py", line 194, in run
    if self.precall(parsedCmd):
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/pipe_base/12.0+3/python/lsst/pipe/base/cmdLineTask.py", line 284, in precall
    task.writeSchemas(parsedCmd.butler, clobber=self.clobberConfig, doBackup=self.doBackup)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/pipe_base/12.0+3/python/lsst/pipe/base/cmdLineTask.py", line 527, in writeSchemas
    butler.put(catalog, schemaDataset, doBackup=doBackup)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/daf_persistence/12.0+1/python/lsst/daf/persistence/butler.py", line 396, in put
    location.repository.write(location, obj)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/daf_persistence/12.0+1/python/lsst/daf/persistence/repository.py", line 266, in write
    return self._access.write(butlerLocation, obj)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/daf_persistence/12.0+1/python/lsst/daf/persistence/access.py", line 117, in write
    self.storage.write(butlerLocation, obj)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/daf_persistence/12.0+1/python/lsst/daf/persistence/posixStorage.py", line 207, in write
    self.persistence.persist(obj, storageList, additionalData)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/miniconda/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/daf_persistence/12.0+1/python/lsst/daf/persistence/safeFileIo.py", line 84, in SafeFilename
    setFileMode(name)
  File "/global/common/cori/contrib/lsst/lsstDM/w.2016.20/lsstsw/stack/Linux64/daf_persistence/12.0+1/python/lsst/daf/persistence/safeFileIo.py", line 48, in setFileMode
    os.chmod(filename, (~umask & 0o666))
OSError: [Errno 2] No such file or directory: '/global/cscratch1/sd/tony_j/Twinkles/output/work/2/output/schema/forced.fits'
```
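One plausible (unconfirmed) reading of the failure above: os.chmod hits ENOENT because concurrent forcedPhotCcd tasks share one output repo and race to write the same schema file, so one task's write removes the file another is about to chmod. A quick check under that assumption, with a hypothetical log directory:

```bash
# Count how many task logs mention the same schema path; more than one
# would suggest concurrent writers racing on that file.
grep -l 'schema/forced.fits' "$SCRATCH"/Twinkles/logs/*.out | wc -l
```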
Some plots showing throughput.
Declaring victory and closing issue.
Excellent! Great if you can talk us through these results when we meet at 12, as well.
I'd like to join in your noontime meeting to hear how things went. What are the connection deets?
I'll mention you on the reminder - hang on a tick
An epic task indeed! Not sure when the best time to do this is, but Run 1.1 could be a good opportunity, at least after it has been executed successfully at SLAC. If we are hoping to do Run 2 at NERSC, this could be the way to go.
Things that need doing are:
There are probably others, which we can add as we go. Assigning oversight of this to Tony!