LSSTDESC / Twinkles

10 years. 6 filters. 1 tiny patch of sky. Thousands of time-variable cosmological distance probes.

Run3 phoSim production #369

Closed TomGlanzman closed 7 years ago

TomGlanzman commented 7 years ago

This is where news of the Twinkles Run3 phoSim production will appear.

First news item:

The first 40 run3 test runs are happening now. The SLAC Pipeline task is TW-phoSim-r3 and progress may be monitored at this link. At this moment, 19 of 40 are complete; the remaining runs should be complete by Monday morning.

This task includes the latest dynamically generated instanceCatalogs and SED files from Rahul. It also makes use of the SLAC lustre high-performance file system in an effort to avoid the I/O issues observed in run1 (due to phoSim's creation of its /work directory).

These first 40 are tests. It would not be unreasonable to redo them if a configuration or other issue arises. Data may be searched in the data catalog: http://srs.slac.stanford.edu/DataCatalog/folder.jsp?folder=16123472, or directly at SLAC, /nfs/farm/g/desc/u1/Pipeline-tasks/TW-phoSim-r3/phosim_output. From there, dig down into the subdirectories to the "output" directory and you will find two files per visit: lsst_e_<visit#>_fN_E000.fits.gz and the associated centroid file.

Please post operational questions/concerns here.

rbiswas4 commented 7 years ago

Is there a way to see the instance catalogs?

sethdigel commented 7 years ago

> Is there a way to see the instance catalogs?

They seem to be in subdirectories of /lustre/ki/pfs/fermi_scratch/lsst/TW-phoSim-r3/singleSensor/. I am not sure of the mapping scheme for the directory names, but the instance catalog for ObsHistID 250 is in /lustre/ki/pfs/fermi_scratch/lsst/TW-phoSim-r3/singleSensor/25/0. I think that this is the scratch area that nominally is temporary, but it looks like Tom has set the pipeline to leave these files in place, I guess for now.

TomGlanzman commented 7 years ago

Seth is correct. The basic naming convention is:

/lustre/ki/pfs/fermi_scratch/lsst/<taskName>/<(sub)taskName>/<stream>/<subStream>

In these directories, you will find:

instanceCatalog.txt (output of generatePhosimInputs)
spectra_files/ (output of generatePhosimInputs)
SEDs/ (collection of sym links to actual SED files and directories)
work/ (phoSim's normal work directory)

Lustre should be mounted on all SLAC public login and batch machines, as well as selected others. However, I have run across examples where this is not true and have resorted to contacting unix-admin. Let me know if you cannot access this area (and include the hostname from which you are trying).

This area is intended to be semi-persistent (for the duration of the visit processing only) but for the purposes of testing and validation, I have disabled the normal cleanup.

The needed mapping is between visit number (obsHistID) and streamNumber. This file contains a list of all obsHistIDs, in order, the first corresponding to streamNumber 0, followed by 1,2,3,...

/nfs/farm/g/desc/u1/Pipeline-tasks/TW-phoSim-r3/config/twinkles_visits.txt

The first forty visits/obsHistIDs are:

230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269

Note that this file was generated last week and before any special visit ordering was decided.
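
For reference, here is a minimal sketch of building that mapping from the visit list, assuming one obsHistID per line in the file above (the variable names are illustrative, not part of the workflow scripts):

```python
# Minimal sketch: build the obsHistID <-> streamNumber mapping from the file
# above, assuming one obsHistID per line (helper names are illustrative,
# not part of the Twinkles workflow scripts).
visit_file = "/nfs/farm/g/desc/u1/Pipeline-tasks/TW-phoSim-r3/config/twinkles_visits.txt"

with open(visit_file) as f:
    obshistids = [int(line) for line in f if line.strip()]

# The first obsHistID corresponds to streamNumber 0, the next to 1, and so on.
stream_for_visit = {v: s for s, v in enumerate(obshistids)}
visit_for_stream = dict(enumerate(obshistids))

print(stream_for_visit[250])   # stream that simulated obsHistID 250
print(visit_for_stream[0])     # 230 for these first forty test runs
```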

TomGlanzman commented 7 years ago

All 40 test runs are complete. (A single job ran amok and was killed after an anomalous 38-hour run, then restarted.)

jchiang87 commented 7 years ago

@TomGlanzman Will the output from those runs remain in the lustre locations or will they be copied or moved to a longer term archiving area?

TomGlanzman commented 7 years ago

@jchiang87 The physics output, "e" files and centroid files, are (and will remain) in persistent DESC NFS space, /nfs/farm/g/desc/u1/Pipeline-tasks/TW-phoSim-r3/phosim_output. The intermediate files stored in lustre are handled as follows:

1) phoSim /work directory - automatically cleaned up by phoSim. I do not know of a way to preserve its contents.

2) instanceCatalog and sprinkled SED files - intended to be completely removed at phoSim completion, but currently these are being kept for test/validation purposes. There is no hurry to clean up the initial 40 test runs' data. And there is currently no plan to move these data to a more permanent location. If this is not the desired policy, let me know.

jchiang87 commented 7 years ago

Actually, I did mean the inputs, i.e., the instance catalogs, etc. If they will be available for a week or so, I guess that's fine.

TomGlanzman commented 7 years ago

They can be left for as long as you (or anyone) would like. If lustre space becomes an issue, they can be moved elsewhere. Are you also suggesting that in the future, these inputs should be preserved for some period of time?

jchiang87 commented 7 years ago

Not sure. I've found it useful to look back at the instance catalogs for Run1.1 (i.e., to debug the altitude problem for phosim 3.4.2), so I think it would be good to archive those. I'm not sure it is worth saving the associated SED files, though.

jchiang87 commented 7 years ago

Using v12_1 of the Stack, I've run all 40 visits through processEimage.py. They all finished and gave reasonable zeropoints and seeing values. FWIW, here's a comparison of the measured seeing from calexp vs the raw seeing from the minion_1016 db file (attached plot: seeing_vs_rawseeing_run3_preview). I guess this seems reasonable (or at least similar to what we saw with Run1.1).

Here is a log of the screen output from the processEimage.py runs: processEimage.txt

So given this, I'd say the phosim data generation part is probably good to go.

For the rest of the Level 2 pipeline tasks, there may be issues. I also ran the post-processEimage.py tasks, and they seemed to execute ok until mergeCoaddMeasurements.py:

  mergeCoaddMeasurements.py output_repo/ --id filter=g^r^i tract=0 --output output_repo --doraise --clobber-config
root INFO: Loading config overrride file '/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/obs_lsstSim/12.1-1-g8d21232+5/config/mergeCoaddMeasurements.py'
root INFO: Config override file does not exist: u'/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/obs_lsstSim/12.1-1-g8d21232+5/config/lsstSim/mergeCoaddMeasurements.py'
root INFO: input=/nfs/farm/g/desc/u1/users/jchiang/desc_projects/twinkles/Run3_precursor/output_repo
root INFO: calib=None
root INFO: output=/nfs/farm/g/desc/u1/users/jchiang/desc_projects/twinkles/Run3_precursor/output_repo
CameraMapper INFO: Loading registry registry from /nfs/farm/g/desc/u1/users/jchiang/desc_projects/twinkles/Run3_precursor/output_repo/_parent/registry.sqlite3
Traceback (most recent call last):
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/pipe_tasks/12.1+4/bin/mergeCoaddMeasurements.py", line 3, in <module>
    MergeMeasurementsTask.parseAndRun()
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/pipe_base/12.1+1/python/lsst/pipe/base/cmdLineTask.py", line 472, in parseAndRun
    resultList = taskRunner.run(parsedCmd)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/pipe_base/12.1+1/python/lsst/pipe/base/cmdLineTask.py", line 201, in run
    if self.precall(parsedCmd):
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/pipe_base/12.1+1/python/lsst/pipe/base/cmdLineTask.py", line 299, in precall
    task = self.makeTask(parsedCmd=parsedCmd)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/pipe_tasks/12.1+4/python/lsst/pipe/tasks/multiBand.py", line 324, in makeTask
    return self.TaskClass(config=self.config, log=self.log, butler=butler)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/pipe_tasks/12.1+4/python/lsst/pipe/tasks/multiBand.py", line 1265, in __init__
    inputSchema = self.getInputSchema(butler=butler, schema=schema)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/pipe_tasks/12.1+4/python/lsst/pipe/tasks/multiBand.py", line 422, in getInputSchema
    immediate=True).schema
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/daf_persistence/12.1/python/lsst/daf/persistence/butler.py", line 614, in get
    return callback()
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/daf_persistence/12.1/python/lsst/daf/persistence/butler.py", line 609, in <lambda>
    callback = lambda: self._read(location)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/daf_persistence/12.1/python/lsst/daf/persistence/butler.py", line 694, in _read
    results = location.repository.read(location)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/daf_persistence/12.1/python/lsst/daf/persistence/repository.py", line 149, in read
    return self._storage.read(butlerLocation)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/Linux64/daf_persistence/12.1/python/lsst/daf/persistence/posixStorage.py", line 268, in read
    raise RuntimeError("No such FITS catalog file: " + logLoc.locString())
RuntimeError: No such FITS catalog file: /nfs/farm/g/desc/u1/users/jchiang/desc_projects/twinkles/Run3_precursor/output_repo/schema/deepCoadd_meas.fits

Here is the full log for the post-processEimage tasks: post_processEimage_tasks.txt

I haven't tried to dig into the output log yet to understand things, but maybe @SimonKrughoff you can see something obviously wrong with how I ran the preceding tasks?

The Level 2 output is at SLAC in

/nfs/farm/g/desc/u1/users/jchiang/desc_projects/twinkles/Run3_precursor/output_repo

jchiang87 commented 7 years ago

It seems that the problem is that in version w.2016.40 (aka v12_1) of the Stack, the tasks measureCoaddSources.py and forcedPhotCcd.py no longer have the config option measurement.doApplyApCorr. If I remove that option from the command line config, both tasks (and the intervening one) seem to run successfully. We need to update our workflow scripts with this change.

There are no release notes for the v12_1 release or any other documentation I could find describing this change.

jchiang87 commented 7 years ago

Here is the photometric repeatability plot from plot_point_mags.py (attached: run3_precursor_repeatability).

TomGlanzman commented 7 years ago

@drphilmarshall @jchiang87 @sethdigel @rbiswas4 and other interested parties:

The Twinkles phoSim run3 workflow scripts and auxiliary files have been updated in GitHub as Twinkles/workflows/TW-phoSim-r3 (branch issue/315/Run3phoSimWorkflow). A short narrative describing this workflow is available in Google Docs. In combination with the 40 test runs, this should be sufficient to begin a review process. Please let me know if there are questions.

sethdigel commented 7 years ago

The writeup in the Google document is very nice. I'd recommend eventually migrating it to the Twinkles repository.

I noticed that the instance catalogs cover a region with 0.3 deg radius. This is much bigger than it needs to be for a single sensor (~0.16 deg radius would do, 0.18 deg would be ample since bright stars are not present). I was wondering if there's a driver for 0.3 deg radius. There must be some performance penalty in generating the instance catalogs, and trimming them for each phosim run.
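
For what it's worth, here is a quick back-of-the-envelope check of the ~0.16 deg figure, assuming a 4096 x 4096 sensor with 0.2 arcsec pixels (nominal LSST values, not read from the Twinkles configuration):

```python
import math

# Back-of-the-envelope check of the ~0.16 deg single-sensor radius mentioned
# above, assuming a 4096 x 4096 sensor with 0.2 arcsec pixels (nominal LSST
# values, not read from the Twinkles configuration).
pixels_per_side = 4096
pixel_scale_arcsec = 0.2

side_deg = pixels_per_side * pixel_scale_arcsec / 3600.0   # ~0.23 deg on a side
half_diagonal_deg = side_deg * math.sqrt(2.0) / 2.0        # ~0.16 deg

print("sensor side: %.3f deg, minimum catalog radius: %.3f deg"
      % (side_deg, half_diagonal_deg))
```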

jchiang87 commented 7 years ago

0.3 deg is a number we've carried along from the very beginning. Part of the original motivation for having so large a region may have to do with the generation of the index files for the astrometry.net code which have been derived from the instance catalogs, but I'm just guessing. In any case, I don't think it's a big deal. With the local caching of the galaxies, it takes ~4-5 mins to generate an instance catalog and even less time to run the trim program. Compared to 9 hour phosim/raytrace runs, it's a very minor consideration.

drphilmarshall commented 7 years ago

Thanks Tom, that's great. As you say, let's look through what you have done as part of the review, which we can aim to finish off in our meeting tomorrow. Thanks also, Jim, for the repeatability plot: things are looking good, it seems. Do the images pass a visual inspection? (For divots, weirdness, etc.)

jchiang87 commented 7 years ago

I only looked at a few images directly, but they seemed a-ok.

jchiang87 commented 7 years ago

@TomGlanzman I've been analyzing the instance catalogs and SED files in

/lustre/ki/pfs/fermi_scratch/lsst/TW-phoSim-r3/singleSensor

but the files have disappeared. Have they been moved somewhere else?

TomGlanzman commented 7 years ago

@jchiang87, no, I've not touched those directories since the forty jobs completed early last week. The files were present on Thursday when we had our Twinkles meeting. I do not know what might have happened. Interestingly, some directories appear to have changed recently, e.g.,

(Sun 15:52) dragon@comet (bash) $ pwd
/lustre/ki/pfs/fermi_scratch/lsst/TW-phoSim-r3/singleSensor/39/0
(Sun 15:54) dragon@comet (bash) $ ls -l
total 48
drwxrwsr-x  4 lsstsim glast-pipeline  4096 Oct 30 01:55 ./    <-- directory modified
drwxrwsr-x  3 lsstsim glast-pipeline  4096 Oct 22 14:03 ../
drwxrwsr-x 15 lsstsim lsst            4096 Oct 23 08:18 SEDs/
drwxrwsr-x  2 lsstsim glast-pipeline 36864 Oct 23 23:20 work/

It appears that only the outputs of generatePhosimInput.py (the instance catalog and sprinkled SED files) have disappeared.

I can arrange for those files to be recreated if you need them.

sethdigel commented 7 years ago

I'd also be interested to have the instance catalogs back.

TomGlanzman commented 7 years ago

Election Day update: 1) After a couple of start-up hiccups, a set of 40 Twinkles phoSim visits has been submitted, many of which have completed. Monitor the workflow progress here: http://srs.slac.stanford.edu/Pipeline-II/exp/LSST-DESC/task.jsp?refreshRate=60&task=41901009

2) I have created a new Twinkles branch, issue/360/Run3_phoSim_production. A copy of this branch is being used to run the workflow task. This allows me to easily make mid-course operational corrections, such as adjusting the rate at which new visits pace themselves at start up to avoid I/O overload.

3) Jim kindly created the first Twinkles "release", Run3-phosim-v1, from which external tools will be run, e.g., for creating the sorted visit list and for the instanceCatalog generation.

4) For the moment, all transient data created to simulate a visit (instance catalog and SEDs) are being preserved. At some point (after the first ~1000 visits?), I will re-enable the cleanup of these files. The preserved files look like this:

-rwxr-sr-t 1 lsstsim glast-pipeline 63287962 Nov 7 19:40 instanceCatalog.txt*

which will hopefully prevent any future mysterious disappearance acts. Please let me know asap if you think any files have gone missing.

5) Where is the data?

Transient data: /lustre/ki/pfs/fermi_scratch/lsst/TW-phoSim-r3/singleSensor/<stream>/0

phoSim output: /nfs/farm/g/desc/u1/Pipeline-tasks/TW-phoSim-r3/phosim_output/<stream>/R22_S11/output

the contents of which look like, e.g.:

-rw-rw-r-- 1 lsstsim desc  3504319 Nov  8 00:11 centroid_lsst_e_230_f2_R22_S11_E000.txt
-rw-rw-r-- 1 lsstsim desc 25818137 Nov  8 00:11 lsst_e_230_f2_R22_S11_E000.fits.gz

Note that the phoSim output file names contain the visit (obsHistID) number (see the sketch at the end of this comment for one way to collect a visit's files).

One can also search the dataCatalog, http://srs.slac.stanford.edu/DataCatalog/folder.jsp?folder=16208097 which presents you with a list of streamIDs. Click on a stream, then the sensorID to see the list of phoSim output files. Clicking on a file will download it to your computer.

6) Please have a look at these data. I will slowly ramp up production in the meantime. Let me know asap if you find something amiss.
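
As an aside, here is a minimal sketch for collecting a given visit's phoSim output files under the layout in item 5; the directory pattern and helper name are assumptions based on the paths above, not part of the workflow scripts:

```python
import glob
import os

# Minimal sketch for locating a visit's phoSim outputs, assuming the
# phosim_output/<stream>/R22_S11/output layout shown above; the helper name
# is illustrative, not part of the Twinkles workflow.
PHOSIM_OUTPUT = "/nfs/farm/g/desc/u1/Pipeline-tasks/TW-phoSim-r3/phosim_output"

def outputs_for_visit(obshistid):
    """Return the e-image and centroid files whose names contain this obsHistID."""
    pattern = os.path.join(PHOSIM_OUTPUT, "*", "R22_S11", "output",
                           "*lsst_e_%d_f*_E000*" % obshistid)
    return sorted(glob.glob(pattern))

for path in outputs_for_visit(230):
    print(path)
# e.g. .../output/centroid_lsst_e_230_f2_R22_S11_E000.txt
#      .../output/lsst_e_230_f2_R22_S11_E000.fits.gz
```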

TomGlanzman commented 7 years ago

Black Wednesday update: The phoSim generation task continues to slowly ramp up as I monitor impact on critical infrastructure servers. Nearly 200 visits are complete. Early timing numbers show a mean wall clock time of nearly 10 hours. Am aiming for 2000 concurrently running jobs on the SLAC batch farm. There have been a surprising number of transient failures: mostly inability to load certain python modules, or not finding /lustre -- these will require more investigation and, hopefully, mitigation.

TomGlanzman commented 7 years ago

Further update: The /lustre problem has tentatively been traced to the way batch machines load the lustre module. A proposed change is to load the module at start-up rather than on demand. This should take place after tonight's routine reboot of the batch farm.

The max memory footprint for this Twinkles batch seems to be varying between 2.4 and 3.1 GB.

TomGlanzman commented 7 years ago

Production update: The problem with /lustre turns out to be a limitation of the metadata server's available space. One instance requires a lot of SED files (about 16k): roughly 12k 'production' SEDs plus 4k 'custom' (sprinkled) SEDs. In order to avoid making a separate copy of the 12k production SEDs, a tree of sym links is created to them. Unfortunately, these all count against the inode (or Lustre equivalent) limit. We are working on a solution/work-around. One consequence of this problem is that a large number of the existing jobs appear to have silently failed to write their output files (only about 360 have been produced). phosim.py is known to fail silently under certain conditions (e.g., when no appropriate SED file is found), so I will try to figure out a way to catch this condition. In the meantime, production is on hold, and some number of formerly "successful" runs will be resubmitted.
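
To illustrate why the link tree is hard on the metadata server, here is a rough sketch of the per-file sym-link scheme described above; the paths and layout are illustrative assumptions, not the actual workflow scripts:

```python
import os

# Rough sketch of the per-file sym-link scheme described above: every link
# is its own metadata entry, so ~12k production SEDs per visit add up
# quickly across many concurrent visits.  Paths and layout are illustrative
# assumptions, not the actual workflow scripts.
SED_LIBRARY = "/path/to/sims_sed_library"   # hypothetical location
VISIT_SED_DIR = "SEDs"                      # per-visit staging area

n_links = 0
for dirpath, dirnames, filenames in os.walk(SED_LIBRARY):
    rel = os.path.relpath(dirpath, SED_LIBRARY)
    os.makedirs(os.path.join(VISIT_SED_DIR, rel), exist_ok=True)
    for name in filenames:
        # one sym link, and hence one metadata entry, per SED file
        os.symlink(os.path.join(dirpath, name),
                   os.path.join(VISIT_SED_DIR, rel, name))
        n_links += 1

print("created %d links for this visit" % n_links)

# A lighter-weight alternative is to link whole library directories instead
# of individual files (a handful of entries per visit), keeping per-file
# links only for the few thousand sprinkled SEDs; the fix actually adopted
# is described later in this thread.
```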

TomGlanzman commented 7 years ago

Production update: A likely solution to the lustre troubles has been implemented in the phoSim workflow and is currently being tested. Given the possibility that previous visits may have become corrupted due to the lustre issues, I propose starting Run3 over from scratch. Please let me know if there are objections.

In addition to the lustre-motivated changes, I will also rebuild the visit list per issue #406.

drphilmarshall commented 7 years ago

Excellent news! Well done, Tom :-)

Restarting Run 3 is fine by me. I'll merge Rahul's pull request ( #406 ) now.

TomGlanzman commented 7 years ago

The Lustre issue was described in #410 and motivated PR #411. Beginning the countdown to restarting...

TomGlanzman commented 7 years ago

New version of Twinkles, Run3-phoSim-v2, installed and tested at SLAC. Workflow has restarted.

TomGlanzman commented 7 years ago

Twinkles visit list for the phoSim workflow has been regenerated using the recently merged master branch (commit 75eaf1ed7187d9118e741e78cdd5e6e79940d3cb) and minion_1016_sqlite.db.

Run     #Visits
3.1        1508
3.1b        329
3.2        2104
3.3       18029

Total     21970

The workflow has been restarted (from the very first visit) using xml v1.5. Current status: 227 visits complete, 950 submitted. Task may be monitored here: http://srs.slac.stanford.edu/Pipeline-II/exp/LSST-DESC/task.jsp?task=41927463
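
For a rough sense of the campaign length implied by these numbers, here is a back-of-the-envelope estimate combining the visit totals above with the ~10 hour mean wall clock time and ~2500-job concurrency quoted elsewhere in this thread (it ignores failures, ramp-up, and long-running outliers):

```python
# Back-of-the-envelope campaign estimate, combining the visit totals above
# with the ~10 hour mean wall clock time and ~2500-job concurrency quoted
# elsewhere in this thread; ignores failures, ramp-up, and long-running
# outliers.
total_visits = 1508 + 329 + 2104 + 18029   # = 21970
mean_wallclock_hr = 10.0
concurrent_jobs = 2500

campaign_hours = total_visits * mean_wallclock_hr / concurrent_jobs
print("ideal campaign length: %.0f hours (~%.1f days)"
      % (campaign_hours, campaign_hours / 24.0))   # ~88 hours, ~3.7 days
```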

TomGlanzman commented 7 years ago

All 1508 visits of Run3.1 have been submitted. I plan to continue ramping up the production so will not pause at this point. Currently, 381 visits complete, another ~1200 running.

TomGlanzman commented 7 years ago

Thursday morning update: 2081 visits have completed. It would be good if a few others could begin to look at the output and confirm the data are acceptable.

Op notes: there have been a number of transient infrastructure failures, such as inability to load python modules; inability to access certain disk areas; one instance of a badly written tar file (SEDs); failure of Twinkles setup.sh to run properly for various reasons; etc. A simple rollback generally works. This morning marks the end of job submission for Run 3.2, so have updated trickleStream to begin submitting the final ~18,000 visits of Run 3.3. Finally, have seen ~2500 concurrent jobs running without undue infrastructure stress.

TomGlanzman commented 7 years ago

Op notes follow-up: A transient failure pattern is beginning to appear. Somewhere in the environment setup to run the instanceCatalog/SEDs generation step, explicit file locks are being written into the DM stack installation directory, e.g.,

/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1

where jobs want to create files like .lockDir/shared-lsstsim.26192. I am seeing quite a few failures where these files are expected but not found. Does someone reading this happen to be familiar with this locking activity?

heather999 commented 7 years ago

I'm not familiar with this particular use of DMstack - but there was a missing magic line in: /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/DMstack/w.2016.40-sims_2.3.1/site/startup.py

I've now added:

hooks.config.site.lockDirectoryBase = None

I'm not sure if this will help :) but it might for those jobs that have not yet started up. Is there a way we can tell if this helps to avoid such failures? I'm a little confused why this would be transient, but perhaps the jobs are competing for locks; with locks turned off, perhaps this will no longer be an issue.

sethdigel commented 7 years ago

Looking at the first visit (obsHistId = 230), the agreement between the coordinates of sources in the input instance catalog and the output centroid file has not changed - not that we'd expect it to.

TomGlanzman commented 7 years ago

Friday morning update: 5011 visits complete and only three failures overnight! This represents nearly 23% of the Run3 visits.

Op notes: Thanks to Heather and her file locking fix, the bulk of instance catalog generation failures have completely disappeared. There remains a low (and somewhat troubling) level of system CPU activity on the DM file server, but we can live with that. A second change in the way the production SED library directories are sym-linked to local /scratch space has almost completely eliminated the annoying I/O errors associated with creating 12,000 links. What remains is a very rare problem in which SED.tar.gz files are not created properly, which causes a subsequent job step to fail. Am looking into a way to validate tar files.
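
One possible validation approach (a sketch only, not the check actually adopted in the workflow) is to walk the archive and read every member, so a truncated or corrupt SED.tar.gz fails loudly before phoSim is launched:

```python
import tarfile

# Sketch of one way to validate the SED tarballs mentioned above: walk the
# archive and read every member, so a truncated or corrupt SED.tar.gz fails
# loudly before phoSim is launched.  Illustrative only; not the check
# actually adopted in the workflow.
def tarball_is_valid(path):
    try:
        with tarfile.open(path, "r:gz") as tar:
            for member in tar:
                if member.isfile():
                    with tar.extractfile(member) as payload:
                        while payload.read(1 << 20):   # read in 1 MB chunks
                            pass
        return True
    except (tarfile.TarError, EOFError, OSError):
        return False

print(tarball_is_valid("SED.tar.gz"))
```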

Finally, starting tomorrow and continuing for about 10 days, the level of attention and tending I will give to this task will reduce from "quite a bit" to "very little" due to travel and the holiday.

drphilmarshall commented 7 years ago

A well-deserved break! Thanks so much for getting the engine working so well, it's great to see it whirring away. And what a great way for a software robot to spend Thanksgiving :-)

TomGlanzman commented 7 years ago

Saturday morning update: 8512 visits complete, 2500 running and with only a single overnight transient failure. This represents 39% of Run 3 visits complete.

TomGlanzman commented 7 years ago

Sunday morning Run3 phoSim generation update: 12192 (55%) visits complete, 2500 running, plus 4 transient overnight failures.

TomGlanzman commented 7 years ago

Monday morning update: 15023 (68%) visits complete, 2500 running + 2 transient failures.

TomGlanzman commented 7 years ago

Tuesday morning update: 17061 (78%) visits complete. Several bits of news from the past 24 hours:

1) An operational change (to the ~lsstsim/.eups directory) made to address a rare failure went wrong and caused ~4400 phoSimPrep jobs to fail. This has been reverted and all affected jobs are being requeued.

2) We are starting to see some legitimate phoSim job time-outs. The number of such jobs so far is a mere 13. All of them ran on fell-class batch nodes, among the oldest in the farm, so I may roll them back to see if they will land on faster machines (recall that job queue limits are based on wall clock time rather than CPU time).

3) All 21,970 visits have been submitted.

TomGlanzman commented 7 years ago

A recent issue has come up: phoSim is failing multiple runs due to its inability to locate the following SED:

starSED/wDs/bergeron_4000_80.dat_4200.gz

Still need to investigate...

danielsf commented 7 years ago

Do we know that PhoSim has successfully run on an InstanceCatalog with white dwarfs in it before? I ask because, looking at that file (and, indeed, all of the white dwarf SED files), I see that somehow the bergeron_ SED files got mangled so that the first line of useful information is in the header (clearly some \n were missing from the script that generated these SEDs). I don't know that this would cause PhoSim to choke, but it might.
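
For a quick look at where the first parseable data line sits in one of these gzipped SEDs, something like the following sketch could be used (it assumes the usual sims_sed_library layout of whitespace-separated numeric columns and has not been checked against the Bergeron files themselves):

```python
import gzip

# Quick-look sketch: find the first line of an SED file that parses as pure
# numeric columns, to see whether data have been folded into the header.
# Assumes the usual sims_sed_library layout of whitespace-separated numeric
# columns; not verified against the Bergeron files themselves.
def first_data_line(path):
    with gzip.open(path, "rt") as sed:
        for lineno, line in enumerate(sed):
            tokens = line.split()
            if not tokens:
                continue
            try:
                [float(t) for t in tokens]
                return lineno, line.rstrip()
            except ValueError:
                continue   # header or comment line
    return None, None

print(first_data_line("starSED/wDs/bergeron_4000_80.dat_4200.gz"))
```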

TomGlanzman commented 7 years ago

Hi Scott, a bit more digging has revealed that the sims_sed_library repository has been gutted. This may be due to an automatic 7-day cleanup on the Lustre file system where they are being stored. I am thinking this may explain all of the failures.

danielsf commented 7 years ago

{wipes the sweat from his forehead} (though only half of it, because it's still ridiculous that the Bergeron SEDs are formatted the way that they are...)

TomGlanzman commented 7 years ago

Wednesday morning update: 17755 (81%) of visits complete. Still slowly working off the failures due to the .eups directory issue (there is no automated rollback mechanism, so this is all done by hand, 50 visits at a time...). The Mystery of the Disappearing SEDs has been solved and so some dedicated LSSTDESC space has been requested that is not automatically cleaned up after 7 days. In the meantime, I have cheated a bit in order to prevent early Run3 instanceCatalogs and SEDs from being deleted.

drphilmarshall commented 7 years ago

Bravo, Tom! I am thoroughly impressed by the speed at which this data is getting cranked out. Thank you for doing the manual pipeline operation! Great stuff.


TomGlanzman commented 7 years ago

U.S. Thanksgiving Day update: 18628 visits complete (85%). Still (slowly) working off the large block of earlier failures. No new problems have arisen. A block of dedicated Lustre file space has been allocated for (future) LSSTDESC use.

TomGlanzman commented 7 years ago

Black Friday and Saturday update: 20171 visits complete (92%). Living up to its name, yesterday proved to be a disaster. Due to some apparent network issues, I inadvertently rolled back a large number of batch jobs, which put huge stress on wain025, the LSST file server, knocking it into a mode requiring computer center intervention. The situation will likely resolve only next week, so for the moment, the remaining visit simulations are on hold...

TomGlanzman commented 7 years ago

Tuesday 11/29 update: 21250 visits complete (97%). All of the failures from last week have been resubmitted and there are a mere 705 jobs still running - and these are all long-running jobs. To date, 15 legitimate time-outs have occurred; if that number does not grow significantly, we may decide to simply let them be until checkpointing or multi-threading is functional.

drphilmarshall commented 7 years ago

Great, thanks Tom! Are the "long-running jobs" in Run 3.4 (i.e., identified as long-running before you submitted them) or an earlier Run (but mis-identified as needing less than 100 hours by the PhoSimPredictor)? I guess the Run 3.3 and 3.4 jobs that either time out or are not even submitted need to be kept together so that when multi-threading becomes available we can run on them straight away. (We'll need to identify a small test set of visits that did complete as well, so that we can compare the multithreaded code results with the November 2016 version of PhoSim, but I guess we can define that any time.)

While we wait for the release of the next version of PhoSim, do we have access to any machines that do not have time limits that we can run on? 200 hours is only 8 days, I've run jobs on desktops for longer than that. If we have hundreds to do that could be a problem, but maybe the WFD subset could be attempted?
