LSSTDESC / Twinkles

10 years. 6 filters. 1 tiny patch of sky. Thousands of time-variable cosmological distance probes.
MIT License

Benchmark PhoSim using agreed-upon physics override file using samples of instance catalogs for all six bands #91

Closed TomGlanzman closed 8 years ago

TomGlanzman commented 8 years ago

This is an attempt to characterize and quantify the execution time needed by phoSim for various, realistic invocations. An early benchmark using the first of Simon's provided instanceCatalogs ran in 66h 4m on a SLAC machine (one core).

johnrpeterson commented 8 years ago

please send me the catalog/command files, so i can check this. nothing should take 66 hours.

john


SimonKrughoff commented 8 years ago

@johnrpeterson I think the catalog @TomGlanzman is running is this one: https://lsst-web.ncsa.illinois.edu/%7Ekrughoff/data/phosim_input_840.txt.gz

For the record, I tried simulating this catalog and it took so long I eventually killed it. Since then, I've been running with a version of this file with the brightest (< 13) stars stripped out.

TomGlanzman commented 8 years ago

I am sorry, but I sent John a link to the instance catalog which is almost the one Simon mentioned: phosim_input_840_stripped.txt.


SimonKrughoff commented 8 years ago

@TomGlanzman Thanks. I forgot that I sent you the stripped ones. That means the bright stars have been removed. For me this file takes about 6 hours to run on my MacBook Pro.

TomGlanzman commented 8 years ago

6 hours? Hmmm. I am wondering whether phoSim does not build optimized by default? On a Linux rhel6-64 machine, that instanceCatalog required 66 hours of time.


TomGlanzman commented 8 years ago

Looks like opt is enabled, so that's probably not the issue. An example compile looks like this: g++ -g -O3 -ffast-math -Wall -c .....

I did notice that only a single core was being used by phoSim during the execution, Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz, and the process was not memory bound.

I also noticed signs of seg-faulting in the log, and have forwarded the output to John for comment.

SimonKrughoff commented 8 years ago

I was only using one core too.

johnrpeterson commented 8 years ago

i think i may know the problem. the objects are spread over 9 chips, so i think you are simulating multiple chips in serial to get the 66 hours. if you just want the central chip you should be using: phosim catalog -s R22_S11 -c commandfile

is it true?


TomGlanzman commented 8 years ago

I did not use the "-s R22_S11" option, but sounds like I should be. (Simon, did you use that option in your testing?)

Thanks John.

TomGlanzman commented 8 years ago

John, am assuming R22_S11 is Raft row2/col2, Sensor row1,col1, where rows and columns range 0-4 for rafts and 0-2 for sensors? This would then be the center sensor on the center raft.
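A minimal sketch of that reading of the chip ID, assuming the digit order really is row then column (the question above is still open on that point). The names `ChipId`, `parse_chip_id` and `is_central_chip` are hypothetical helpers for illustration, not part of phoSim.

```python
import re
from typing import NamedTuple


class ChipId(NamedTuple):
    """Indices read from an ID such as 'R22_S11' (row/col order is an assumption)."""
    raft_row: int
    raft_col: int
    sensor_row: int
    sensor_col: int


# Rafts indexed 0-4 and sensors 0-2, as proposed in the comment above.
_CHIP_RE = re.compile(r"R([0-4])([0-4])_S([0-2])([0-2])")


def parse_chip_id(chip_id: str) -> ChipId:
    """Split a chip ID into raft and sensor indices, rejecting out-of-range digits."""
    match = _CHIP_RE.fullmatch(chip_id)
    if match is None:
        raise ValueError(f"not a valid chip ID: {chip_id!r}")
    return ChipId(*(int(group) for group in match.groups()))


def is_central_chip(chip_id: str) -> bool:
    """True for the middle sensor (1, 1) of the middle raft (2, 2)."""
    return parse_chip_id(chip_id) == ChipId(2, 2, 1, 1)
```

Under this reading, `is_central_chip("R22_S11")` holds, which matches the description of the center sensor on the center raft.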

cwwalter commented 8 years ago

Here is a reference figure

[Image: screen shot 2016-01-13 at 1 26 37 pm]

SimonKrughoff commented 8 years ago

@TomGlanzman I did use the "-s R22_S11" option. Sorry, I should have pointed that out in my cookbook.

The catalogs are generated such that they will cover the central chip at any rotation angle. That means that the 9 chips in the central raft will have some coverage, but will not be useful for doing Twinkles.

johnrpeterson commented 8 years ago

note that since the background can be the majority of the simulation time, then the 66 hours/ 9 gets it back to a reasonable number, if that wasn’t clear.
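The arithmetic behind that remark can be written out as a rough check. It assumes the 66h 4m run time splits evenly across the nine chips, which is only an approximation since the per-chip cost depends on background and source counts.

```python
from datetime import timedelta

# Wall-clock time of the first benchmark, run without the -s option.
total = timedelta(hours=66, minutes=4)

# Chips in the central raft that the catalog overlaps.
n_chips = 9

# Assumption: each chip costs about the same, so divide evenly.
per_chip = total / n_chips
per_chip_hours = per_chip.total_seconds() / 3600

print(f"{per_chip_hours:.1f} h per chip")  # prints "7.3 h per chip"
```

That lands in the same range as the roughly 6 hours reported for the single-chip run on the MacBook Pro.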


TomGlanzman commented 8 years ago

Thanks for the image @cwwalter .

@SimonKrughoff Thanks for verifying that option. Are there others I should be using to reproduce your benchmark?

@johnrpeterson Yes, hopefully the times will reduce significantly. I have resubmitted a set of six jobs, one per filter (instance catalogs provided by @jchiang87) so should know by the end of the day.

drphilmarshall commented 8 years ago

Great stuff! :-) Sounds like we have realistic field rotation in - what about small dithers? I'm not sure it's really necessary since we're only looking at one chip, but thought I'd ask.


jchiang87 commented 8 years ago

The instance catalogs that Tom mentioned are here /nfs/farm/g/desc/u1/data/Twinkles/phosim/instance_catalogs on the slac disks (as mentioned in #66). There is no dithering being applied to those files, as far as I can tell.

SimonKrughoff commented 8 years ago

It's correct that there are no dithers other than rotational. @drphilmarshall , it would not be hard to add small (tens of pixels) dithers. I think we just need to make a decision. I don't know that we learn much from putting in dithers, so my vote would be to leave them out.

drphilmarshall commented 8 years ago

Fine with me. Let's see how the stack copes. Thanks!


johnrpeterson commented 8 years ago

actually there will be dithers naturally from the perturbations even if there are none in the catalogs. so typically about an arcsecond (5 pixels) at most.
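For comparing these numbers with the "tens of pixels" dithers mentioned earlier, here is a small conversion sketch. The pixel scale of 0.2 arcsec per pixel is inferred from "an arcsecond (5 pixels)" above, and the function names are hypothetical.

```python
# Pixel scale implied by "an arcsecond (5 pixels)" in the comment above.
PIXEL_SCALE_ARCSEC = 0.2


def arcsec_to_pixels(offset_arcsec: float) -> float:
    """Convert an angular offset on the sky to a pixel offset on the chip."""
    return offset_arcsec / PIXEL_SCALE_ARCSEC


def pixels_to_arcsec(offset_pixels: float) -> float:
    """Convert a pixel offset on the chip to an angular offset on the sky."""
    return offset_pixels * PIXEL_SCALE_ARCSEC


# Natural dither from the perturbations versus an explicit 20-pixel dither.
natural_dither_pixels = arcsec_to_pixels(1.0)
explicit_dither_arcsec = pixels_to_arcsec(20)
```

With this scale, a 20-pixel dither corresponds to about 4 arcseconds, several times the natural offset described here.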


drphilmarshall commented 8 years ago

Good - one slight worry I have is that the stack astrometry routines might fail if only presented with image sets containing rotations but no translations. Thanks!


TomGlanzman commented 8 years ago

Some timing results are beginning to come in, but all job logs contain one or more seg-faults. @SimonKrughoff did you observe seg-faults in your test runs? For example, using the instanceCatalog phosim_input_g_0000860.txt, the log shows, in part,

[...]

Photon Raytrace

Installing Universe. Creating Air. Generating Turbulence. Building Optics. Placing Obstructions. Perturbing Design. Electrifying Devices. Contaminating Surfaces.

Diffracting.

/bin/sh: line 1: 6143 Segmentation fault (core dumped) /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.4.2/bin/raytrace < raytrace_860_R22_S11_E001.pars
Process Process-2:
Traceback (most recent call last):
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/anaconda/2.3.0/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/anaconda/2.3.0/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.4.2/phosim.py", line 42, in jobChip
    runProgram("raytrace < raytrace_"+fid+".pars", binDir)
  File "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.4.2/phosim.py", line 68, in runProgram
    raise RuntimeError("Error running %s" % myCommand)
RuntimeError: Error running /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.4.2/bin/raytrace < raytrace_860_R22_S11_E001.pars

Is there a way to enable additional debug output in phoSim?

The working directory for this particular run is here: /nfs/farm/g/desc/u1/Pipeline-tasks/Twinkles-phoSim/timeTests/filter1

The work and output directories for this particular run contain:

output:
total 6
drwxrwsr-x 2 dragon desc 2 Jan 12 07:31 ./
drwxrwsr-x 4 dragon desc 5 Jan 13 12:00 ../

work:
total 2200429
drwxrwsr-x 2 dragon desc 8 Jan 14 06:13 ./
drwxrwsr-x 4 dragon desc 5 Jan 13 12:00 ../
-rw------- 1 dragon desc 1969704960 Jan 14 06:12 core.6143
-rw------- 1 dragon desc 1969704960 Jan 13 21:07 core.7942
-rw-rw-r-- 1 dragon desc 5350 Jan 13 12:02 e2adc_860_R22_S11_E000.pars
-rw-rw-r-- 1 dragon desc 5350 Jan 13 21:07 e2adc_860_R22_S11_E001.pars
-rw-rw-r-- 1 dragon desc 12651448 Jan 13 12:02 raytrace_860_R22_S11_E000.pars
-rw-rw-r-- 1 dragon desc 12651448 Jan 13 21:07 raytrace_860_R22_S11_E001.pars

The instance catalog is /nfs/farm/g/desc/u1/data/Twinkles/phosim/instance_catalogs/phosim_input_g_0000860.txt

jchiang87 commented 8 years ago

@TomGlanzman I'd like to look at those core dumps. Can you enable read permission please?

TomGlanzman commented 8 years ago

@jchiang87 done:

(Thu 09:28) dragon@comet (bash) $ pwd
/nfs/farm/g/desc/u1/Pipeline-tasks/Twinkles-phoSim/timeTests/filter1/work
(Thu 09:29) dragon@comet (bash) $ ls -l
total 2200429
drwxrwsr-x 2 dragon desc 8 Jan 14 06:13 ./
drwxrwsr-x 4 dragon desc 5 Jan 13 12:00 ../
-rw-r--r-- 1 dragon desc 1969704960 Jan 14 06:12 core.6143
-rw-r--r-- 1 dragon desc 1969704960 Jan 13 21:07 core.7942
-rw-rw-r-- 1 dragon desc 5350 Jan 13 12:02 e2adc_860_R22_S11_E000.pars
-rw-rw-r-- 1 dragon desc 5350 Jan 13 21:07 e2adc_860_R22_S11_E001.pars
-rw-rw-r-- 1 dragon desc 12651448 Jan 13 12:02 raytrace_860_R22_S11_E000.pars
-rw-rw-r-- 1 dragon desc 12651448 Jan 13 21:07 raytrace_860_R22_S11_E001.pars

jchiang87 commented 8 years ago

The problem is that the directory path to the phosim installation is too long for the code that reads the cosmic ray data. Here is the backtrace from one of those core files:

(gdb) bt
#0  0x00002b0e350b2ca4 in _IO_vfscanf_internal () from /lib64/libc.so.6
#1  0x00002b0e350c15e8 in fscanf () from /lib64/libc.so.6
#2  0x0000000000420186 in Image::cosmicRays (this=0x7fffbc89bf50, 
    raynumber=0x7fffbc89bed8) at cosmicrays.cpp:83
#3  0x0000000000431556 in Image::photonLoop (this=0x7fffbc89bf50)
    at photonloop.cpp:693
#4  0x00000000004077eb in main () at main.cpp:27

The limit for the full path to those files is 80 characters as seen here, but the path to those files wants to be something like /nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44/phoSim/3.4.2/data/cosmic_rays/iray35.txt which is 96 characters long.

We could patch our local copy of the code, but there may be other cases where the char arrays are too short to handle the paths for our setup.
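As a quick way to spot other paths that might hit the same limit, a check along these lines could be run against an installation. This is only a sketch: it assumes the 80-character limit is the size of a C char array that must also hold the terminating NUL, and `fits_in_buffer` is a hypothetical helper, not phoSim code.

```python
# Limit on the full path quoted above; assumed to be a C char array size.
PATH_BUFFER_SIZE = 80


def fits_in_buffer(path: str, buffer_size: int = PATH_BUFFER_SIZE) -> bool:
    """Return True if the path plus its terminating NUL fits in the buffer."""
    return len(path.encode()) + 1 <= buffer_size


# The cosmic ray file path from the SLAC installation mentioned above.
cosmic_ray_file = (
    "/nfs/farm/g/lsst/u1/software/redhat6-x86_64-64bit-gcc44"
    "/phoSim/3.4.2/data/cosmic_rays/iray35.txt"
)

print(len(cosmic_ray_file), fits_in_buffer(cosmic_ray_file))  # prints "96 False"
```

Applying the same check to every data file under the install root would list candidates for the other short char arrays mentioned above, although it cannot tell which of them the code actually reads into fixed-size buffers.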

TomGlanzman commented 8 years ago

Thanks to @jchiang87. I vote for patching/rebuilding our local copy to continue making progress (and maybe we will run into another instance of this problem...). At the same time I would request @johnrpeterson and the phoSim development team to comb their code for directory path arrays and increase their lengths. In this day and age, is there any reason to have a limit on a directory path length less than, say, ~1000 characters? Or does it really need a fixed max length?

If that sounds reasonable, I'll rebuild the local SLAC copy.

johnrpeterson commented 8 years ago

yes, thanks, good find, Jim! please just patch your local copy. please issue a phosim ticket so we can do as tom says for the future. this issue hasn’t come up before.

john


jchiang87 commented 8 years ago

I created the issue here

TomGlanzman commented 8 years ago

Results are beginning to appear. Expect testing to continue through the first part of next week.

https://confluence.slac.stanford.edu/display/~dragon/phoSim+Timing

drphilmarshall commented 8 years ago

Great stuff Tom! :-)


connolly commented 8 years ago

Just for reference, some historical run times and CPU times from our Google work

[Images: raytrace_all, wide_nb_v423]

TomGlanzman commented 8 years ago

Thanks @connolly . Can you characterize the instanceCatalog for these runs? Single sensor?

connolly commented 8 years ago

These were individual sensors with the standard stellar and galaxy densities (should be similar source densities to what you have). We would simulate full focal planes (so not just the central sensor) and then use the trim program to subselect a catalog for a single sensor and simulate it independently. We used opsim to define the pointings. It ran concurrently on ~100K cores but in scavenger mode using spare cycles.

cheers Andy


TomGlanzman commented 8 years ago

The first round of timing tests is complete, including:

Should I be surprised that with the inclusion of physicsOverrides, execution times systematically increased across the board? Or that the current set of overrides represents only a minor decrease in execution time compared with the earlier set?

These test runs all fit comfortably within the existing batch queues. However, there is a significant range of execution times (~9-47 hequ hours) so there is a possibility that some instance catalog/filter combinations could exceed the 120 hour time limit.
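To put a number on that margin, here is a rough headroom calculation using only the figures quoted above. It assumes the ~9-47 hour range and the 120 hour limit are in the same units; `fits_in_queue` is a hypothetical helper.

```python
# Figures quoted above: observed execution times and the batch time limit.
OBSERVED_HOURS = (9, 47)
TIME_LIMIT_HOURS = 120

# How many times longer than the slowest test a job could run before the limit.
headroom = TIME_LIMIT_HOURS / max(OBSERVED_HOURS)


def fits_in_queue(expected_hours: float, limit_hours: float = TIME_LIMIT_HOURS) -> bool:
    """Return True if a job with the expected run time finishes within the limit."""
    return expected_hours <= limit_hours


print(f"headroom factor: {headroom:.2f}")  # prints "headroom factor: 2.55"
```

A catalog and filter combination would need to take about two and a half times as long as the slowest test before it ran into the limit.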

SimonKrughoff commented 8 years ago

@TomGlanzman I guess I'm not too surprised that the new physics override file doesn't decrease things too much. Did you keep the stdout/stderr from these runs? The background should stay the same as for the v1 override. The only decrease would be in the time to simulate bright stars.

TomGlanzman commented 8 years ago

@SimonKrughoff Thanks. Yes, the logs are preserved (log into rhel6-64.slac.stanford.edu and look here: /nfs/farm/g/desc/u1/Pipeline-tasks/Twinkles-phoSim/timeTests; each test/run has its own directory and a log is contained therein). A quick tkdiff of a couple of logs indicates the only config difference is the blooming and saturation, as expected. And, yes, the Raytrace summary table does indicate a significant increase in processing rate for the brightest stars.

TomGlanzman commented 8 years ago

This issue may be reopened if/when additional phoSim configurations need timing data.