LSSTDESC / Twinkles

10 years. 6 filters. 1 tiny patch of sky. Thousands of time-variable cosmological distance probes.

Visit metadata analysis #159

Closed drphilmarshall closed 8 years ago

drphilmarshall commented 8 years ago

In the PhoSim image generation master thread https://github.com/DarkEnergyScienceCollaboration/Twinkles/issues/137#issuecomment-191938375 @sethdigel is making some cracking plots showing the link between observing conditions and PhoSim run times. I think we should extend this a little bit, so we can look at how things like observed image quality, observed image depth and so on are distributed, and how they depend on each other. Seth, @jchiang87 - how should we proceed? Should we start a validation module, put some functions in it, and make some more scripts for the workflow? Or focus on collating more metadata and making it available for analysis by more people? Is a PhoSim CPU time predictor a useful tool that we should try to put together?

drphilmarshall commented 8 years ago

Talking to @sethdigel , we decided to try extending his analysis into the murky world of scikit-learn at the Hack Day next week - we'll just aim for a nice notebook to start with, and can figure out scripts later. I guess we can put the notebook in examples/notebooks, although what we are talking about is not really an example...

Seth, here's the machine learning example notebook I showed you:

https://github.com/drphilmarshall/StatisticalMethods/blob/master/examples/SDSScatalog/Quasars.ipynb

In that repo there is also an introductory tutorial, and both are linked from lesson 9's notebook. I'd suggest that you fork and clone that repo, and try running the lesson 9 notebooks. And then we'll just need visit metadata in csv format for next Friday :-)
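
To be concrete about the CSV: something like the sketch below is all we would need to get going in the notebook (untested, and the file and column names are just placeholders for whatever Seth ends up extracting).

```python
# Minimal sketch of reading the visit metadata we'll need on Friday.
# "visit_metadata.csv" and the column names are placeholders, not a spec.
import pandas as pd

visits = pd.read_csv("visit_metadata.csv")
print(visits.columns.tolist())   # e.g. filter, moonalt, moonphase, cputime, ...
print(visits.describe())         # quick sanity check of ranges and missing values
```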

drphilmarshall commented 8 years ago

PS. @sethdigel can you please post this hack idea to the Hack Day confluence page, and provide a link to this thread? You'll need some sort of catchy project name :-)

sethdigel commented 8 years ago

> PS. @sethdigel can you please post this hack idea to the Hack Day confluence page, and provide a link to this thread? You'll need some sort of catchy project name :-)

Done. I went for descriptive rather than catchy.

About 200 of the Run 1 visits are still running in the pipeline. These jobs are all up to about 4000 minutes of CPU time. By some time tomorrow all of the Run 1 jobs will have either finished or hit the 5-day run time limit, at which point a complete set of metadata for Run 1 Phosim simulated visits can be assembled.

drphilmarshall commented 8 years ago

Perfect! This should be fun :-)


sethdigel commented 8 years ago

About 150 of the Run 1 visits are still running. I did the math wrong last night regarding how much longer they can go before hitting the CPU time run limit, which is 7200 minutes. The remaining jobs will not hit the CPU time limit until Monday morning.

Also, I am thinking that the analysis will need to fold in the type of batch host in the SLAC farm. Some generations of batch hosts do a lot more per CPU minute than others (and this may be the reason for the banding of CPU times for a given moonalt, moonphase, and filter). Probably we can at least normalize CPU times to a common scale.

The table below lists the 'CPU Factors' assigned in the LSF system for the various classes of hosts that have been used in generating Run 1. These presumably relate to relative speeds. About half of the jobs ran on hequ hosts and one third on fell hosts.

| Host class | CPU Factor |
| --- | --- |
| bullet | 13.99 |
| dole | 15.61 |
| fell | 11.00 |
| hequ | 14.58 |
| kiso | 12.16 |
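
If we do normalize, something roughly like this sketch could work (untested; it assumes a DataFrame with a cputime column and a host-class column, and treating the LSF CPU factor as a simple linear speed scaling is my guess rather than anything documented).

```python
import pandas as pd

# LSF 'CPU factors' for the Run 1 host classes (from the table above).
CPU_FACTOR = {"bullet": 13.99, "dole": 15.61, "fell": 11.00,
              "hequ": 14.58, "kiso": 12.16}

def normalize_cputime(df, reference="fell"):
    """Rescale cputime (seconds) to a common host speed.

    Assumes df has 'cputime' and 'host_class' columns; the linear scaling
    by CPU factor is a guess at how the LSF numbers should be applied.
    """
    factors = df["host_class"].map(CPU_FACTOR)
    return df["cputime"] * factors / CPU_FACTOR[reference]
```
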
drphilmarshall commented 8 years ago

Nice! The host-class CPU factor looks like an excellent new feature to add to the csv file, and we can decide on Friday whether to correct the CPU times or just throw in the CPU factor as a feature.


brianv0 commented 8 years ago

@sethdigel I think you'll want to double check this. SLAC discourages the use of CPU time (or named queues) and suggests users specify only a wall clock time; in fact, the bsub wrapper script SLAC maintains is supposed to divide a supplied CPU time by 5 and then just set that as the wall clock time. That number was arrived at by assuming a job might run on a fell machine (CPU factor ~10) and giving it twice the amount of time to run.

sethdigel commented 8 years ago

Thanks, Brian. What we are looking at doing is studying the dependence of the actual CPU time for phosim runs (extracted from the pipeline log file for the runs) on some basic parameters in the phosim instance catalogs (like the altitude of the moon). In a preliminary look at the Run 1 output, posted on issue #137, for runs with similar parameters (moon altitude, moon phase, filter), the CPU times appear to have two ranges, separated by a constant factor. I have not looked into it quantitatively yet, but I was guessing that it might be due to the batch hosts not all being the same speed. I posted the CPU Factors for potential future reference because that was the closest thing that I could find that looked like a measure of relative speeds, and because it was not particularly easy to find (it involved using the bhost command for specific hosts).

sethdigel commented 8 years ago

About 40 phosim runs are still going. Almost all are about to reach the run limit. Three are runs that Tom restarted after they failed for some (probably transient) reason. It looks like about 105 of the runs either have timed out, or will. These represent 1.4 CPU years. The runs that finished used 3.5 CPU years.

I've made a csv file with the metadata from the phosim runs, collected from the instance catalogs and the log files from the pipeline. Here are the headings: obshistid, expmjd, filter, start, end, altitude, rawseeing, airmass, moonalt, moonphase, dist2moon, sunalt, cputime, hostname, runlimit. (start and end are the MJD starting and ending times of the runs, in case wall clock time turns out to be interesting, hostname is the first character of the batch host name, and runlimit is a flag for runs that hit the limit - which did not always occur at exactly the same CPU time.)

Where in github or Confluence would be the right place to put the file?

Here is an updated plot of CPU time vs. moonaltitude, with the points color coded by filter and sized according to moonphase. The horizontal dashed line is the approximate run limit (5 days) and the vertical dashed line is at 0 deg moon altitude. The histogram has a linear scale and shows the distribution of moon altitude for the runs that are still going. The points with '+' signs hit the run limit (and so produced no phosim output).

[Figure cpu_moonalt_comb: CPU time vs. moon altitude, color coded by filter and sized by moon phase]

drphilmarshall commented 8 years ago

Nice! K nearest neighbors or Random Forest are going to clean up on the CPU time prediction, I think. Can you post this plot and two bullets of text to the Twinkles slides at https://docs.google.com/presentation/d/1MdGGDrITW4-n04goJNYBAoVwnQBjRkibxuJY8EJZWXc please? Good to chat about this in our session.

I'd say, just put the data file in your public_html for now, and post the URL to this thread. We can pull directly from there in our hack notebook. Thanks Seth! :-)
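
To fix ideas, here is roughly what I mean by "clean up", as a hedged scikit-learn sketch (untested; the file name and feature list are guesses based on the headings Seth listed, and we can argue about n_neighbors on Friday).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("run1_metadata.csv")                 # assumed file name
df = df[(df["runlimit"] == 0) & (df["cputime"] > 0)]  # keep finished runs only

features = ["filter", "altitude", "rawseeing", "airmass",
            "moonalt", "moonphase", "dist2moon", "sunalt"]
X = df[features].values
y = np.log10(df["cputime"].values)                    # predict log10(CPU seconds)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
print("R^2 on held-out visits:", knn.score(X_test, y_test))
```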


sethdigel commented 8 years ago

The metadata for the Run 1 phosim runs are in this file: http://www.slac.stanford.edu/~digel/lsst/run1_metadata.csv

Run 1 has 1227 observations of a Deep Drilling Field at RA, Dec = 53.0091, -27.4389 deg (J2000). As of this writing 4 of the simulation runs are incomplete; they are flagged as described in the table below. In each case, these runs crashed for some (probably) transient reason and were restarted by Tom.

| Column | Description |
| --- | --- |
| obshistid | OpSim designator of the visit |
| expmjd | MJD of the (simulated) observation |
| filter | 0-5 for ugrizy |
| rotskypos | angle of sky relative to camera coordinates (deg) |
| start_run | starting time of the phosim run on the batch farm (MJD)* |
| end_run | ending time of the phosim run on the batch farm (MJD)* |
| altitude | elevation of the observing direction (deg) |
| rawseeing | seeing at 500 nm (arcsec), a phosim input |
| airmass | airmass at the altitude of the observation |
| moonalt | elevation of the Moon (deg) |
| moonphase | phase of the Moon (0-100) |
| dist2moon | angular distance of the Moon from the observing direction |
| sunalt | elevation of the Sun (deg) |
| cputime | CPU time required for the phosim run (sec)* |
| hostname | first character of the name of the batch host that ran the job (b, d, f, h, k)* |
| runlimit | flag indicating whether the phosim run was terminated at the 5-day execution time limit (1 = yes) |

* An x in the hostname column or a negative number for CPU time indicates that the phosim run is still executing. These jobs also have 0 as the start and end time.
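
For anyone who wants to pull the file directly, reading it and dropping the flagged rows should be as simple as the sketch below (untested; exactly how you treat the flagged and timed-out runs is up to you).

```python
import pandas as pd

url = "http://www.slac.stanford.edu/~digel/lsst/run1_metadata.csv"
df = pd.read_csv(url)

# Drop visits whose phosim job was still executing when the file was made
# (hostname 'x', negative cputime, zero start/end times), and optionally
# also the runs that hit the 5-day execution limit.
complete = df[(df["hostname"] != "x") & (df["cputime"] > 0)]
finished = complete[complete["runlimit"] == 0]
print(len(df), "visits in total,", len(finished), "with a finished phosim run")
```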

drphilmarshall commented 8 years ago

Excellent! We are all set.


sethdigel commented 8 years ago

Phil's machine learning example notebook runs for me. This is sort of a Hello World plot showing some of the metadata for Run 1. So, yes, I think it will work.
[Figure run1_everything: pairwise plots of the Run 1 visit metadata]
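
(For reference, a quick pairwise look at the metadata could be done along these lines; this is an untested sketch rather than the code behind the plot above, and the column subset is arbitrary.)

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("run1_metadata.csv")
cols = ["moonalt", "moonphase", "altitude", "airmass", "cputime"]
pd.plotting.scatter_matrix(df[cols], figsize=(8, 8), alpha=0.4)
plt.show()
```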

drphilmarshall commented 8 years ago

Oh cool! :-) This is great, Seth. Let's spend some time this morning staring at the full plot, to get some feel for what is going on. This is of course not part of the traditional machine learning development flow but sod it, we're physicists. Then I think we can just step through the notebook, editing both the markdown and the python to train the KNN model and make some predictions :-)


rbiswas4 commented 8 years ago

Incidentally, whatever methods are being used here can also be used to study correlations of five sigma depths with other columns in OpSim. Obviously these are simulated, but it might be a fun project to do exactly the same thing with the OpSim outputs, with the interesting variables being fivesigmadepth or Seeing. At least that exercise would give me some insight into stuff that observers have a good intuition for already. I can get the OpSim output in the form of a dataframe, and I propose we combine with the group looking into observing strategy today.
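
Roughly what I have in mind, as an untested sketch (the database file name and column names are examples only, and they differ between OpSim releases):

```python
import sqlite3
import pandas as pd

# Build a DataFrame from the OpSim Summary table; "opsim.db" and the column
# names below are illustrative and vary between OpSim releases.
conn = sqlite3.connect("opsim.db")
opsim = pd.read_sql_query(
    "SELECT fiveSigmaDepth, finSeeing, airmass, filtSkyBrightness, moonAlt "
    "FROM Summary", conn)

# Simple linear correlations of depth with the other observing conditions.
print(opsim.corr()["fiveSigmaDepth"].sort_values())
```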

drphilmarshall commented 8 years ago

Good idea.

Seth, I would bet something expensive (your bike?) that Random Forest is going to work better than KNN on this problem, especially if we don't split the data by filter (and we don't have time for such sensible things today). But still, let's get KNN working first and then unplug it and replace it.
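
The swap itself is a one-liner; cross-validation along these lines (untested, same assumed features and file as before) would settle the bet.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv("run1_metadata.csv")
df = df[(df["runlimit"] == 0) & (df["cputime"] > 0)]
features = ["filter", "altitude", "rawseeing", "airmass",
            "moonalt", "moonphase", "dist2moon", "sunalt"]
X, y = df[features].values, np.log10(df["cputime"].values)

for model in (KNeighborsRegressor(n_neighbors=5),
              RandomForestRegressor(n_estimators=100, random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)      # 5-fold R^2 scores
    print(type(model).__name__, round(scores.mean(), 3))
```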


drphilmarshall commented 8 years ago

@sethdigel @humnaawan and @tmcclintock: nice work on Friday! Reproducing your main result here:

From this plot it looks to me as though you are able to predict CPU time to about +/- 0.2 dex (95% confidence, very roughly), no matter what the absolute CPU time is (although it would be good to be more precise about this). 0.2 dex corresponds to about 50% uncertainty, or ranges like "5000 to 15000 CPU hours." I think this could be useful ( @TomGlanzman can comment further ), and we didn't even get into extending the data with more OpSim or PhoSim parameters.

The next step could be to extract the ML parts of your notebook and repackage them into a PhoSimPredictor class, which could be trained and then pickled for use before every new PhoSim run to determine which queue to use. Again, we should be guided by Tom here. Let us know if you're interested in helping with this! And thanks for all your efforts on Friday - what a nice hack! :-)
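
To make the proposal concrete, here is a hedged sketch of what a PhoSimPredictor might look like (the interface, feature list, and pickling convention are all just suggestions to be settled with Tom):

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

class PhoSimPredictor:
    """Predict phosim CPU time (seconds) from visit metadata.

    Sketch only: trained once on the Run 1 metadata, pickled, and then
    loaded before each new PhoSim run to help choose a batch queue.
    """
    features = ["filter", "altitude", "rawseeing", "airmass",
                "moonalt", "moonphase", "dist2moon", "sunalt"]

    def __init__(self):
        self.model = RandomForestRegressor(n_estimators=100, random_state=0)

    def train(self, metadata):
        """Fit on a DataFrame with the columns of run1_metadata.csv."""
        ok = metadata[(metadata["runlimit"] == 0) & (metadata["cputime"] > 0)]
        self.model.fit(ok[self.features], np.log10(ok["cputime"]))
        return self

    def predict(self, visits):
        """Return predicted CPU times (seconds) for a DataFrame of visits."""
        return 10 ** self.model.predict(visits[self.features])

    def save(self, path="phosim_predictor.pkl"):
        with open(path, "wb") as f:
            pickle.dump(self, f)
```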

TomGlanzman commented 8 years ago

I hope @drphilmarshall 's quoted range of 5000 to 15000 CPU hours ... is really seconds. Or were you looking at integrated CPU hours for Twinkles-phoSim?

In terms of job scheduling, it will certainly be extremely useful to know the (approximate) CPU time ranges. However, unless and until we get some form of checkpointing running, one would also need the facility within the Pipeline to customize the job run time on a per-stream basis. Otherwise, we are in the same situation as now: send all jobs to the same, maximum time queue. And/or we arbitrarily make a cut so that any phoSim that requires > NN hours of CPU is simply not attempted.

cwwalter commented 8 years ago

I haven't had time to actually play with this yet but at NERSC they have:

http://slurm.schedmd.com/checkpoint_blcr.html

As far as I can tell this is done at a system/kernel level so you don't have to change the actual code.

TomGlanzman commented 8 years ago

I tried exercising checkpointing a couple of years ago at NERSC but wound up frustrated because carver did not support this feature. Now, with both a new architecture (cori) and batch system (slurm) it is probably worth trying again...

tony-johnson commented 8 years ago

> However, unless and until we get some form of checkpointing running, one would also need the facility within the Pipeline to customize the job run time on a per-stream basis.

@TomGlanzman, this feature has been built into the workflow engine since day one, and is used extensively in the EXO data processing. There should be no problem using this with your phosim task.

cwwalter commented 8 years ago

@tony-johnson are you referring to BLCR? Do you have any simple examples of how to use it in a slurm file?

tony-johnson commented 8 years ago

@cwwalter no sorry, I was referring to Tom Glanzman's post above yours concerning using the CPU time estimate to set the time required for a batch job. (GitHub needs to add threaded conversations).

BLCR does look quite interesting; I read the FAQs linked from the page you referenced above, and it certainly seems as if it might be usable.

jchiang87 commented 8 years ago

This seems to have been addressed by the hack day project at #177 and #178. I'll open a new issue for Phil's proposed PhoSimPredictor class.