SimonKrughoff closed this issue 8 years ago.
See logscraper and scraped CSV output here:
@sethdigel we've scraped some metadata from the logs. It would be really interesting to associate this with the input metadata.
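For reference, a minimal sketch of that kind of scraping, assuming one log file per visit and simple `key: value` lines; the paths, the filename pattern, and the CPU-time pattern are placeholders for illustration, not the actual logscraper (the filter line is quoted from the phosim logs discussed later in this thread).

```python
import csv
import glob
import re

# Assumed layout: one phosim log per visit, named like logs/visit_<obshistid>.log.
PATTERNS = {
    "filter": re.compile(r"Filter \(number starting with 0\):\s*(\d)"),
    "cpu_seconds": re.compile(r"CPU time:\s*([\d.]+)"),  # hypothetical log line
}

rows = []
for path in glob.glob("logs/visit_*.log"):
    visit = re.search(r"visit_(\d+)\.log$", path).group(1)
    text = open(path).read()
    row = {"visit": visit}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    rows.append(row)

with open("scraped_metadata.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=["visit", "filter", "cpu_seconds"])
    writer.writeheader()
    writer.writerows(rows)
```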
I combined @sethdigel's file and @brianv0's file and posted the result here: https://gist.github.com/tony-johnson/66345752cf7ec0cc3ffa
So: our registration is not good. It looks like the visit images in the different filters are being registered separately, to a set of 6 coadd images that each look sharp, but that don't match each other. Is this a bug or a feature, @SimonKrughoff?
Definitely a bug. I'm actually baffled how it's possible since we only have one reference catalog and it is essentially perfect.
I'd like to find the time to plot the reference catalogs on each of the coadds to see who is right and who is wrong. Maybe they are all wrong in different ways. It's very strange that the scatter reported by the astrometric solver is ~100mas. That's half a pixel which, in my experience, is huge. We can get ~40mas on CFHT using SDSS as the reference.
@SimonKrughoff I would really like to help on this ... it would be great to discuss and put down a list of tests you think we should be doing.
Probably it is not relevant, but I see that in the combined csv file by @tony-johnson the visit designators agree between his log files and the log files of the phosim runs (the relevant columns are 'visit' and 'obshistid'), but the filter designators are completely different. For the phosim runs the filter designator runs 0-5 for ugrizy. For example, for the first entry, Tony's log file indicates that the band is r, while the phosim run says the band was u.
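A check along these lines could be scripted; the column names ('obshistid', 'filter_dm', 'filter_phosim') are guesses at how the combined csv is laid out, and both filter columns are assumed to use the 0-5 = ugrizy encoding.

```python
import pandas as pd

bands = "ugrizy"
df = pd.read_csv("combined.csv")  # placeholder filename

# Assumed columns: one filter code from the DM/workflow logs and one from the
# phosim-run metadata; translate both to band letters and compare.
df["band_dm"] = df["filter_dm"].astype(int).map(lambda i: bands[i])
df["band_phosim"] = df["filter_phosim"].astype(int).map(lambda i: bands[i])

mismatched = df[df["band_dm"] != df["band_phosim"]]
print(f"{len(mismatched)} of {len(df)} visits disagree on the filter")
print(mismatched[["obshistid", "band_dm", "band_phosim"]].head())
```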
@sethdigel Can you please point to that combined csv file, either on the Data Catalog or at NERSC? Even if this is not the cause of the problem, it would be something that we should fix. I looked at the OpSim file and the phosim input files (instance catalogs) for a couple of obsHistIDs and found those to be consistent.
Well spotted, this does seem to be symptomatic of some sort of confusion, although I am not sure how much it can explain. The first filter column indicates which filter was used during the DM processing part of the workflow. This was obtained from the output file names generated by phosim, for example:
lsst_e_200_f2_R22_S11_E000.fits.gz (f=2=r)
lsst_e_220_f1_R22_S11_E000.fits.gz (f=1=g)
...
I checked the log from the jobs that run phosim; for example, for visit 200 the log contains "Filter (number starting with 0): 2", which appears to be consistent.
This appears to disagree with the second filter column, which came from @sethdigel's file (http://www.slac.stanford.edu/~digel/lsst/run1_metadata.csv). I am not sure what the source of that data was; Seth, can you explain how you made your file? In that file the filter for run 200 is listed as 0 (u). @TomGlanzman might also be interested in this, and can hopefully explain how the phosim jobs determine which filter is used for each visit.
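For cross-checking, the visit and filter can be pulled straight back out of the eimage filenames; the regex below is built from the example filenames quoted above.

```python
import re

BANDS = "ugrizy"  # phosim filter index 0-5
EIMAGE_RE = re.compile(r"lsst_e_(\d+)_f(\d)_R(\d\d)_S(\d\d)_E(\d{3})\.fits(\.gz)?$")

def visit_and_band(filename):
    """Return (visit, band letter) parsed from a phosim eimage filename."""
    match = EIMAGE_RE.search(filename)
    if match is None:
        raise ValueError(f"not an eimage filename: {filename}")
    return int(match.group(1)), BANDS[int(match.group(2))]

print(visit_and_band("lsst_e_200_f2_R22_S11_E000.fits.gz"))  # (200, 'r')
print(visit_and_band("lsst_e_220_f1_R22_S11_E000.fits.gz"))  # (220, 'g')
```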
@rbiswas4 The combined file is here: https://gist.github.com/tony-johnson/66345752cf7ec0cc3ffa
Thanks! I just looked into this and there are a few more inconsistencies. Let us take the line corresponding to obsHistID 220:
The expMJD value does not match that of the 220 pointing: OpSim gives 59580.135 for obsHistID 220, while the csv gives 61365.152. The closest obsHistID with that expMJD is 1212062, which has expMJD = 61365.152869 and is a y-band observation with different values of variables like rotskypos.
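For the record, the "closest obsHistID" lookup is easy to redo with pandas against a dump of the OpSim Summary table; the csv filename here is a placeholder.

```python
import pandas as pd

# 'obsHistID', 'expMJD', 'filter', and 'rotSkyPos' are standard OpSim Summary
# table columns; "opsim_summary.csv" stands in for however the table was dumped.
opsim = pd.read_csv("opsim_summary.csv")

target_expmjd = 61365.152  # the value found in the combined csv for "visit 220"
closest = opsim.loc[(opsim["expMJD"] - target_expmjd).abs().idxmin()]
print(closest[["obsHistID", "expMJD", "filter", "rotSkyPos"]])
```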
I made my file by parsing the log files from the Run 1 phosim runs and the input instance catalogs for those runs. In combining the metadata for each run, I made an indexing error in writing the obshistid (Opsim visit number) to the output file. I made a separate indexing error for rotskypos. I should have caught the error in the former - it made the obshistid column non-monotonically increasing - but I confused myself into thinking that actually made sense. Sorry about that.
I've put the corrected file here: http://www.slac.stanford.edu/~digel/lsst/run1_metadata_v3.csv (v2 was a version that we came up with on Friday that changed the hostname designators from character strings to integers for easier handling.) The filter designators, dates, etc., now match the instance catalog information for the corresponding obshistids.
The error did not affect any of the plots that I made (because they do not depend on the obshistid), and did not affect the machine learning study of CPU times, because they likewise did not use obshistid, and rotskypos was not relevant either. But I'll need to ask @brianv0 to please remake his combined file.
Updated combined.csv file is here:
O.K. These aren't very specific, but I think we should try to pin it down to input catalogs, simulation, or astrometric registration. I suspect it is the latter, but maybe I'm fooling myself.
Hi, I just looked at the new file and I notice a few things that are still different:
I just looked at the values corresponding to obsHistID 220:
>>> df.ix[220, 'expMJD']
59580.135414999997
According to the csv file it is 59580.137, which is only about 100 seconds off. I checked the phosim instance catalog and the values there match OpSim. Would this induce a tracking-like error if phosim thinks the time is slightly different?
Note: there are differences in the raw numbers for rotskypos and altitude, but @danielsf pointed out that this is due to units (phosim uses degrees and OpSim uses radians); after conversion they match nicely.
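For completeness, the unit check is just the following (the numerical value is made up purely for illustration):

```python
import numpy as np

opsim_rotskypos = 1.282726                       # OpSim value, in radians (made-up example)
phosim_rotskypos = np.degrees(opsim_rotskypos)   # phosim quotes degrees

print(phosim_rotskypos)
print(np.isclose(np.radians(phosim_rotskypos), opsim_rotskypos))  # True
```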
I'm afraid that the difference is due to my treating expmjd as a floating point quantity rather than double precision. In the instance catalog for this run Opsim_expmjd is defined as 59580.1354, which I see got rounded to 59580.137 in my csv file. (I did not have astrometry in mind when I made the file.) I can re-run it with double precision, but from the instance catalog the precision would seem to be limited to 1e-4 day, i.e., about 10 s. Presumably the Opsim database has more precise times, but on the other hand, I guess that the times in the instance catalogs represent the actual (simulated) time of the exposure.
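The single- versus double-precision difference for this MJD is easy to reproduce with numpy:

```python
import numpy as np

expmjd = 59580.135415                   # double-precision value from the instance catalog
as_float32 = float(np.float32(expmjd))  # what survives a single-precision round trip

print(as_float32)            # 59580.13671875
print(round(expmjd, 3))      # 59580.135
print(round(as_float32, 3))  # 59580.137 -- the value that ended up in the csv
```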
In case it is useful, I went back through the instance catalogs and run logs, this time treating all the phosim parameters from the instance catalogs as double precision and being careful to keep all of the precision in the various quantities. I've made a new version of the csv file showing all of the quantities to full precision, including an extra decimal place (which is always zero) to show that the precision of the instance catalog quantities has been preserved. The file is here: http://www.slac.stanford.edu/~digel/lsst/run1_metadata_v4.csv
OK this fixes the problem we were discussing!
@tony-johnson where did the new coadds end up on SLAC machines? I want to try making new color images with the saturation fix in.
Hi, the output of the jobs we ran last Friday is on bullet0002, in:
/lustre/ki/pfs/fermi_scratch/Twinkles/2
however, looking at the jobs, only two of the coadd jobs actually finished successfully (r and z); the others ran out of time and were killed by the batch system. If you want me to try re-running them with more CPU time and/or fixed code, let me know.
I see. Can we rerun them with more time? I don't think I made any changes that should change run time.
Hi Simon, 5 out of 6 of the coadd jobs have now run (the g filter did not run due to a division-by-zero error in one of the processEimage jobs). The coadd jobs took a surprisingly long time to run:
Filter | CPU (s) | Wallclock |
---|---|---|
u | 6978.17 | 6.5 hours |
g | - | - |
r | 2869.88 | 1 hour |
i | 3554.16 | 15.5 hours |
z | 3875.95 | 1 hour |
y | 4566.01 | 19 hours |
Jobs are here:
Output is here: bullet0002.slac.stanford.edu:/lustre/ki/pfs/fermi_scratch/Twinkles/2
The CPU time per co-added image ranged from 22.3 s for the y band to 27.7 s for the u band. I have no idea whether that is a surprisingly long time.
For possible future reference, it does look like the coaddition jobs did not use all of the available Run 1 images; from a quick look my impression is that the lists of visit numbers were made before all of the phosim jobs had finished. The table below lists the number of Run 1 images by filter (tallied from the phosim jobs that completed) and the numbers of images listed in the log files for the coadd jobs.
Filter | Run 1 images | Number in the coadd job |
---|---|---|
u | 254 | 252 |
g | 134 | - |
r | 163 | 126 |
i | 186 | 141 |
z | 195 | 174 |
y | 192 | 180 |
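A sketch of that bookkeeping check, for possible future use; it assumes you can assemble, per band, the set of visits phosim completed (e.g. from the eimage filenames) and the list of visits each coadd job was given (e.g. from its log) — neither is anything the workflow writes out in this form today.

```python
def report_missing_visits(run1_visits, coadd_visits):
    """Compare per-band visit sets.

    run1_visits, coadd_visits: dicts mapping band letter -> iterable of visit
    numbers, however they were collected (filenames, coadd-job logs, ...).
    """
    for band in "ugrizy":
        produced = set(run1_visits.get(band, ()))
        used = set(coadd_visits.get(band, ()))
        missing = sorted(produced - used)
        print(f"{band}: {len(produced)} produced, {len(used)} in the coadd job, "
              f"{len(missing)} missing {missing[:5]}")

# Toy example with made-up visit numbers:
report_missing_visits({"r": {200, 220, 240}}, {"r": [200, 240]})
```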
Hi Seth, I only reran the jobs in my original 985-visit run, which included the phosim jobs that had finished at the time I started and had not crashed in processEimage (as of when I started; there has been one additional crash since we upgraded the DM installation).
@SimonKrughoff Can you please report on this issue at the Twinkles meeting tomorrow? The agenda is linked from #183. Thanks!
This issue is starting to look like a real roadblock! Can you give us an update in the Twinkles weekly meeting tomorrow, please @SimonKrughoff ? Thanks!
@tony-johnson I'm trying to look at these coadds again, but I only see i, u, and y. And when I look closer, there are no files in those directories. Did these move?
I'm looking in: /lustre/ki/pfs/fermi_scratch/Twinkles/2 on bullet.
Tony's in Banff this week - maybe someone else can help Simon out?
Unfortunately, files in that scratch area are automatically deleted after 7 days. I don't know if they had been moved elsewhere before that happened.
Hi, I did not realize that those files would be deleted, and the most recent runs were not copied off of the scratch area. I can recreate them (but it will take a day or so). Should I rerun exactly the same as before?
Tony
@tony-johnson yes please re-run them. Thanks!
So.... Here's where I am. I have g, i, and y coadds from run 1. I over-plot the reference sources in green on the coadds using the coadd WCS and get the following, in the order g, i, y. As you can see, none of them agree with the reference, and none of them disagree in the same way.
I don't know why this is yet. I need to go back and see when the two begin to disagree.
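For anyone who wants to redo this overlay outside the stack, here is a rough astropy/matplotlib sketch; the coadd filename, the HDU that holds the image, the display scaling, and the reference-catalog columns ('ra', 'dec' in degrees) are all assumptions.

```python
import matplotlib.pyplot as plt
from astropy.io import fits
from astropy.wcs import WCS

coadd = fits.open("deepCoadd_g.fits")          # placeholder filename
image = coadd[1]                               # assumed: image in extension 1
wcs = WCS(image.header)

ref = fits.getdata("reference_catalog.fits")   # placeholder reference catalog
x, y = wcs.all_world2pix(ref["ra"], ref["dec"], 0)

plt.imshow(image.data, origin="lower", cmap="gray", vmin=0, vmax=50)
plt.scatter(x, y, s=20, facecolors="none", edgecolors="green")
plt.savefig("coadd_with_refs.png")
```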
@tony-johnson We talked about re-running things and we think it's probably better to hold off for a bit.
I think I may have a handle on the astrometric issues, and if we wait until next week, we may have a chance to fix some bugs as well.
The i, u, y files were written on April 11 and so nominally will be deleted on the 18th.
From today's notes:
"Looks like we are using too many faint, undetected objects to register the CalExps - proposed solution is to only use bright objects, preferably in configuration rather than by making a separate reference catalog. Action: Simon to check on this and update the cookbook appropriately."
Let us know what you find out, @SimonKrughoff ! Here's hoping it's a simple configuration change.
@tony-johnson @drphilmarshall It turns out I was wrong. I don't believe this is due to fragility in the matching. When looking at all the CoaddTempExp files, they all register well when matched on WCS.
It turns out that the WCS in the coadd is different from the CoaddTempExp files. This can't really happen in the normal processing flow. I think the only way this is possible is if the makeDiscreteSkymap.py step is run a second time. It should only be run once to produce a master skymap. It should be run when all input data are available, but before any coaddTempExps have been generated.
Unfortunately, @drphilmarshall I don't think this will help us out with the diffim template issue.
That's OK - one thing at a time! :-)
Let's fix the coadd WCS then. Are you saying that the cookbook is correct but the workflow is not, or that you need to edit the cookbook?
I think it is simply a transcription error when translating the cookbook to a parallel system. I'll have a look.
Ah, here it is. I think this line (https://github.com/DarkEnergyScienceCollaboration/Twinkles/blob/7f249b8c0fae8ed32d2364ae615dc7b9ad14eff5/workflows/DM/processEimage#L4) needs to go here (https://github.com/DarkEnergyScienceCollaboration/Twinkles/blob/7f249b8c0fae8ed32d2364ae615dc7b9ad14eff5/workflows/DM/assembleCoadd#L1), and that will solve the problem.
@tony-johnson if you do a short run with this change, I should be able to tell quickly if things are ship shape.
Edit: We will need to change the line in assembleCoadd to be all the visits we have available. This should be doable by just specifying the --id flag with no argument.
I will also say that we should probably remove all the --clobber-config options to the tasks. That is fine for debugging, but if the config is changing, we should alert on that.
Bonza! @tony-johnson, we could be almost back on track!
OK, I just reran the original yesterday, using GPFS as the file system, and it finished in <24 hours. The output is here:
/gpfs/slac/kipac/fs1/g/desc/Twinkles/302
if anyone is interested. Will rerun again with the fix from Simon.
The data has been reprocessed with Simon's suggested change, and at SLAC the output can be found here:
/nfs/slac/kipac/fs1/g/desc/Twinkles/303
The final coadd job is still running, but the six single-filter coadds are complete.
@tony-johnson can you remind me where the stack is at SLAC?
O.K. Somehow that made things worse... I'll keep looking.
From looking at the cookbook and the comments above, my understanding is that makeDiscreteSkyMap.py should be run exactly once on all of the visits. The change made here now has it being run once per filter. To run it exactly once, I think the XML part of the workflow needs to be amended and a new bash script executed that just does this task.
@jchiang87 Thanks for pointing to that commit, I couldn't find it initially.
You are exactly right. I wasn't clear before, but makeDiscreteSkymap.py needs to be run exactly once per data run, and it needs to be run after all the images have been run through processEimage.py (i.e. after all the calexp files we expect to use in the coadd exist on disk).
Edit: you say it is run once per filter, but the variable name makes it sound like it is being run once per visit. But I don't understand the variables and the workflow like you do.
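To spell out the intended ordering, here is a rough sketch (Python driving the command-line tasks); the repository paths, the visit list, the id selectors, and the exact flags are assumptions for illustration, not the real XML/bash pipeline.

```python
import subprocess

REPO = "input_repo"       # placeholder input repository
OUT = "output_repo"       # placeholder output repository
visits = [200, 220, 240]  # every visit in the data run, all bands

# 1. Run processEimage.py on every visit first, so all calexps exist.
for visit in visits:
    subprocess.check_call(
        ["processEimage.py", REPO, "--id", f"visit={visit}", "--output", OUT])

# 2. Only then build the single master skymap, exactly once for the whole run
#    (--id with no argument, as suggested above, selects everything).
subprocess.check_call(["makeDiscreteSkyMap.py", OUT, "--id", "--output", OUT])

# 3. makeCoaddTempExp.py / assembleCoadd.py follow, all using that one skymap.
```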
OK, then as Jim says I will need to modify the workflow to handle that. I should be able to do that today and try to rerun it overnight.
Tony
@tony-johnson do you want to touch base before you trigger another big run to make sure we've got everything in place? A small run would also help. If any of the WCSs in the coaddTempExp files differ from any of the WCSs in the coadds, we are going to run into this problem.
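For a quick sanity check after a small run, something like the following (using astropy rather than the stack's own classes; the filenames and the HDU holding the WCS are assumptions) would flag any coaddTempExp whose WCS differs from the coadd's:

```python
import glob

import numpy as np
from astropy.io import fits
from astropy.wcs import WCS

coadd_wcs = WCS(fits.getheader("deepCoadd_g.fits", ext=1))   # placeholder path

for path in sorted(glob.glob("coaddTempExp/*.fits")):        # placeholder pattern
    temp_wcs = WCS(fits.getheader(path, ext=1))
    same = (np.allclose(temp_wcs.wcs.crval, coadd_wcs.wcs.crval) and
            np.allclose(temp_wcs.pixel_scale_matrix, coadd_wcs.pixel_scale_matrix))
    print(path, "OK" if same else "WCS MISMATCH")
```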
The version of the Stack used for Run1 is available at
/nfs/farm/g/desc/u1/LSST_Stack_2016-02-23
to use it from a rhel6-64 machine, just execute
/nfs/farm/g/desc/u1/LSST_Stack_2016-02-23/run_obs_lsstSim_setup
and it will create a bash shell with devtoolset-3 and needed packages.
There is a new Stack installed using the recent conda distribution of lsst-apps and lsst-sims (the same as made recently available at NERSC (#210)). It can be set up by doing
/nfs/farm/g/desc/u1/LSST_Stack_2016-04-12/run_setup
This has the most recent tickets/DM-4302 branch built and set up. I've done some tests on it, e.g., running all of the cookbook scripts, and they all run, though not without errors, so it is worth trying this install out on a consistent dataset.
Hi, yes, it might be worth having a brief hangout/BlueJeans meeting before re-running. Would 4pm this afternoon work? I can run on a smaller set of visits to save time; what would be a reasonable number of visits to be useful?
Tony
4PM works for me. If you run 10 visits in each band, I think that would be enough.
The coadds using the thousand-visit run data show serious problems with the astrometric solutions. None of the three bands agree with each other...
We need to figure this out. @brianv0 and @tony-johnson are working on a way to scrape the logs so we can correlate log output with other parameters.