LSSTDESC / SSim_DC1

Configuration, production, validation specifications and tools for the DC1 Data Set.

DC1 Chips To Simulate #29

Closed humnaawan closed 7 years ago

humnaawan commented 7 years ago

@cwwalter, @danielsf There appears to be an important but subtle bug in the chip-finding algorithm, such that there should be more chips to simulate than those in the current pickle output. The symptom that led @egawiser and me to discover it is the bimodality in the chip count for the dithered survey:

[screenshot: bimodal histogram of chip counts per visit for the dithered survey]

I checked to make sure that the dithering was implemented correctly: the dithered pointings do move enough that more than half of the neighboring chips should be coming into the proposed DC1 region — something that the gap in the bimodal histogram does not reflect.

After quite a bit of debugging (many thanks to @danielsf for trying an independent method to confirm our concerns), it appears that the slicer interface is not working properly: specifically, the routine slicer.getSimData, which is supposed to give all the instances of observations in a given data slice (i.e., a HEALPix pixel). I don't fully understand why the problem is arising.

While I investigate this apparent source of the problem further, I am running a workaround that bypasses that slicer-dependent interface with simdata. We don't think the current chips-to-simulate list has extraneous chips; it is simply missing some. We suspect an increase of 30-50% in the chip count and should know the exact number once the new run finishes in ~24 hours.
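As a toy illustration of the sanity check described above (this is not the actual DC1 code, and the chip size is an assumed, approximate value): if a dither moves a pointing by more than roughly one chip width, neighboring chips should enter the region, so the chip-count histogram should not show an unpopulated gap.

```python
import numpy as np

# Assumed approximate LSST sensor footprint on the sky (~13.7 arcmin).
CHIP_SIZE_DEG = 0.23

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula, inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    a = (np.sin((dec2 - dec1) / 2) ** 2
         + np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2) ** 2)
    return np.degrees(2 * np.arcsin(np.sqrt(a)))

# A dither of a few tenths of a degree exceeds one chip width:
print(angular_sep_deg(10.0, -30.0, 10.3, -30.1) > CHIP_SIZE_DEG)  # True
```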

It is unclear to us what the best way forward is in terms of how/when to run the additional chips through PhoSim. If the order of PhoSim runs is unimportant, we could allow the current set to finish while investigating this further, and then submit additional jobs for the chips in the new pickle list that are not in the initial one. We apologize for not identifying and fixing this problem sooner!

cwwalter commented 7 years ago

I believe we can add the new visits to the PhoSim list.

@TomGlanzman Can you confirm?

I believe this won't affect the undithered imSim runs that are just starting since we are using the visit coordinates directly.

@jchiang87 Can you confirm?

Many people are away this week for hack week, the ops review, etc. But let's meet anyway to discuss this. @humnaawan will have a presentation, and hopefully solutions, by Thursday.

jchiang87 commented 7 years ago

> I believe this won't affect the undithered imSim runs that are just starting since we are using the visit coordinates directly.
>
> @jchiang87 Can you confirm?

Yes, confirmed.

cwwalter commented 7 years ago

@humnaawan This means we will conceivably have twice as many visits to process in total?

humnaawan commented 7 years ago

> @humnaawan This means we will conceivably have twice as many visits to process in total?

Yes, although we think 50% is the upper limit on the number of missing chips.

TomGlanzman commented 7 years ago

@cwwalter If I understand correctly, existing visits may/will have additional chips to simulate -- rather than new visits? If the former, then, offhand, I do not see an easy way to incorporate the new chips...but will think on it. If the latter, then, yes, one may append new visits to the existing pickle list. Another possibility is to duplicate an old visit with a new visit which contains only the chips missed the first time around. But I would need to check that such a scheme would work.

If it is decided that the current phoSim running should be halted for any reason, please let me know asap.

sethdigel commented 7 years ago

For what it is worth, in the DC1-phoSim-3 task, trimcat files are generated for every sensor in the focal plane for each visit, not just for those in Humna's pickle file. They are not a major driver of CPU time, but in principle they would already be available.

TomGlanzman commented 7 years ago

@sethdigel You are correct: the reason I left all the trim files in was in case some questions arose regarding the instanceCat generation or the trim itself. In the future, I will probably trim the trims just to save disk space. But, as you say, the amount of time required to re-create them is trivial compared with the raytrace step.

cwwalter commented 7 years ago

OK @sethdigel and @TomGlanzman: for Thursday, could you two first discuss amongst yourselves the feasibility of hacking something up to add sensors to existing visits based on the trim files, and then report to us?

Unless you think that things are going to be a lot different than before, I think we would lose a month by starting over again, right?

TomGlanzman commented 7 years ago

@cwwalter Hard to say how much time we might lose by starting over, given the imponderables of the NERSC/SLURM dispatch algorithm. If we stopped right now, we would likely lose ~2 weeks. On the other hand, we have chewed up a noticeable fraction of our yearly CPU allocation, so starting over might mean we could not finish in a reasonable time frame. My vote is to kludge something so as to keep what we have.

TomGlanzman commented 7 years ago

@humnaawan Do you have a new pickle file prepared with the updated sensor list? If so, I would like to consider putting it into production. The workflow has read and configured exactly the first 1000 visits, so this might be a good time to make the switch, and then figure out how to correct the existing visits.

humnaawan commented 7 years ago

@TomGlanzman @cwwalter @egawiser I just pushed the updated pickle file to the repo. Note that the pickle file is named a little differently (it has an nside tag and notes that it contains only non-WF sensors) and has an additional key, pixNum, which gives the triggered pixel numbers for tracking/debugging purposes.
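For reference, the per-visit data can be inspected with a few lines of Python. The key names ('obsHistID', 'chipNames') are taken from this thread, and the values below are toy stand-ins, not the real DC1 lists:

```python
import pickle

# Toy stand-in for the visit pickle discussed in this thread: parallel
# per-visit lists keyed by 'obsHistID' and 'chipNames' (names from the thread).
data = {
    'obsHistID': [1000, 1001, 1002],
    'chipNames': [['R:2,2 S:1,1'], ['R:2,2 S:1,1', 'R:2,3 S:0,1'], []],
}

# Round-trip through pickle, as the production workflow does with the file.
loaded = pickle.loads(pickle.dumps(data))

total_chips = sum(len(chips) for chips in loaded['chipNames'])
print('visits:', len(loaded['obsHistID']), 'chips:', total_chips)
```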

As opposed to the 123,975 science sensors before, we now have 184,658 sensors. Contrary to what I had thought, there were 4,969 sensors in the previous list that are not in the new list:

[screenshot: old vs. new sensor-list comparison]

We checked and found these extraneous sensors to be either at the edges of the DC1 region or at the edges of the FOVs: a combination of the bug in the HEALPix slicer call and the lower resolution (Nside=512). I have plotted a subsample of these sensors; see output 33 here.

We also checked how the resolution changes things, comparing Nside 512 result with Nside 1024:

[screenshot: sensor-count comparison between Nside=512 and Nside=1024]

We see a slight increase in the total sensor count (by 32) with the higher resolution. Some examples of these are also plotted in the notebook referenced above.
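The resolution effect can be quantified from the HEALPix pixel area alone; a back-of-the-envelope sketch (healpy's `nside2resol` gives the same numbers):

```python
import math

def nside2resol_arcmin(nside):
    """Approximate HEALPix pixel scale, sqrt(pixel area), in arcminutes."""
    npix = 12 * nside ** 2          # total HEALPix pixels on the sphere
    area_sr = 4 * math.pi / npix    # pixel area in steradians
    return math.degrees(math.sqrt(area_sr)) * 60

# Nside=1024 pixels are half the linear size of Nside=512 pixels (~3.4'
# vs ~6.9'), so the finer grid can trigger a few extra chips at region
# and FOV edges.
print(round(nside2resol_arcmin(512), 2), round(nside2resol_arcmin(1024), 2))
```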

Please let me know if there are any concerns. I will try to construct a metric to ensure that everything's ok now. Sorry for all the trouble!

TomGlanzman commented 7 years ago

@humnaawan Thank you for the updated visit pickle file. I have a couple of questions.

  1. The new pickle file is 22MB in size, whereas the old one was 2 MB. There is a new element in the latest pickle file, "pixNum". Is this to be expected?

  2. The old pickle file contains 2395 visits whereas the new one contains only 2245 visits. There are 156 "old" visits missing from the new pickle, and 6 new visits have appeared. This is a significant issue for the current workflow -- assuming we want to perform an 'update in place'. Could you confirm that this is an expected result of your new slicing/selection algorithm?

I should add that to make the workflow update work smoothly, I will need a revised pickle file that contains exactly the same number of visits, exactly the same "obsHistID"s, a revised list of "chipNames", and they must be sorted into the exact same order in the pickle file.
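A hypothetical sketch of the consistency check these requirements imply (the helper name and numbers are illustrative, not from the workflow):

```python
# Check that a revised visit list is a valid "update in place" of the old
# one: same visit count, same obsHistIDs, in exactly the same order.
def check_update_in_place(old_ids, new_ids):
    return len(old_ids) == len(new_ids) and all(
        o == n for o, n in zip(old_ids, new_ids))

old_ids = [1000, 1001, 1002]
print(check_update_in_place(old_ids, [1000, 1001, 1002]))  # True
print(check_update_in_place(old_ids, [1001, 1000, 1002]))  # False: reordered
print(check_update_in_place(old_ids, [1000, 1001]))        # False: visit dropped
```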

humnaawan commented 7 years ago

@TomGlanzman The "pixNum" key is expected. As I mentioned above, I needed to track the pixel numbers to see which chips were being triggered. I resaved the pickle without the pixNum array, and the file size is back down to the original 2 MB. If you prefer, you can use the latter; it is available here.

As for the visit difference, I am afraid those are the numbers I am getting based on the pickle file. Looking further at the missing visits, I find that the 156 visits from the old pickle added only 295 chips (10 visits don't add any science detectors), while the 6 new visits add only 15 chips (1 visit doesn't add any science detector). You can see the missing visits/chips that I am seeing here.

(quick check: aside from the slicer bug, the increased grid resolution is also removing some visits that were in the older file. The debugged-but-same-resolution-as-before gives 2277 total visits and therefore the older file has 123 extra visits.)

I could produce a pickle list that contains the old visits that are not in the new list, now corresponding to no chipNames. I am unsure what can be done with the additional six visits that are in the new pickle if we cannot increase the number of visits or change the histIDs. Please let me know what you think.
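The visit bookkeeping above amounts to set differences on the obsHistIDs; a toy sketch (the real lists have 2395 and 2245 entries):

```python
# Toy obsHistID sets standing in for the old and new pickle files.
old_visits = {10, 11, 12, 13}
new_visits = {11, 12, 14}

only_old = sorted(old_visits - new_visits)  # visits dropped by the fix
only_new = sorted(new_visits - old_visits)  # visits newly triggered
print(only_old, only_new)  # [10, 13] [14]
```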

egawiser commented 7 years ago

Just to clarify, since the visits that are being dropped or added are peripheral to the DC1 region and involve few chips, they aren't important for DC1 data quality. Hence it would be fine to skip the 6 new-only visits entirely and to reduce the 156 old-only visits to 0 or 1 chips simulated (as Humna suggested), if that fixes the workflow issue. @TomGlanzman, is it ok for Humna's pickle file to include a visit with 0 chips to be simulated? If not, she can just pick the first old simulated chip and repeat that in the new pickle file.

johnrpeterson commented 7 years ago

I'm confused by this thread. Why not just simulate all the chips? If a chip is empty, then PhoSim will skip it.

john


cwwalter commented 7 years ago

The chips aren't empty, but we are restricting ourselves to the sensors where we reach the required depth after dithering (for CPU time reasons).

johnrpeterson commented 7 years ago

ok, but then just remove sectors of the base catalog where you aren’t going to get enough depth (i.e. work in ra/dec space rather than chip space)?


TomGlanzman commented 7 years ago

@humnaawan and @egawiser : Please see the edit I made to my last post regarding the content and order of the revised pickle file. Visits with no science sensors are fine (and we already have some examples). With regard to the six new visits, if they are appended to the end of the original 2395, that would be fine.

The workflow assigns a fixed, indexed processing "slot" for each visit. The slot-visit correspondence must be retained for my quick-and-dirty overlay fix to work, thus the request that order be preserved in the pickle.

So an acceptable pickle could contain 2395+6 visits, where the original 2395 appear in the same order as previously, and each visit will have a revised list of chipnames to simulate. The list of chipnames may either be a complete list for the visit, or just those added since the original list (but I will need to know which, as it will determine how the workflow will adjust). Any sensors already simulated but which do not appear in the new pickle will be retained - and should do no harm.
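One way to build such a pickle is sketched below. `merge_visit_lists` is a hypothetical helper and the data are toy values, but it preserves the original visit order, zeroes out the chip lists of dropped visits, and appends the genuinely new visits at the end, as requested above:

```python
def merge_visit_lists(old_order, new_chips):
    """old_order: original obsHistID sequence; new_chips: id -> chip list."""
    old_set = set(old_order)
    merged_ids = list(old_order)
    # Original visits keep their slots; dropped visits get an empty chip list.
    merged_chips = [new_chips.get(v, []) for v in old_order]
    # Genuinely new visits are appended after the original block.
    for v, chips in new_chips.items():
        if v not in old_set:
            merged_ids.append(v)
            merged_chips.append(chips)
    return merged_ids, merged_chips

ids, chips = merge_visit_lists([1, 2, 3], {2: ['c1'], 3: ['c2', 'c3'], 9: ['c4']})
print(ids)    # [1, 2, 3, 9]
print(chips)  # [[], ['c1'], ['c2', 'c3'], ['c4']]
```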

cwwalter commented 7 years ago

> ok, but then just remove sectors of the base catalog where you aren’t going to get enough depth (i.e. work in ra/dec space rather than chip space)?

I'll let the people like Tom and Jim doing the processing answer definitively but as I recall the issue is related to not having to generate different possible catalogs for each dithered pointing.

humnaawan commented 7 years ago

@TomGlanzman I have reformatted the new pickle data such that the histIDs follow the order from the old pickle file, with the extra visits appended at the end. Please see the file here.

I have checked, and with this new file we have a total of 2401 visits. There are no visits from the old file that are not in the new one, and we have the 6 extra visits. You can see the updated notebook here.

Please let me know if there are any problems.

johnrpeterson commented 7 years ago

Normally, you'd make a different catalog for each pointing anyway, due to variable and/or moving objects, but even if you didn't, you can just shift the pointing position for the dither. What I am saying is: just remove the regions of the sky (i.e., in RA/Dec catalog space) that you don't want. It should be really easy, so there has to be something wrong being done here.

john


egawiser commented 7 years ago

@johnrpeterson, this strikes me as simply a difference of approach, but your advice would be very helpful in figuring out the best approach going forwards. If we just used the ra,dec region covered by each dithered FOV to make its instance catalog, wouldn't we lose the ability to simulate ghosts etc. from bright stars outside the FOV? (Admittedly, that's probably not crucial for DC1, since we're setting all stars brighter than 10th mag to mag=10 in order to save CPU time.) Is there a harm to saving further CPU time by not simulating sensors covered by a dithered pointing that fall outside our chosen survey region? If we culled the instance catalog as you suggested, simulating those sensors would still populate them with photons from the moon etc., so I would naively think it could use significant CPU time. And then we'd toss those sensors out before running the DM stack to reduce the "data". I'm quite new to both PhoSim and the DM stack so may be ignorant of some relevant issues and features.

johnrpeterson commented 7 years ago

> @johnrpeterson, this strikes me as simply a difference of approach, but your advice would be very helpful in figuring out the best approach going forwards.

Yes, but I just mean that you are already figuring out which regions in RA/Dec space you don't want; it just adds an extra step, with many possible errors, to then figure out which chips that corresponds to. It's ten times easier to just let PhoSim do that step.

> If we just used the ra,dec region covered by each dithered FOV to make its instance catalog, wouldn't we lose the ability to simulate ghosts etc. from bright stars outside the FOV? (Admittedly, that's probably not crucial for DC1, since we're setting all stars brighter than 10th mag to mag=10 in order to save CPU time.)

Yes, you are right; that would be an advantage of the current approach. I suppose you'd have to keep the bright sources.

> Is there a harm to saving further CPU time by not simulating sensors covered by a dithered pointing that fall outside our chosen survey region? If we culled the instance catalog as you suggested, simulating those sensors would still populate them with photons from the moon etc., so I would naively think it could use significant CPU time.

No, this is not an issue, because the catalog has a "minsource" command for this very purpose. If you set it to 1 (also the default), PhoSim will not simulate a chip with only the background light on it if there isn't at least one source.

john
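For reference, `minsource` is set in the PhoSim instance-catalog header; a minimal illustrative fragment (all values are placeholders, not DC1 settings):

```text
rightascension  53.00
declination    -27.44
obshistid        1000
filter              2
minsource           1
```

With `minsource 1`, a chip whose trimmed catalog contains no source is skipped entirely, background and all.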

TomGlanzman commented 7 years ago

Thanks @humnaawan, I have done a few simple checks on the file and all looks good.

I have decided to let the current set of sensors in the DC1 project go ahead and finish. Meanwhile, I will retool the workflow to add in the new sensors.

cwwalter commented 7 years ago

The 2nd pass was successfully completed for DC1.