hmm so bad news, or not too bad news? Ideally we would be aiming for strict equality I presume, but there are some random numbers that are not controllable.
I'd be interested in seeing the flux or magnitude at which these numbers diverge. I would think that at the bright end, these numbers would agree. Looking at the realized fluxes for the sources in the centroid files, the values do match at the outset, but eventually the random sequences diverge. imsim draws the objects roughly in order from brighter to dimmer, so the dimmer sources will show differences in the detections and measured properties.
There should not be any random numbers; if there are (and we can localise the discrepancy) I'll try to get them fixed.
In the bad old days we always blamed things like this on the floating point arithmetic (e.g. 64 bit stores and 80 bit registers), and while this is much better, are we sure that running identical code on different processors will result in identical results?
R
Hi Robert, The random sequences I referred to are the ones generated by the random number generators used by imSim to render these sensor-visits. I think that we are explicitly seeding every rng in the simulation code. Nonetheless, it appears that we can get different results on different processors, even if the architectures are the same.
To see where the random sequences started to diverge, I looked at the "centroid" files for R02_S21 on each of the three platforms. In these files, we record the `model_flux` in ADU, which is the integral of the extincted SED over the bandpass using our standard throughputs, and the `realized_flux`, which is a Poisson draw using that model value. Here are plots of the fractional difference `(realized_flux - model_flux)/model_flux` vs object # (the order in which each object is rendered).
The leftmost plot shows the first 5000 objects drawn; the middle plot is a zoom-in where the grid value of `realized_flux` diverges from the cori and theta values (at object 830), and the rightmost plot is a zoom-in where theta and cori diverge (at object 4465). In both cases, the divergences occur while rendering the `RandomKnots` galaxy components. There are several branches in the code to choose different rendering options (no sensor model, simplified SED, etc.), and these branches are selected based on floating-point comparisons. Taking a different branch will definitely cause the random sequences to differ thereafter.
@jchiang87 so you mean that floating-point comparisons are actually the root cause here? Is it expected in terms of the numerical precision needed for these comparisons, or in terms of accumulated floating-point errors?
When I first read Jim's comment it made sense, but now it doesn't! Those floating point comparisons are going to lead to divergent code paths based on the randoms, but I don't understand why they are not deterministic.
so you mean that floating-point comparisons are actually the root cause here? Is it expected in terms of numerical precision needed for these comparison, or in terms of accumulated floating point errors?
That's my suspicion, but I don't know for a fact that that's the cause. I would, of course, be happy to hear alternative explanations that actually make sense.
I don't understand why they are not deterministic.
I reckon the reason would be the same as what motivated this comment:
In the bad old days we always blamed things like this on the floating point arithmetic (e.g. 64 bit stores and 80 bit registers), and while this is much better, are we sure that running identical code on different processors will result in identical results?
Of course, for a given processor, the outcomes are deterministic. They're just different among different processors, which is what the plot above shows.
Are the cori and ANL KNL architectures different?
You should get the same answer on the two systems.
After chatting with @johannct, we think that this may break DIA processing. Would it be possible to run DIA using as a first epoch one realization (say, theta) and as a second epoch a different one (say grid or cori)?
and I am not sure what it does to the coadds..... I'd feel less worried if we could understand better what is going on.... From a cursory look at Javier's table, it seems that GRID is often closer to THETA than to CORI. As a matter of fact, looking at R02_S21 chosen by Jim, Javi's numbers do not give a hint that GRID diverges significantly before the other two.
If there is a site dependence, then it is in principle possible that the imsim behavior on the grid depends on the node where the job is run. I think that this should be understood (and solved?) before we start the production.
I compared the measured `base_SdssShape` magnitudes from several visits and I'm getting consistent results (with scatter, but it looks like they are unbiased). I have other visits as well if you are curious. I'm putting a link to a preliminary notebook here.
thanks @fjaviersanchez ! So 1/ Jim is right that the differences are very small at the bright end (though not strictly 0), and 2/ at the faint end (mag ~23, so not even the single-visit depth) we can reach 10% absolute, ~~which seems quite a worrying issue now.~~ edit: no it is not; at mag=23 and for a 27-deep baseline one needs about 0.3 deltamag to get a DIA detection at 5 sigma (thx @rearmstr for the elementary maths reminder!)
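For reference, a rough sketch of the kind of estimate mentioned above; the effective 5σ limiting magnitude of the difference image (`m_lim`) is an assumed input here, not a number taken from this thread:

```python
import numpy as np

# Smallest magnitude change detectable at 5 sigma in a difference image,
# assuming the noise is set by an effective 5-sigma limiting magnitude m_lim:
# the flux change must exceed the 5-sigma limiting flux, i.e.
#   delta_m >= 2.5 * log10(1 + 10**(-0.4 * (m_lim - m)))
def min_detectable_dmag(m, m_lim):
    return 2.5 * np.log10(1.0 + 10.0 ** (-0.4 * (m_lim - m)))

# e.g. a mag-23 source against an assumed ~24.5 effective depth gives ~0.24 mag,
# the same order as the ~0.3 mag quoted above.
print(min_detectable_dmag(23.0, 24.5))
```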
There is something that I do not understand: if the galaxy knots rendering is chosen at run time and if it depends on a random number, it means that the same galaxy will not necessarily get the same knots rendering in different visits, even if they are simulated on the same computing architecture? If this is the case, it looks like it is a serious problem, no?
There is something that I do not understand: if the galaxy knots rendering is chosen at run time and if it depends on a random number, it means that the same galaxy will not necessarily get the same knots rendering in different visits, even if they are simulated on the same computing architecture? If this is the case, it looks like it is a serious problem, no?
Hmm... I think the random knot information is supposed to be seeded from UID info. So, I think it is supposed to be the same each time. But, I wonder if the RNGs we are using are guaranteed to give the same sequence on different architectures. @EiffL can probably comment more.
PRNGs should never be written with a system dependence, hopefully this is not the case here.
But the sequence seems to be identical for a while, before diverging, even for the GRID. So the generic handling of the PRNG does not seem completely faulty, no? Moreover, divergence occurs even on very similar architectures, like the KNL in Theta and Cori. I am leaning toward Jim's hypothesis that this has to do with uncontrolled numerical computations used in code bifurcations....
It's definitely the case that we make decisions on how to handle very bright and very dim objects and that could depend on machine issues. So, when we get to one of those decisions, the sequence can diverge as the code is written now. But, I thought I saw someone say that things were diverging when we did the random knots. That is what surprised me.
😲 ok, I didn't foresee this kind of issue... Indeed @boutigny, we took extra care when we implemented the knots to make sure they have their own RNG, seeded from the galaxy UID, in principle resulting in the same sequence of knot positions at every visit, no matter what else happens in the simulation code or in what order the objects are drawn, so that the same galaxy always appears the same way in different visits. Here is where this happens: https://github.com/lsst/sims_GalSimInterface/blob/52aa252a919e8e5ec5145d7bfb4ab3fb755aaa3b/python/lsst/sims/GalSimInterface/galSimInterpreter.py#L441
I really don't see what could go wrong here. This object-specific RNG is only used to sample the positions, and then a global rng is used to draw the Poisson realization here: https://github.com/lsst/sims_GalSimInterface/blob/52aa252a919e8e5ec5145d7bfb4ab3fb755aaa3b/python/lsst/sims/GalSimInterface/galSimInterpreter.py#L303
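For readers following along, here is a minimal sketch (not the actual `sims_GalSimInterface` code; the names and parameters are illustrative) of the scheme described above, with a per-object RNG for the knot positions and a shared global RNG for the Poisson flux realization:

```python
import galsim

global_rng = galsim.BaseDeviate(1234)  # assumed: seeded once per sensor-visit

def make_knots_component(unique_id, n_knots, hlr, model_flux):
    # Per-object RNG seeded on the galaxy UID: the same galaxy gets the same
    # knot positions in every visit, independent of draw order.
    obj_rng = galsim.BaseDeviate(int(unique_id))
    gal = galsim.RandomKnots(n_knots, half_light_radius=hlr,
                             flux=model_flux, rng=obj_rng)
    # Shared global RNG: the realized (Poisson-drawn) flux depends on every
    # draw made before this object, which is where any cross-site divergence
    # propagates to all later objects.
    realized_flux = galsim.PoissonDeviate(global_rng, mean=model_flux)()
    return gal, realized_flux
```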
Thanks @EiffL. I agree, I don't see any reason why this could go wrong. Reading again what @jchiang87 wrote, we need to investigate things upstream when the rendering option is selected.
After chatting with @johannct, we think that this may break DIA processing. Would it be possible to run DIA using as a first epoch one realization (say, theta) and as a second epoch a different one (say grid or cori)?
I don't understand. What would "break" DIA processing?
This current issue is really about whether we are simulating the images the same way across different systems.
We are using the DM Science Pipelines code as one way to generate summary numbers to do this comparison. But there's really nothing about the Science Pipelines that's fundamentally related here. And there's no particular sign that there's anything egregiously wrong in the generated images that would lead to any problems in the DM Science Pipelines code processing. Adding DIA processing to generate some more summary numbers seems likely to add another layer of numbers but not necessarily bring additional insight.
@rmjarvis What is the underlying random number generator being used in GalSim? Should the sequence be identical independent of architecture?
Yes. It's the random number generator from a specific boost version. We copied over the parts we needed and ship them along with GalSim so as not to be dependent on a user-installed boost version. So it should be the same for any system.
I should add that we even have unit tests to check this. So any system where the test suite passes is getting the same random number sequence for some particular seed as I get on my laptop.
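A minimal check in the spirit of the unit tests mentioned above (a sketch, not the actual GalSim test code):

```python
import galsim

ud = galsim.UniformDeviate(31415)
first = [ud() for _ in range(5)]

ud.seed(31415)           # reset to the identical seed
second = [ud() for _ in range(5)]

# For a given seed, the sequence is fully deterministic; the GalSim tests
# additionally compare against reference values, so any platform passing the
# test suite produces the same sequence for a given seed.
assert first == second
```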
Grid image - Cori image for Visit 00479028_R10_S22 (Grey band at top and just a bit at the bottom are artifacts from the non-data regions of the image arrays.)
😱
Theta image - Cori image for Visit 00479028_R10_S22 (Grey band at top and just a bit at the bottom are artifacts from the non-data regions of the image arrays.)
Above are the diffs of {grid, theta} - cori image for R10, S22 for this visit. Scales are the same on the two diffs. Units are counts.
| Image | Mean | Median | Std Deviation |
|---|---|---|---|
| cori | 5146.742319443647 | 5494.000000000000 | 2333.088659641896 |
| grid | 5146.748280244715 | 5494.000000000000 | 2333.422048109394 |
| theta | 5146.750324922449 | 5494.000000000000 | 2333.422655712899 |
| cori - cori | 0.000000000000 | 0.000000000000 | 0.000000000000 |
| grid - cori | 0.005960801068 | 0.000000000000 | 11.866331862836 |
| theta - cori | 0.008005478803 | 0.000000000000 | 11.611791514507 |
The grey bands at the top (thick) and bottom (much thinner) of the diffs, where they exactly agree, are artifacts of my rushed creation of the diffs: I just wrote out basic FITS files instead of including only the data regions.
Science image (For the record this is Cori, but you wouldn't be able to tell the difference).
There are places in the code where we make a decision based on some calculation, like whether to switch to FFT (if saturated and really bright), or whether to switch to a simpler SED and turn off the silicon sensor (faint objects that are only there to provide useful "fluff" for blending).
If these calculations have slight numerical differences, then particular objects could have a different choice on different machines. This on its own wouldn't be so bad (some objects would be different, but maybe not so many), but if one of them exercises the random number generator and the other doesn't then all objects after that will have a different random number sequence.
We had talked about starting new random number sequences for each object to try to avoid this, but I don't know if it was ever implemented. I think it was probably discussion for improvements for DC3, not something we were planning to get in for DC2.
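A toy illustration of this failure mode (not the imSim code; the threshold and branch logic are made up for the example):

```python
import galsim

def draw_object(flux, faint_threshold, rng):
    # Hypothetical bifurcation: faint objects take a simplified path that
    # consumes no random deviates, bright objects consume one.
    if flux < faint_threshold:   # a flux within machine epsilon of the
        return flux              # threshold can flip this branch between machines
    return flux + rng()          # this branch advances the shared RNG by one draw

rng = galsim.UniformDeviate(57721)
fluxes = [100.0, 10.0 + 1e-13, 1.0]   # the middle object sits right on the edge
results = [draw_object(f, faint_threshold=10.0, rng=rng) for f in fluxes]
# If one platform takes the faint branch for the borderline object and another
# does not, every object drawn afterwards sees a shifted random sequence, even
# though the generator itself is fully deterministic.
```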
We had talked about starting new random number sequences for each object to try to avoid this, but I don't know if it was ever implemented. I think it was probably discussion for improvements for DC3, not something we were planning to get in for DC2.
We didn't do this for this version.
What I would expect, if that is all that is happening, is the following: if you watch the centroid file, things should track exactly until you hit a bifurcation point, and then they will diverge. If it is really true that this is happening when we use random knots, then that is different and not expected. Someone should clarify if that is really the case.
Do you mean the random knots are the bifurcation points? Or that everything is the same except for the random knot galaxies?
No.. sorry. I mean as far as I know random knots are not bifurcation points. Those are given in the instance file. The bifurcation points should be if we decide to use FFTs, or if things are dim enough that we decide to use simple SEDs and/or skip the sensor model.
In principle we can ensure that all the bifurcations consume the same number of randoms, even if some draws are simply discarded in some cases. I do not know whether it is easier to implement than a new seed for each object.
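A sketch of that idea under the same toy setup as in the sketch above (an illustration, not a proposed patch): draw the deviate unconditionally and discard it on the branch that does not need it.

```python
def draw_object(flux, faint_threshold, rng):
    deviate = rng()              # always advance the shared RNG by one draw
    if flux < faint_threshold:
        return flux              # the deviate is simply discarded on this branch
    return flux + deviate
```

For branches that would otherwise consume many draws, GalSim's `BaseDeviate.discard(n)` can skip a fixed number of deviates cheaply rather than drawing and ignoring them one at a time.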
For the R10, S22 analyzed above, the 'Realized flux' in the centroid files is the same for the first object but is different starting with the second object and for all subsequent objects.
(The objects are ordered in decreasing brightness.)
Can you check a few more? These bifurcations shouldn't happen in every case. Only if you hit a machine precision issue right on the edge of the comparison.
I don't think my statement is inconsistent with @jchiang87 's post above which was looking at relative flux variation.
https://github.com/LSSTDESC/DC2-production/issues/375#issuecomment-549078508
I think that the absolute flux variation is present for all objects, even if it's minimal for the brightest objects.
If it's right away, that makes it not too hard to test. We can add a bunch of print statements saying what is going on at each step and what the repr of the random number generator is.
Then break out after say 3 objects. This should be plenty to find where the difference starts to happen.
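Hypothetical instrumentation along those lines, reusing the toy `draw_object` from the sketch further up as a stand-in for the real per-object rendering call (in imSim the same idea would wrap its drawing loop):

```python
import galsim

def draw_object(flux, faint_threshold, rng):
    # Stand-in for the real rendering call; see the earlier toy example.
    if flux < faint_threshold:
        return flux
    return flux + rng()

rng = galsim.UniformDeviate(57721)
objects_to_draw = [100.0, 10.0, 1.0, 0.5]        # placeholder object list

# Log the repr of the shared RNG (which encodes its state) around each object
# and stop after three objects, so logs from two platforms can be diffed to
# locate the first divergence in the random sequence.
for i, flux in enumerate(objects_to_draw):
    print(f"object {i}: rng before draw = {rng!r}")
    draw_object(flux, faint_threshold=10.0, rng=rng)
    print(f"object {i}: rng after draw  = {rng!r}")
    if i == 2:                                   # break out after 3 objects
        break
```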
I've looked at 4 more comparisons. They all differ in 'Realized flux' starting with the second object.
OK thanks. So this is not just a machine precision error. We went through this before and @jchiang87 had a version of the code where he traced the RNG repr (and I thought we confirmed this was OK..).
Jim, do you still have that test?
Jim, do you still have that test?
Probably, but I'd have to dig it up. I'm way busy with camera stuff and would like to comment on this thread, but there are so many posts being made in rapid succession that my comments are out of date by the time I type them. It would be good to step back, make a considered assessment of what's happening, and present a complete picture rather than make a bunch of smaller posts that tend to fuel speculation. I've been hoping to find the time to do exactly that.
Here are statistics for pair-wise comparisons between the centroid file contents for the sensor-visits in common among all three sites. The columns are CCD, # of mismatched model fluxes, # of mismatched realized fluxes, and the clipped standard deviation of the pulls of mismatched realized fluxes. The pull values are the differences between realized fluxes divided by the sqrt of 2*model_flux.
cori vs theta

| CCD | Mismatched model fluxes | Mismatched realized fluxes | Clipped std dev of pulls |
|---|---|---|---|
| R01_S00 | 0 | 176521 | 1.05 |
| R01_S01 | 0 | 161391 | 1.05 |
| R01_S02 | 0 | 148017 | 1.05 |
| R01_S10 | 0 | 157093 | 1.05 |
| R02_S21 | 0 | 160852 | 1.05 |
| R02_S22 | 0 | 161841 | 1.05 |
| R03_S00 | 0 | 165664 | 1.05 |
| R03_S01 | 0 | 168696 | 1.05 |
| R10_S12 | 0 | 154637 | 1.05 |
| R10_S20 | 0 | 154505 | 1.05 |
| R10_S21 | 0 | 166137 | 1.05 |
| R10_S22 | 0 | 165562 | 1.05 |

cori vs grid

| CCD | Mismatched model fluxes | Mismatched realized fluxes | Clipped std dev of pulls |
|---|---|---|---|
| R01_S00 | 0 | 171513 | 1.05 |
| R01_S01 | 0 | 152863 | 1.05 |
| R01_S02 | 0 | 149112 | 1.05 |
| R01_S10 | 0 | 157093 | 1.05 |
| R02_S21 | 0 | 160733 | 1.05 |
| R02_S22 | 0 | 167198 | 1.05 |
| R03_S00 | 0 | 142698 | 1.05 |
| R03_S01 | 0 | 168433 | 1.05 |
| R10_S12 | 0 | 153826 | 1.05 |
| R10_S20 | 0 | 159016 | 1.05 |
| R10_S21 | 0 | 126806 | 1.06 |
| R10_S22 | 0 | 165448 | 1.05 |

grid vs theta

| CCD | Mismatched model fluxes | Mismatched realized fluxes | Clipped std dev of pulls |
|---|---|---|---|
| R01_S00 | 0 | 176324 | 1.05 |
| R01_S01 | 0 | 161739 | 1.05 |
| R01_S02 | 0 | 120668 | 1.06 |
| R01_S10 | 0 | 0 | N/A |
| R02_S21 | 0 | 153893 | 1.05 |
| R02_S22 | 0 | 167019 | 1.05 |
| R03_S00 | 0 | 165738 | 1.04 |
| R03_S01 | 0 | 168494 | 1.05 |
| R10_S12 | 0 | 158254 | 1.05 |
| R10_S20 | 0 | 159158 | 1.05 |
| R10_S21 | 0 | 166285 | 1.05 |
| R10_S22 | 0 | 151731 | 1.05 |
The model fluxes match, hence the zeros in the second column, and the realized fluxes are consistent with Poisson statistics.
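A minimal sketch of the pull statistic described above (the array names are assumptions about the centroid-file contents):

```python
import numpy as np

def realized_flux_pulls(model_flux, realized_a, realized_b):
    # Each realized flux is an independent Poisson draw with variance
    # ~model_flux, so the difference between two sites has variance
    # ~2*model_flux; the pulls should then be roughly unit-normal if the two
    # sites agree statistically, consistent with the ~1.05 widths above.
    return (realized_a - realized_b) / np.sqrt(2.0 * model_flux)
```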
As a check of the FITS images for the three pairs of sites, I compute the difference and summed images for each CCD. For the summed images, I subtract off the bias levels, convert to e-/pixel and subtract off the median pixel value for each segment so that the summed images have just object counts in electrons. Those counts will be used as the Poisson variance in the pull distributions. Since even rendered point sources contribute to several pixels in an image, single pixel statistics will be biased by correlations, so I rebin the data at several different scales. Here are the pull distributions for each pair of sites and for each of those NxN rebinnings. The legends give N and the stdev of the resulting distribution.
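A sketch of the rebinned pull calculation described above (assumes the difference image and the bias-subtracted summed image are already in e-/pixel; `N` is the rebinning factor):

```python
import numpy as np

def rebinned_pulls(diff_image, summed_image, N):
    # Trim to a multiple of N, then sum NxN blocks of pixels to dilute the
    # pixel-to-pixel correlations introduced by each source spanning
    # several pixels.
    ny = (diff_image.shape[0] // N) * N
    nx = (diff_image.shape[1] // N) * N
    diff = diff_image[:ny, :nx].reshape(ny // N, N, nx // N, N).sum(axis=(1, 3))
    counts = summed_image[:ny, :nx].reshape(ny // N, N, nx // N, N).sum(axis=(1, 3))
    # Use the summed object counts as the Poisson variance of the difference,
    # skipping empty bins.
    good = counts > 0
    return diff[good] / np.sqrt(counts[good])
```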
This has been finished.
I'm starting basic validation tests on DC2 2.2i, comparing the number of detected objects (left number) and the number detected with `deblend_nChild==0` (right number), followed by the system on which the visit was simulated. These are the results so far (for visits that have been simulated on at least two systems). I'll keep adding test results and I'll post a notebook here after I'm done.