hmm so bad news, or not too bad news? Ideally we would be aiming for strict equality I presume, but there are some random numbers that are not controllable.
I'd be interested in seeing the flux or magnitude at which these numbers diverge. I would think that at the bright end, these numbers would agree. Looking at the realized fluxes for the sources in the centroid files, the values do match at the outset, but eventually the random sequences diverge. imsim draws the objects roughly in order from brighter to dimmer, so the dimmer sources will show differences in the detections and measured properties.
There should not be any random numbers; if there are (and we can localise the discrepancy) I'll try to get them fixed.
In the bad old days we always blamed things like this on the floating point arithmetic (e.g. 64 bit stores and 80 bit registers), and while this is much better, are we sure that running identical code on different processors will result in identical results?
R
Hi Robert, The random sequences I referred to are the ones generated by the random number generators used by imSim to render these sensor-visits. I think that we are explicitly seeding every rng in the simulation code. Nonetheless, it appears that we can get different results on different processors, even if the architectures are the same.
To see where the random sequences started to diverge, I looked at the "centroid" files for R02_S21 on each of the three platforms. In these files, we record the `model_flux` in ADU, which is the integral of the extincted SED over the bandpass using our standard throughputs, and the `realized_flux`, which is a Poisson draw using that model value. Here are plots of the fractional difference `(realized_flux - model_flux)/model_flux` vs object # (the order in which each object is rendered).
The leftmost plot shows the first 5000 objects drawn; the middle plot is a zoom-in where the grid value of `realized_flux` diverges from the cori and theta values (at object 830), and the rightmost plot is a zoom-in where theta and cori diverge (at object 4465). In both cases, the divergences occur while rendering the `RandomKnots` galaxy components. There are several branches in the code to choose different rendering options (no sensor model, simplified SED, etc.), and these branches are selected based on floating-point comparisons. Taking a different branch will definitely cause the random sequences to differ thereafter.
@jchiang87 so you mean that floating-point comparisons are actually the root cause here? Is it expected in terms of the numerical precision needed for these comparisons, or in terms of accumulated floating-point errors?
When I first read Jim's comment it made sense, but now it doesn't! Those floating point comparisons are going to lead to divergent code paths based on the randoms, but I don't understand why they are not deterministic.
so you mean that floating-point comparisons are actually the root cause here? Is it expected in terms of numerical precision needed for these comparison, or in terms of accumulated floating point errors?
That's my suspicion, but I don't know for a fact that that's the cause. I would, of course, be happy to hear alternative explanations that actually make sense.
I don't understand why they are not deterministic.
I reckon the reason would be the same as what motivated this comment:
In the bad old days we always blamed things like this on the floating point arithmetic (e.g. 64 bit stores and 80 bit registers), and while this is much better, are we sure that running identical code on different processors will result in identical results?
Of course, for a given processor, the outcomes are deterministic. They're just different among different processors, which is what the plot above shows.
Are the cori and ANL KNL architectures different?
You should get the same answer on the two systems.
After chatting with @johannct, we think that this may break DIA processing. Would it be possible to run DIA using as a first epoch one realization (say, theta) and as a second epoch a different one (say grid or cori)?
and I am not sure what it does to the coadds..... I'd feel less worried if we could understand better what is going on.... From a cursory look at Javier's table, it seems that GRID is often closer to THETA than to CORI. As a matter of fact, looking at R02_S21 chosen by Jim, Javi's numbers do not give a hint that GRID diverges significantly before the other two.
If there is a site dependence, then it is in principle possible that the imsim behavior on the grid depends on the node where the job is run. I think that this should be understood (and solved?) before we start the production.
I compared the measured `base_SdssShape` magnitudes from several visits and I'm getting consistent results (with scatter, but it looks like they are unbiased). I have other visits as well if you are curious. I'm putting a link to a preliminary notebook here.
thanks @fjaviersanchez ! So 1/ Jim is right that the differences are very small at the bright end (though not strictly 0), and 2/ at the faint end (mag ~23, so not even the single-visit depth) we can reach 10% absolute, ~~which seems quite a worrying issue now.~~ edit: no it is not; at mag=23 and for a 27-deep baseline one needs about 0.3 deltamag to get a DIA detection at 5 sigma (thx @rearmstr for the elementary maths reminder!)
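For reference, a rough sketch of the kind of estimate mentioned above; the effective 5σ limiting magnitude of the difference image (`m_lim`) is an assumed input here, not a number taken from this thread:

```python
import numpy as np

# Smallest magnitude change detectable at 5 sigma in a difference image,
# assuming the noise is set by an effective 5-sigma limiting magnitude m_lim:
# the flux change must exceed the 5-sigma limiting flux, i.e.
#   delta_m >= 2.5 * log10(1 + 10**(-0.4 * (m_lim - m)))
def min_detectable_dmag(m, m_lim):
    return 2.5 * np.log10(1.0 + 10.0 ** (-0.4 * (m_lim - m)))

# e.g. a mag-23 source against an assumed ~24.5 effective depth gives ~0.24 mag,
# the same order as the ~0.3 mag quoted above.
print(min_detectable_dmag(23.0, 24.5))
```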
There is something that I do not understand: if the galaxy knots rendering is chosen at run time and if it depends on a random number, it means that the same galaxy will not necessarily get the same knots rendering in different visits, even if they are simulated on the same computing architecture? If this is the case, it looks like it is a serious problem, no?
There is something that I do not understand: if the galaxy knots rendering is chosen at run time and if it depends on a random number, it means that the same galaxy will not necessarily get the same knots rendering in different visits, even if they are simulated on the same computing architecture? If this is the case, it looks like it is a serious problem, no?
Hmm... I think the random knot information is supposed to be seeded from UID info. So, I think it is supposed to be the same each time. But, I wonder if the RNGs we are using are guaranteed to give the same sequence on different architectures. @EiffL can probably comment more.
PRNGs should never be written with a system dependence, hopefully this is not the case here.
But the sequence seems to be identical for a while, before diverging, even for the GRID. So the generic handling of the PRNG does not seem completely faulty, no? Moreover, divergence occurs even on very similar architectures, like the KNL in Theta and Cori. I am leaning toward Jim's hypothesis that this has to do with uncontrolled numerical computations used in code bifurcations....
It's definitely the case that we make decisions on how to handle very bright and very dim objects and that could depend on machine issues. So, when we get to one of those decisions, the sequence can diverge as the code is written now. But, I thought I saw someone say that things were diverging when we did the random knots. That is what surprised me.
😲 ok, I didn't foresee this kind of issue... Indeed @boutigny, we took extra care when we implemented the knots to make sure they have their own RNG, seeded from the galaxy UID, in principle resulting in the same sequence of knot positions at every visit, no matter what else happens in the simulation code or in what order the objects are drawn, so that the same galaxy always appears the same way in different visits. Here is where this happens: https://github.com/lsst/sims_GalSimInterface/blob/52aa252a919e8e5ec5145d7bfb4ab3fb755aaa3b/python/lsst/sims/GalSimInterface/galSimInterpreter.py#L441
I really don't see what could go wrong here. This object-specific RNG is only used to sample the positions, and then a global rng is used to draw the Poisson realization here: https://github.com/lsst/sims_GalSimInterface/blob/52aa252a919e8e5ec5145d7bfb4ab3fb755aaa3b/python/lsst/sims/GalSimInterface/galSimInterpreter.py#L303
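For readers following along, here is a minimal sketch (not the actual `sims_GalSimInterface` code; the names and parameters are illustrative) of the scheme described above, with a per-object RNG for the knot positions and a shared global RNG for the Poisson flux realization:

```python
import galsim

global_rng = galsim.BaseDeviate(1234)  # assumed: seeded once per sensor-visit

def make_knots_component(unique_id, n_knots, hlr, model_flux):
    # Per-object RNG seeded on the galaxy UID: the same galaxy gets the same
    # knot positions in every visit, independent of draw order.
    obj_rng = galsim.BaseDeviate(int(unique_id))
    gal = galsim.RandomKnots(n_knots, half_light_radius=hlr,
                             flux=model_flux, rng=obj_rng)
    # Shared global RNG: the realized (Poisson-drawn) flux depends on every
    # draw made before this object, which is where any cross-site divergence
    # propagates to all later objects.
    realized_flux = galsim.PoissonDeviate(global_rng, mean=model_flux)()
    return gal, realized_flux
```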
Thanks @EiffL. I agree, I don't see any reason why this could go wrong. Reading again what @jchiang87 wrote, we need to investigate things upstream when the rendering option is selected.
After chatting with @johannct, we think that this may break DIA processing. Would it be possible to run DIA using as a first epoch one realization (say, theta) and as a second epoch a different one (say grid or cori)?
I don't understand. What would "break" DIA processing?
This current issue is really about whether we are simulating the images the same way across different systems.
We are using the DM Science Pipelines code as one way to generate summary numbers to do this comparison. But there's really nothing about the Science Pipelines that's fundamentally related here. And there's no particular sign that there's anything egregiously wrong in the generated images that would lead to any problems in the DM Science Pipelines code processing. Adding DIA processing to generate some more summary numbers seems likely to add another layer of numbers but not necessarily bring additional insight.
@rmjarvis What is the underlying random number generator being used in GalSim? Should the sequence be identical independent of architecture?
Yes. It's the random number generator from a specific boost version. We copied over the parts we needed and ship them along with GalSim so as not to be dependent on a user-installed boost version. So it should be the same for any system.
I should add that we even have unit tests to check this. So any system where the test suite passes is getting the same random number sequence for some particular seed as I get on my laptop.
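A minimal check in the spirit of the unit tests mentioned above (a sketch, not the actual GalSim test code):

```python
import galsim

ud = galsim.UniformDeviate(31415)
first = [ud() for _ in range(5)]

ud.seed(31415)           # reset to the identical seed
second = [ud() for _ in range(5)]

# For a given seed, the sequence is fully deterministic; the GalSim tests
# additionally compare against reference values, so any platform passing the
# test suite produces the same sequence for a given seed.
assert first == second
```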
Grid image - Cori image for Visit 00479028_R10_S22 (Grey band at top and just a bit at the bottom are artifacts from the non-data regions of the image arrays.)
😱
Theta image - Cori image for Visit 00479028_R10_S22 (Grey band at top and just a bit at the bottom are artifacts from the non-data regions of the image arrays.)
Above are the diffs of {grid, theta} - cori image for R10, S22 for this visit. Scales are the same on the two diffs. Units are counts.
| Image | Mean | Median | Std Deviation |
|---|---|---|---|
| cori | 5146.742319443647 | 5494.000000000000 | 2333.088659641896 |
| grid | 5146.748280244715 | 5494.000000000000 | 2333.422048109394 |
| theta | 5146.750324922449 | 5494.000000000000 | 2333.422655712899 |
| cori - cori | 0.000000000000 | 0.000000000000 | 0.000000000000 |
| grid - cori | 0.005960801068 | 0.000000000000 | 11.866331862836 |
| theta - cori | 0.008005478803 | 0.000000000000 | 11.611791514507 |
The grey bands at the top (thick) and bottom (much thinner) of the diffs, where they exactly agree, are artifacts of my rushed creation of the diffs: I just wrote out basic FITS files instead of including only the data regions.
Science image (For the record this is Cori, but you wouldn't be able to tell the difference).
There are places in the code where we make a decision based on some calculation, like whether to switch to FFT (if saturated and really bright), or whether to switch to a simpler SED and turn off the silicon sensor (faint objects that are only there to provide useful "fluff" for blending).
If these calculations have slight numerical differences, then particular objects could have a different choice on different machines. This on its own wouldn't be so bad (some objects would be different, but maybe not so many), but if one of them exercises the random number generator and the other doesn't then all objects after that will have a different random number sequence.
We had talked about starting new random number sequences for each object to try to avoid this, but I don't know if it was ever implemented. I think it was probably discussion for improvements for DC3, not something we were planning to get in for DC2.
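A toy illustration of this failure mode (not the imSim code; the threshold and branch logic are made up for the example):

```python
import galsim

def draw_object(flux, faint_threshold, rng):
    # Hypothetical bifurcation: faint objects take a simplified path that
    # consumes no random deviates, bright objects consume one.
    if flux < faint_threshold:   # a flux within machine epsilon of the
        return flux              # threshold can flip this branch between machines
    return flux + rng()          # this branch advances the shared RNG by one draw

rng = galsim.UniformDeviate(57721)
fluxes = [100.0, 10.0 + 1e-13, 1.0]   # the middle object sits right on the edge
results = [draw_object(f, faint_threshold=10.0, rng=rng) for f in fluxes]
# If one platform takes the faint branch for the borderline object and another
# does not, every object drawn afterwards sees a shifted random sequence, even
# though the generator itself is fully deterministic.
```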
We had talked about starting new random number sequences for each object to try to avoid this, but I don't know if it was ever implemented. I think it was probably discussion for improvements for DC3, not something we were planning to get in for DC2.
We didn't do this for this version.
What I would expect, if that is all that is happening, is the following: if you watch the centroid file, things should track exactly until you hit a bifurcation point, and then they will diverge. If it is really true that this is happening when we use random knots, then that is different and not expected. Someone should clarify if that is really the case.
Do you mean the random knots are the bifurcation points? Or that everything is the same except for the random knot galaxies?
No.. sorry. I mean as far as I know random knots are not bifurcation points. Those are given in the instance file. The bifurcation points should be if we decide to use FFTs, or if things are dim enough that we decide to use simple SEDs and/or skip the sensor model.
In principle we can ensure that all the bifurcations consume the same number of randoms, even if some draws are simply discarded in some cases. I do not know whether it is easier to implement than a new seed for each object.
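A sketch of that idea under the same toy setup as in the sketch above (an illustration, not a proposed patch): draw the deviate unconditionally and discard it on the branch that does not need it.

```python
def draw_object(flux, faint_threshold, rng):
    deviate = rng()              # always advance the shared RNG by one draw
    if flux < faint_threshold:
        return flux              # the deviate is simply discarded on this branch
    return flux + deviate
```

For branches that would otherwise consume many draws, GalSim's `BaseDeviate.discard(n)` can skip a fixed number of deviates cheaply rather than drawing and ignoring them one at a time.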
For the R10, S22 analyzed above, the 'Realized flux' in the centroid files is the same for the first object but is different starting with the second object and for all subsequent objects.
(The objects are ordered in decreasing brightness.)
Can you check a few more? These bifurcations shouldn't happen in every case. Only if you hit a machine precision issue right on the edge of the comparison.
I don't think my statement is inconsistent with @jchiang87 's post above which was looking at relative flux variation.
https://github.com/LSSTDESC/DC2-production/issues/375#issuecomment-549078508
I think that the absolute flux variation is present for all objects, even if it's minimal for the brightest objects.
If it's right away, that makes it not too hard to test. We can add a bunch of print statements saying what is going on at each step and what the repr of the random number generator is.
Then break out after say 3 objects. This should be plenty to find where the difference starts to happen.
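Hypothetical instrumentation along those lines, reusing the toy `draw_object` from the sketch further up as a stand-in for the real per-object rendering call (in imSim the same idea would wrap its drawing loop):

```python
import galsim

def draw_object(flux, faint_threshold, rng):
    # Stand-in for the real rendering call; see the earlier toy example.
    if flux < faint_threshold:
        return flux
    return flux + rng()

rng = galsim.UniformDeviate(57721)
objects_to_draw = [100.0, 10.0, 1.0, 0.5]        # placeholder object list

# Log the repr of the shared RNG (which encodes its state) around each object
# and stop after three objects, so logs from two platforms can be diffed to
# locate the first divergence in the random sequence.
for i, flux in enumerate(objects_to_draw):
    print(f"object {i}: rng before draw = {rng!r}")
    draw_object(flux, faint_threshold=10.0, rng=rng)
    print(f"object {i}: rng after draw  = {rng!r}")
    if i == 2:                                   # break out after 3 objects
        break
```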
I've looked at 4 more comparisons. They all differ in 'Realized flux' starting with the second object.
OK thanks. So this is not just a machine precision error. We went through this before and @jchiang87 had a version of the code where he traced the RNG repr (and I thought we confirmed this was OK..).
Jim, do you still have that test?
Jim, do you still have that test?
Probably, but I'd have to dig it up. I'm way busy with camera stuff and would like to comment on this thread, but there are so many posts being made in rapid succession that my comments are out of date by the time I type them. It would be good to step back, make a considered assessment of what's happening, and present a complete picture rather than make a bunch of smaller posts that tend to fuel speculation. I've been hoping to find the time to do exactly that.
Here are statistics for pair-wise comparisons between the centroid file contents for the sensor-visits in common among all three sites. The columns are CCD, # of mismatched model fluxes, # of mismatched realized fluxes, and the clipped standard deviation of the pulls of mismatched realized fluxes. The pull values are the differences between realized fluxes divided by the sqrt of 2*model_flux.
cori vs theta

| CCD | Mismatched model fluxes | Mismatched realized fluxes | Clipped std dev of pulls |
|---|---|---|---|
| R01_S00 | 0 | 176521 | 1.05 |
| R01_S01 | 0 | 161391 | 1.05 |
| R01_S02 | 0 | 148017 | 1.05 |
| R01_S10 | 0 | 157093 | 1.05 |
| R02_S21 | 0 | 160852 | 1.05 |
| R02_S22 | 0 | 161841 | 1.05 |
| R03_S00 | 0 | 165664 | 1.05 |
| R03_S01 | 0 | 168696 | 1.05 |
| R10_S12 | 0 | 154637 | 1.05 |
| R10_S20 | 0 | 154505 | 1.05 |
| R10_S21 | 0 | 166137 | 1.05 |
| R10_S22 | 0 | 165562 | 1.05 |

cori vs grid

| CCD | Mismatched model fluxes | Mismatched realized fluxes | Clipped std dev of pulls |
|---|---|---|---|
| R01_S00 | 0 | 171513 | 1.05 |
| R01_S01 | 0 | 152863 | 1.05 |
| R01_S02 | 0 | 149112 | 1.05 |
| R01_S10 | 0 | 157093 | 1.05 |
| R02_S21 | 0 | 160733 | 1.05 |
| R02_S22 | 0 | 167198 | 1.05 |
| R03_S00 | 0 | 142698 | 1.05 |
| R03_S01 | 0 | 168433 | 1.05 |
| R10_S12 | 0 | 153826 | 1.05 |
| R10_S20 | 0 | 159016 | 1.05 |
| R10_S21 | 0 | 126806 | 1.06 |
| R10_S22 | 0 | 165448 | 1.05 |

grid vs theta

| CCD | Mismatched model fluxes | Mismatched realized fluxes | Clipped std dev of pulls |
|---|---|---|---|
| R01_S00 | 0 | 176324 | 1.05 |
| R01_S01 | 0 | 161739 | 1.05 |
| R01_S02 | 0 | 120668 | 1.06 |
| R01_S10 | 0 | 0 | N/A |
| R02_S21 | 0 | 153893 | 1.05 |
| R02_S22 | 0 | 167019 | 1.05 |
| R03_S00 | 0 | 165738 | 1.04 |
| R03_S01 | 0 | 168494 | 1.05 |
| R10_S12 | 0 | 158254 | 1.05 |
| R10_S20 | 0 | 159158 | 1.05 |
| R10_S21 | 0 | 166285 | 1.05 |
| R10_S22 | 0 | 151731 | 1.05 |
The model fluxes match, hence the zeros in the second column, and the realized fluxes are consistent with Poisson statistics.
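A minimal sketch of the pull statistic described above (the array names are assumptions about the centroid-file contents):

```python
import numpy as np

def realized_flux_pulls(model_flux, realized_a, realized_b):
    # Each realized flux is an independent Poisson draw with variance
    # ~model_flux, so the difference between two sites has variance
    # ~2*model_flux; the pulls should then be roughly unit-normal if the two
    # sites agree statistically, consistent with the ~1.05 widths above.
    return (realized_a - realized_b) / np.sqrt(2.0 * model_flux)
```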
As a check of the FITS images for the three pairs of sites, I compute the difference and summed images for each CCD. For the summed images, I subtract off the bias levels, convert to e-/pixel and subtract off the median pixel value for each segment so that the summed images have just object counts in electrons. Those counts will be used as the Poisson variance in the pull distributions. Since even rendered point sources contribute to several pixels in an image, single pixel statistics will be biased by correlations, so I rebin the data at several different scales. Here are the pull distributions for each pair of sites and for each of those NxN rebinnings. The legends give N and the stdev of the resulting distribution.
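A sketch of the rebinned pull calculation described above (assumes the difference image and the bias-subtracted summed image are already in e-/pixel; `N` is the rebinning factor):

```python
import numpy as np

def rebinned_pulls(diff_image, summed_image, N):
    # Trim to a multiple of N, then sum NxN blocks of pixels to dilute the
    # pixel-to-pixel correlations introduced by each source spanning
    # several pixels.
    ny = (diff_image.shape[0] // N) * N
    nx = (diff_image.shape[1] // N) * N
    diff = diff_image[:ny, :nx].reshape(ny // N, N, nx // N, N).sum(axis=(1, 3))
    counts = summed_image[:ny, :nx].reshape(ny // N, N, nx // N, N).sum(axis=(1, 3))
    # Use the summed object counts as the Poisson variance of the difference,
    # skipping empty bins.
    good = counts > 0
    return diff[good] / np.sqrt(counts[good])
```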
This has been finished.
I'm starting basic validation tests on DC2 2.2i, comparing the number of detected objects (left number) and the number detected with `deblend_nChild==0` (right number), followed by the system on which the visit was simulated. These are the results so far (for visits that have been simulated on at least two systems). I'll keep adding test results and I'll post a notebook here after I'm done.