Jammy2211 opened this issue 3 years ago
Just got it 250,000 samples into a CubeSliceSampler run.
If the proposal is bad (too few steps, no jumps are accepted), your live points can thin out to be linearly dependent. Then you get a numerical problem. I think you need to run with more steps.
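The "linearly dependent live points" failure mode can be checked directly. Below is a minimal sketch of such a diagnostic; the function name and tolerance are illustrative, not UltraNest API:

```python
# Hypothetical diagnostic: detect whether a set of live points has become
# (near-)linearly dependent by checking the rank of their sample covariance.
import numpy as np

def live_points_degenerate(u, tol=1e-10):
    """Return True if the live points `u` (n_points x n_dims) span a
    degenerate subspace, i.e. their covariance matrix is (near-)singular."""
    cov = np.cov(u, rowvar=False)
    # A rank below the dimensionality means the points are linearly dependent.
    return np.linalg.matrix_rank(cov, tol=tol) < u.shape[1]

rng = np.random.default_rng(0)
# Healthy: points scattered across the unit cube.
print(live_points_degenerate(rng.uniform(size=(50, 3))))   # False
# Collapsed: all points (almost) identical.
collapsed = np.ones((50, 3)) + 1e-14 * rng.standard_normal((50, 3))
print(live_points_degenerate(collapsed))                   # True
```

When the covariance is singular like this, inverting it (as region construction does) raises a `LinAlgError`.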
I got the error with nsteps=100, but runs with 200 and 300 seem to be going ok so far.
It's probably worth motivating why we are keen to have nsteps so low. Basically, we have found that random walk nested sampling / step-based sampling is a very efficient method for our model-fitting problem. Using Dynesty, I did runs using a range of step values (5, 10, 15, 25, etc.) and found that we got equivalent results, give or take, for all of the runs, but obviously the run with 5 steps was roughly twice as fast as the one with 10, five times as fast as 25, and so on.
(I am sure a detailed inspection of the statistical properties of the results would show there are differences, but the log evidence values / parameter estimates did not vary much, and nowhere near the level of the systematic uncertainties that enter elsewhere.)
Computationally, we are already pushing the limit of our resources, so getting the model-fit to produce a reasonable result fast is the primary concern. If UltraNest's step-sampling results were on par with what we currently get from Dynesty, this would be fine, as it's the MPI support that is making me eager to make the swap to UltraNest. And before the crash, the two look like they are going at a similar pace to one another, which is promising.
So, I guess my question is: do you think there would be a way to 'hack' a fix? If the error came up, could the walk keep going until the proposal is 'good'? Reading the UltraNest docs, this seems to go against the philosophy of making sure the results are robust at the expense of run-time, but statistically questionable results are better than no results, right? :P
Actually, looking back at your error stack trace, I think the issue is elsewhere.
update_region tries to compute a new region. Sometimes this fails for numerical reasons (e.g., cannot invert the sample covariance). This is supposed to be okay, and the old region is just kept.
This is why this line https://github.com/JohannesBuchner/UltraNest/find/master#L1850 is wrapped in a try-catch block, accepting FloatingPointError and LinAlgError.
But in the region.compute_enlargement function here: https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/mlfriends.pyx#L814 there are a few asserts, which check that everything is kosher.
I think you hit the assert f > 0 here.
These should probably not raise AssertionErrors, but LinAlgErrors.
But there is potentially another problem here: https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/integrator.py#L353 If the code is run with MPI and an exception is raised, the program does not go into the MPI block. This may freeze the program, because the other processes wait for communication.
Perhaps it is safer for that function to catch any AssertionErrors from compute_enlargement and set r and f to NaNs.
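A minimal sketch of the exception conversion being suggested here, with compute_enlargement replaced by a toy stand-in (the real function lives in Cython and computes region enlargement factors):

```python
# Sketch: re-raise numerical AssertionErrors as LinAlgError, so that the
# caller's existing try/except (which accepts LinAlgError) handles them.
# `compute_enlargement` below is a toy stand-in, not UltraNest's function.
import numpy as np

def compute_enlargement(points):
    f = np.var(points)  # toy stand-in for the enlargement factor
    assert f > 0, "enlargement factor must be positive"
    return f

def safe_compute_enlargement(points):
    try:
        return compute_enlargement(points)
    except AssertionError as e:
        # Convert so existing LinAlgError handlers catch the failure.
        raise np.linalg.LinAlgError(str(e))

try:
    safe_compute_enlargement(np.ones(10))  # zero variance -> assert fails
except np.linalg.LinAlgError:
    print("caught as LinAlgError")
```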
Sounds promising, and like it can be fixed :).
Just FYI, these runs are not using MPI (I haven't got that far yet); just simple serial runs on my laptop before moving on to the HPC. Your response sounds like you know this and are just thinking of the consequences for MPI, but just to be clear!
Could you try this patch: https://github.com/JohannesBuchner/UltraNest/compare/feature-regionlinalgexceptions Perhaps you can try to retrigger the issue.
I'm now trying to run the code from a GitHub repo clone, as the patch above is more recent than 3.2.0.
I get the following error message:
python imaging/searches/parametric/initialization/UltraNest.py
Traceback (most recent call last):
File "imaging/searches/parametric/initialization/UltraNest.py", line 95, in <module>
result = search.fit(model=model, analysis=analysis)
File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/abstract_search.py", line 310, in fit
self._fit(model=model, analysis=analysis, log_likelihood_cap=log_likelihood_cap)
File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/nest/ultranest.py", line 154, in _fit
import ultranest
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/__init__.py", line 7, in <module>
from .integrator import NestedSampler, ReactiveNestedSampler, read_file
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 20, in <module>
from ultranest.mlfriends import MLFriends, AffineLayer, ScalingLayer, find_nearby, WrappingEllipsoid, RobustEllipsoidRegion
ModuleNotFoundError: No module named 'ultranest.mlfriends'
I have run the following command:
python setup.py install
Am I missing another step to build the .pyx files?
I think you are running the code from the repo (perhaps it is in your PYTHONPATH). In that case, you need python3 setup.py build_ext --inplace
That did it thanks! Will let you know how these runs go.
New error:
Traceback (most recent call last):
File "imaging/searches/parametric/initialization/UltraNest.py", line 95, in <module>
result = search.fit(model=model, analysis=analysis)
File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/abstract_search.py", line 310, in fit
self._fit(model=model, analysis=analysis, log_likelihood_cap=log_likelihood_cap)
File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/nest/ultranest.py", line 198, in _fit
self.sampler.run(
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 2133, in run
for result in self.run_iter(
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 2336, in run_iter
region_fresh = self._update_region(
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 1860, in _update_region
r, f = _update_region_bootstrap(nextregion, nbootstraps, minvol, self.comm if self.use_mpi else None, self.mpi_size)
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 382, in _update_region_bootstrap
if e is None:
UnboundLocalError: local variable 'e' referenced before assignment
Please try again, I updated the branch. Good to see you are reproducing it well :-)
I no longer get an error, but it would appear the fit stalls one way or another and stops progressing (or begins to progress extremely slowly). Below is an extract of the output:
DEBUG:ultranest:iteration=1610, ncalls=9715, logz=-183186.59, remainder_fraction=1.3804%, Lmin=-183157.35, Lmax=-183156.76
DEBUG:ultranest:iteration=1613, ncalls=9733, logz=-183186.59, remainder_fraction=1.3238%, Lmin=-183157.35, Lmax=-183156.64
DEBUG:ultranest:iteration=1615, ncalls=9745, logz=-183186.59, remainder_fraction=1.2934%, Lmin=-183157.34, Lmax=-183156.64
DEBUG:ultranest:iteration=1618, ncalls=9763, logz=-183186.59, remainder_fraction=1.2390%, Lmin=-183157.34, Lmax=-183156.64
DEBUG:ultranest:iteration=1620, ncalls=9775, logz=-183186.59, remainder_fraction=1.2017%, Lmin=-183157.32, Lmax=-183156.53
DEBUG:ultranest:iteration=1580, ncalls=9997, logz=-183186.61, remainder_fraction=2.4999%, Lmin=-183157.62, Lmax=-183156.16
DEBUG:ultranest:iteration=2795, ncalls=16860, logz=-183185.50, remainder_fraction=5.8355%, Lmin=-183142.87, Lmax=-183140.64
So, I am guessing we averted the error, but are now in some sort of infinite loop where the method can't update the proposals in a way that improves the sampling?
The initial 5000 ncalls or so run just like they did before the error, so I think this confirms that the behaviour is due to the error, albeit in a way that now suppresses the exception being raised.
I will experiment with different nsteps, but this behaviour seems to be general.
An identical rerun shows the same behaviour, but now it gets stuck at a logz around -216745:
DEBUG:ultranest:iteration=2893, ncalls=17409, logz=-216745.79, remainder_fraction=99.2916%, Lmin=-216676.78, Lmax=-216670.33
So I guess once it hits some sort of issue, the sampler gets stuck and stops making good proposals.
What does sampler.live_points_healthy say?
Can you dump out what the live points look like (sampler.region.u) and investigate if they are weird (e.g., linearly dependent)?
The efficiency printed to stdout should not go much below 1/nsteps.
You don't have region_filter enabled in the step sampler, do you?
If you look at the live points, perhaps you can pick two, try a linear interpolation between them, and plot what transform+likelihood say the current surface looks like. Some likelihoods have numerical noise; not sure if that is the case here as well.
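The interpolation diagnostic suggested above could be sketched like this; `transform` and `loglike` here are toy stand-ins for the user's actual prior transform and likelihood:

```python
# Sketch: evaluate transform+likelihood along the straight line between two
# live points in unit-cube space, to look for numerical noise in the surface.
import numpy as np

def transform(u):
    return 10.0 * u - 5.0  # toy prior transform (stand-in)

def loglike(theta):
    return -0.5 * np.sum(theta**2)  # toy smooth likelihood (stand-in)

u1 = np.array([0.4, 0.5])  # two live points in the unit cube
u2 = np.array([0.6, 0.5])
ts = np.linspace(0.0, 1.0, 101)
L = np.array([loglike(transform(u1 + t * (u2 - u1))) for t in ts])

# A smooth likelihood varies smoothly along the line; large jitter in the
# second differences between neighbouring points would indicate noise.
jitter = np.max(np.abs(np.diff(L, 2)))
print("max second difference:", jitter)
```

For a noisy likelihood, the resulting curve would look jagged rather than smooth, which would explain why the step sampler struggles to find acceptable proposals.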
Will explore your suggestions above; just a quick update. If I exit the search after 5000 ncalls (to output visualization and update the results) and then resume the run (which works fine when not using a step sampler), I get the following error:
Traceback (most recent call last):
File "imaging/searches/parametric/initialization/UltraNest.py", line 95, in <module>
result = search.fit(model=model, analysis=analysis)
File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/abstract_search.py", line 310, in fit
self._fit(model=model, analysis=analysis, log_likelihood_cap=log_likelihood_cap)
File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/nest/ultranest.py", line 198, in _fit
self.sampler.run(
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 2136, in run
for result in self.run_iter(
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 2339, in run_iter
region_fresh = self._update_region(
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 1773, in _update_region
_update_region_bootstrap(self.region, nbootstraps, minvol, self.comm if self.use_mpi else None, self.mpi_size)
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 388, in _update_region_bootstrap
raise e
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 365, in _update_region_bootstrap
r, f = region.compute_enlargement(
File "ultranest/mlfriends.pyx", line 853, in ultranest.mlfriends.MLFriends.compute_enlargement
File "<__array_function__ internals>", line 5, in inv
File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 545, in inv
ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 88, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
numpy.linalg.LinAlgError: Singular matrix
Yes, the first region building doesn't have the same try-catch protection at the moment.
Ok, so something is going very wrong. Here is what the graphic of the live points looks like once I think it's stuck in this loop:
centre_0 : -0.100000000| -0.064399111 * -0.064399104 |+0.100000000
centre_1 : +0.00000000| +0.09817423 * |+0.10000000
elliptical_comps_0: +0.00000000| +0.08632405 * +0.08632411 |+0.10000000
elliptical_comps_1: +0.0000000| +0.1385928 * +0.1385930 |+1.0000000
einstein_radius : +0.000003| +1.789529 * +1.789530 |+2.999997
elliptical_comps_0: +0.0000000| +0.1489596 * +0.1489606 |+0.1999996
elliptical_comps_1: +0.000000| +0.069198 * +0.069200 |+0.100000
centre_0 : +0.00000| +0.22859 * +0.22861 | +1.00000
centre_1 : -0.10000| -0.01899 * -0.01898 | +0.10000
elliptical_comps_0: +0.0000| +0.0731 * +0.0732 | +0.1000
elliptical_comps_1: -1.0000| -0.1448 * -0.1445 | +1.0000
intensity : +0.000| +0.274 * +0.275 | +1.000
effective_radius : +0.000| +1.304 * +1.314 | +10.000
sersic_index : +0.50| +1.30 ** +1.32 | +5.00
Essentially, all points are relocating to a very narrow volume of parameter space.
Before the weird change, this is what the distribution of points looked like:
centre_0 : -0.48| -0.09 * **** * * ** +0.23 | +0.48
centre_1 : +0.000| * ** *** ** * * +0.216 | +0.475
elliptical_comps_0: -1.0| -0.3 * *** * * * +0.4 | +1.0
elliptical_comps_1: -1.0| -0.7 * * **** * * ** +0.4 | +1.0
einstein_radius : +3.0e-06| * * * ** ****** * * +1.9e+00 | +3.0e+00
elliptical_comps_0: -0.2| * ** ******** *** **| +0.2
elliptical_comps_1: -0.2| -0.2 * ** * * ********* * +0.1 | +0.2
centre_0 : -1.0| -0.5 * * * ** ** ***** ***** * +0.5 | +1.0
centre_1 : -1.0| -0.2 ** ****** * ** *** +0.2 | +1.0
elliptical_comps_0: -1.0| -0.8 * * *** * * *********** ** * * +0.7 | +1.0
elliptical_comps_1: -1.0| -0.4 * ** ******** *** * ** * ***** +0.2 | +1.0
intensity : +0.000| * * * *** ********** * * *** ** * * * * * * * +0.575 | +1.000
effective_radius : +0.0| * ** ** * ** * *** * ** ** ***** * * * * ** * * * * * ** * +21.3 | +30.0
sersic_index : +0.5| * ****** *********** * ** ** * * * *** +4.2 | +5.0
Inspection of the live points confirms they have all gone to nearly identical parameter values:
**-0.06439910960150218**,0.0981742436119151,0.08632410301916939,0.13859289808496217,1.7895292422024234,0.14895996568522524,0.06919889247653405,0.22860720639473817,-0.018982604147621138,0.0731811673505227,-0.14466221406946267,0.27091193109117495,1.316757351895962,1.3216381892831708,-181748.83074813336,4.87352554297685,-181743.9572225904,0.013407300660035554
**-0.06439910948998101**,0.09817424476134377,0.08632409372582933,0.13859290560568494,1.789529760717818,0.1489600398479064,0.06919842616049288,0.2286075817334351,-0.018979280111244237,0.073205608524242,-0.14462487346758082,0.27166517316803207,1.3223810942201535,1.317392364083193,-181748.6251748058,4.863276722281564,-181743.76189808353,0.016467236027840757
**-0.06439910958163868**,0.09817423949921626,0.08632409916221573,0.13859289654727142,1.7895293482497112,0.14895997541108358,0.06919882187257725,0.22860595597436698,-0.01898379379550144,0.07318884633614044,-0.14465690414799035,0.27108737740804345,1.317561980179545,1.323496276690078,-181748.37947281572,4.871132797875134,-181743.50834001784,0.021053665997334266
-0.06439910958497416,0.09817423886452462,0.08632410293142254,0.1385928899429708,1.7895295716246569,0.1489600060866913,0.0691988160017456,0.22860685588931237,-0.018977178670060144,0.07319491512674682,-0.1446632309122282,0.27107607507666215,1.3199135722780464,1.3199808005992157,-181748.30047552608,4.871292917504033,-181743.42918260858,0.02278430658410807
-0.06439910926134053,0.09817423686196158,0.08632409909958223,0.13859289672965816,1.789529538172654,0.14896007517408377,0.06919885296840655,0.22860635908257937,-0.018978588582163277,0.07319686145819135,-0.1446438206926631,0.2710535983560723,1.3191513190634658,1.3179546109555353,-181748.18307232799,4.871587181742363,-181743.31148514623,0.025622610628496332
-0.06439911033573704,0.09817424028203765,0.08632409951277857,0.13859289908077344,1.7895294976978384,0.14895996329886396,0.06919877027838545,0.22860519902372653,-0.018980970783291105,0.07317988024779151,-0.1446666714141247,0.27098896907166814,1.3171535494595326,1.3202207214663955,-181748.18197815598,4.872472910802454,-181743.30950524518,0.025650661515323282
-0.06439910820793321,0.09817425202587972,0.08632411092042633,0.13859295472924235,1.7895297771451393,0.14896030828065848,0.06919835047496598,0.22860609237492285,-0.018981908435017993,0.07320599260773228,-0.14463659403866524,0.2717679978805083,1.322373857715808,1.3155427219194662,-181748.00860936329,4.861887817266632,-181743.146721546,0.030506450530194327
-0.06439910931520619,0.0981742429899657,0.08632409429889384,0.1385928999815098,1.7895298760652651,0.1489599861814822,0.06919862912319524,0.2286070425321011,-0.0189798931669384,0.07320175516834752,-0.14462188257872574,0.2717931725305268,1.3199316918199293,1.3192562810976645,-181745.63782159035,4.861539051085719,-181740.77628253927,0.3265966512047664
-0.06439910937211618,0.0981742453728004,0.08632412642251558,0.13859292474973925,1.7895297098991287,0.1489600742741914,0.0691981328440126,0.22860825224388395,-0.01897755603324009,0.07321188669532465,-0.1446405075508717,0.2716357827345963,1.3190350169327705,1.3144189141096292,-181745.23177312632,4.863687288063371,-181740.36808583824,0.4901808433186678
It would seem that at some point we suddenly replace all live points with an extremely narrow set of values, whenever the numerical issue that previously raised an exception occurs?
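A quick way to quantify the collapse seen in the dump above; this is an illustrative helper, not UltraNest API:

```python
# Sketch: flag parameters whose live-point spread has shrunk to (nearly)
# nothing relative to the typical parameter value, indicating collapse.
import numpy as np

def collapsed_fraction(u, rel_tol=1e-6):
    """Fraction of parameters whose live-point spread is below rel_tol,
    relative to the mean absolute parameter value."""
    std = u.std(axis=0)
    scale = np.abs(u).mean(axis=0) + 1e-300  # avoid division by zero
    return np.mean(std / scale < rel_tol)

rng = np.random.default_rng(1)
# Healthy live set: 400 points spread over 14 parameters.
healthy = rng.uniform(size=(400, 14))
# Stuck live set: one point duplicated with tiny numerical jitter.
stuck = np.tile(rng.uniform(size=14), (400, 1))
stuck = stuck * (1 + 1e-9 * rng.standard_normal((400, 14)))
print(collapsed_fraction(healthy))  # 0.0
print(collapsed_fraction(stuck))    # 1.0
```

On the live points listed above, which agree to roughly 9 significant figures, such a check would report essentially every parameter as collapsed.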
Inspecting the debug.log file reveals the following errors:
Traceback (most recent call last):
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 1863, in _update_region
r, f = _update_region_bootstrap(nextregion, nbootstraps, minvol, self.comm if self.use_mpi else None, self.mpi_size)
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 388, in _update_region_bootstrap
raise e
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 365, in _update_region_bootstrap
r, f = region.compute_enlargement(
File "ultranest/mlfriends.pyx", line 860, in ultranest.mlfriends.MLFriends.compute_enlargement
numpy.linalg.LinAlgError: Distances are not positive
Traceback (most recent call last):
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 1865, in _update_region
nextregion.create_ellipsoid(minvol=minvol)
File "ultranest/mlfriends.pyx", line 1026, in ultranest.mlfriends.MLFriends.create_ellipsoid
FloatingPointError: invalid value encountered in sqrt
Traceback (most recent call last):
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 1863, in _update_region
r, f = _update_region_bootstrap(nextregion, nbootstraps, minvol, self.comm if self.use_mpi else None, self.mpi_size)
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 388, in _update_region_bootstrap
raise e
File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 365, in _update_region_bootstrap
r, f = region.compute_enlargement(
File "ultranest/mlfriends.pyx", line 853, in ultranest.mlfriends.MLFriends.compute_enlargement
File "<__array_function__ internals>", line 5, in inv
File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 545, in inv
ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 88, in _raise_linalgerror_singular
raise LinAlgError("Singular matrix")
numpy.linalg.LinAlgError: Singular matrix
Still digging, but my feeling is that the errors above are essentially causing the method to converge on a very narrow range of parameter space (which has a much lower logz than the true solution).
It probably stems from issues with the likelihood function, but Dynesty and other samplers work OK for this problem.
Perhaps you can put a few prints in the step sampler code to see what is being proposed, what its likelihood was, and whether it was accepted. Then you should be able to see where it is running around.
The stepsampler can also store a log (log=open('stepsampler.log', 'w')) where it stores acceptance rates and other stuff (logstat and logstat_labels in the code). You can even plot it (stepsampler.plot('stepsamplerstats.pdf')).
It would be great if you can investigate why this happens, and perhaps we can recognise and circumvent such situations.
Digging further into this, but I haven't got any great insights yet.
The one thing I've noted is that if I plot the active physical points, it always ends up in a situation where multiple points have the same physical values:
[-3.08907767e-02 -1.61065791e-03 -3.22974652e-01 3.05938787e-01
1.83378711e+00 -3.87885334e-02 8.93126697e-03 1.81546880e-01
1.36110472e-01 4.92397997e-01 -1.52481447e-01 1.92147307e+01
4.73909575e+00 1.20995652e-02] # This is a live point
[-9.57800449e-02 -3.83242618e-02 -4.34228190e-01 -3.90178169e-01
2.24544749e+00 1.45616738e-01 -1.16783854e-01 -1.47338829e-01
1.81373990e-01 -1.33452392e-01 -2.06928418e-01 5.89996529e-01
1.26600550e+00 7.43064147e-01]
[-3.60315319e-03 8.13240141e-02 -3.98797747e-01 4.52679127e-02
2.95989926e+00 9.04596887e-02 9.61625401e-02 -1.29335419e-01
6.99180889e-02 2.35738067e-01 8.31830576e-01 7.76290474e+00
2.44269469e+00 4.81440154e-02]
**[-3.08907767e-02 -1.61065791e-03 -3.22974652e-01 3.05938787e-01
1.83378711e+00 -3.87885334e-02 8.93126697e-03 1.81546880e-01
1.36110472e-01 4.92397997e-01 -1.52481447e-01 1.92147307e+01
4.73909575e+00 1.20995652e-02]] # This is an identical live point
My guess is that the try/excepts are reusing samples one way or another, and that once this happens enough times I end up in a situation where all the live points go to the same physical sample.
This implies the step sampler made n steps and no proposal was accepted. Very strange if that happens with the slice samplers, because they shrink the interval until a new point is accepted.
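For reference, the shrinking behaviour mentioned here can be illustrated with a generic 1D slice-sampling step; this is textbook slice sampling, not UltraNest's actual implementation:

```python
# Sketch of a 1D slice-sampling step: the bracket around the current point
# shrinks after every rejection, so a new point above the likelihood
# threshold is (almost) always eventually accepted.
import numpy as np

def slice_step(x0, logl, threshold, width=1.0, rng=None, max_iter=1000):
    if rng is None:
        rng = np.random.default_rng()
    # Randomly position a bracket of the given width around x0.
    left = x0 - width * rng.uniform()
    right = left + width
    for _ in range(max_iter):
        x = rng.uniform(left, right)
        if logl(x) > threshold:
            return x  # accepted: the point lies above the slice threshold
        # Rejected: shrink the bracket towards the current point.
        if x < x0:
            left = x
        else:
            right = x
    raise RuntimeError("slice sampling failed to accept a point")

rng = np.random.default_rng(2)
logl = lambda x: -0.5 * x**2
x_new = slice_step(0.1, logl, threshold=-2.0, rng=rng)
print(logl(x_new) > -2.0)  # True: accepted points satisfy the threshold
```

If n such steps all fail to move, something is wrong with either the threshold being used or the likelihood evaluations themselves.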
Perhaps put a print in the two branches of this if: https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/stepsampler.py#L595 It should tell you the proposed point (unew, pnew) and its likelihood Lnew, and whether it was accepted or rejected.
Such a print will produce lots of output, but it should be insightful to stare at the part where there are very many rejects (or just before the exception happens) and understand what is happening there.
Alternatively, you can also put this print into adjust_accept.
You may also want to print self.scale (and, for the slice sampler, self.interval). That should tell you about the proposal the step sampler is considering.
Description
Around 10000 iterations into a model fit using the RegionMHSampler (nsteps=10) I get the errors below. I have also seen Error 2 with the CubeMHSampler. There are no issues with the normal sampler or the slice samplers, so it is something specific to the region samplers.
This probably indicates some undesirable behaviour in my LH function, but it would be very tricky to fix by changing the LH function. I have tried the options that 'improve' my LH function, but to no avail.
We don't see this issue for MCMC / MultiNest, but I think the same issue comes up with Dynesty. We managed to hack a fix for Dynesty by restarting the algorithm via a try/except (that doesn't work so well here, because the backend means it reuses the results).
What I Did