JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/
Other
143 stars 30 forks source link

LinAlgError: degenerate live points / singular matrix with Metropolis sampler #25

Open Jammy2211 opened 3 years ago

Jammy2211 commented 3 years ago

Description

Around 10000 iterations into a model fit using the RegionMHSampler (nsteps=10) I get the errors below. I also have seen Error 2 for the CubeMHSampler. No issues for normal sampler or slice samplers, so something specific to a Region sampler.

This probably indicates some undeseriable behaviour in my LH function, but would be very tricky to fix via changing the LH function. I have tried the options that 'improve' my LH function, but to no avail.

We don't see this issue for MCMC / MultiNest, but I think the same issue comes up with Dynesty. We managed to hack a fix for Dynesty, by restarting the algorithm via a try and except (doesn't work so well here due to the backend meaning it reuses the results).

What I Did

**Error 1:**

DEBUG:ultranest:iteration=937, ncalls=9555, logz=1214.48, remainder_fraction=100.0000%, Lmin=1235.84, Lmax=1269.88
DEBUG:ultranest:iteration=940, ncalls=9555, logz=1214.86, remainder_fraction=100.0000%, Lmin=1236.39, Lmax=1269.88
DEBUG:ultranest:iteration=945, ncalls=9555, logz=1215.47, remainder_fraction=100.0000%, Lmin=1236.94, Lmax=1269.88

Mono-modal Volume: ~exp(-71.40)   Expected Volume: exp(-19.75) Quality: correlation length: 153 (+)

   positive degeneracy between centre_1 and centre_0: rho=0.76
   positive degeneracy between elliptical_comps_0 and centre_0: rho=0.91
   positive degeneracy between elliptical_comps_0 and centre_1: rho=0.78
   negative degeneracy between elliptical_comps_0 and centre_1: rho=-0.88
   negative degeneracy between elliptical_comps_0 and elliptical_comps_0: rho=-0.79
   negative degeneracy between elliptical_comps_1 and elliptical_comps_0: rho=-0.84
   positive degeneracy between elliptical_comps_1 and elliptical_comps_0: rho=0.77
   negative degeneracy between sersic_index and centre_1: rho=-0.89
   negative degeneracy between sersic_index and elliptical_comps_0: rho=-0.82
   negative degeneracy between sersic_index and elliptical_comps_0: rho=-0.77
   positive degeneracy between sersic_index and elliptical_comps_0: rho=0.85
   positive degeneracy between sersic_index and elliptical_comps_1: rho=0.91
centre_0          :     +0.00|                        +0.09  * ***********  +0.12                                                                                                                    |    +0.48
centre_1          :     +0.00|                                                                                      +0.06  *  **              * *   *********** * ***   *  +0.08                     |    +0.10
elliptical_comps_0:      -1.0|                                                         -0.2  **  *****  -0.1                                                                                         |     +1.0
elliptical_comps_1:     +0.00|                                                         +0.04  **              *   *   *** ** **** *****  *****   *  +0.07                                            |    +0.10
einstein_radius   :     +0.00|                                                          +1.18  *****  +1.24                                                                                          |    +3.00
elliptical_comps_0:     +0.00|                                                                                              +0.06  *   **  * * ** **************    **   * *  +0.08                  |    +0.10
elliptical_comps_1:     -0.20|               -0.15  ********  -0.13                                                                                                                                  |    +0.20
centre_0          :     +0.00|         +0.10  ***  +0.11                                                                                                                                             |    +1.00
centre_1          :     -1.00|                                                -0.34  * *****  -0.27                                                                                                  |    +1.00
elliptical_comps_0:      -1.0|                                            -0.4  ******  -0.3                                                                                                         |     +1.0
elliptical_comps_1:     +0.00|       +0.09  ***********     **  +0.19                                                                                                                                |    +1.00
intensity         :   +0.0000|                    +0.0018  ** ********  +0.0024                                                                                                                      |  +0.0100
effective_radius  :      +0.0|       +0.8  *********  +1.3                                                                                                                                           |    +10.0
sersic_index      :      +0.5|                                                                           +2.7  ************      ***  +3.2                                                           |     +5.0

DEBUG:ultranest:iteration=950, ncalls=9555, logz=1216.06, remainder_fraction=100.0000%, Lmin=1237.90, Lmax=1269.88
Traceback (most recent call last):
  File "imaging/searches/parametric/initialization/UltraNest.py", line 89, in <module>
    result = search.fit(model=model, analysis=analysis)
  File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/abstract_search.py", line 299, in fit
    self._fit(model=model, analysis=analysis, log_likelihood_cap=log_likelihood_cap)
  File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/nest/ultranest.py", line 185, in _fit
    self.sampler.run(
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/ultranest/integrator.py", line 2124, in run
    for result in self.run_iter(
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/ultranest/integrator.py", line 2323, in run_iter
    region_fresh = self._update_region(
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/ultranest/integrator.py", line 1802, in _update_region
    _update_region_bootstrap(self.region, nbootstraps, minvol, self.comm if self.use_mpi else None, self.mpi_size)
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/ultranest/integrator.py", line 347, in _update_region_bootstrap
    r, f = region.compute_enlargement(
  File "ultranest/mlfriends.pyx", line 721, in ultranest.mlfriends.MLFriends.compute_enlargement
  File "<__array_function__ internals>", line 5, in inv
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 545, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 88, in _raise_linalgerror_singular
    raise LinAlgError("Singular matrix")
numpy.linalg.LinAlgError: Singular matrix

**Error 2:**

DEBUG:ultranest:iteration=559, ncalls=5543, logz=895.65, remainder_fraction=100.0000%, Lmin=911.25, Lmax=937.97
DEBUG:ultranest:iteration=560, ncalls=5553, logz=895.86, remainder_fraction=100.0000%, Lmin=911.44, Lmax=937.97
DEBUG:ultranest:iteration=562, ncalls=5573, logz=896.38, remainder_fraction=100.0000%, Lmin=912.40, Lmax=937.97
DEBUG:ultranest:iteration=564, ncalls=5593, logz=896.97, remainder_fraction=100.0000%, Lmin=912.84, Lmax=937.97
DEBUG:ultranest:iteration=565, ncalls=5603, logz=897.21, remainder_fraction=100.0000%, Lmin=912.92, Lmax=939.97
Traceback (most recent call last):
  File "imaging/searches/parametric/initialization/UltraNest.py", line 90, in <module>
    result = search.fit(model=model, analysis=analysis)
  File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/abstract_search.py", line 299, in fit
    self._fit(model=model, analysis=analysis, log_likelihood_cap=log_likelihood_cap)
  File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/nest/ultranest.py", line 185, in _fit
    self.sampler.run(
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/ultranest/integrator.py", line 2124, in run
    for result in self.run_iter(
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/ultranest/integrator.py", line 2323, in run_iter
    region_fresh = self._update_region(
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/ultranest/integrator.py", line 1869, in _update_region
    r, f = _update_region_bootstrap(nextregion, nbootstraps, minvol, self.comm if self.use_mpi else None, self.mpi_size)
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/ultranest/integrator.py", line 347, in _update_region_bootstrap
    r, f = region.compute_enlargement(
  File "ultranest/mlfriends.pyx", line 727, in ultranest.mlfriends.MLFriends.compute_enlargement
AssertionError: (-1.916767446404755e-21, 23, 15, array([[-4.44208167e-03,  2.92142064e-01, -2.65329781e-01,
        -8.44868962e-02, -4.42092643e-02, -2.19140765e-02,
        -6.99844548e-03,  7.87475158e-02, -1.50341519e-01,
        -9.25057776e-02,  1.80759457e-02,  8.33992292e-03,
Jammy2211 commented 3 years ago

Just got it 250000 samples into a CubeSliceSampler

JohannesBuchner commented 3 years ago

If the proposal is bad (too few steps, no jumps are accepted), your live points can thin out to be linearly dependent. Then you get a numerical problem. I think you need to run with more steps.

Jammy2211 commented 3 years ago

I got the error with nsteps=100, but runs with 200 and 300 seem to be going ok so far.

Its probably worth me motivating why we are keen to have nsteps so low. Basically, we have found that random walk nested sampling / step based sampling is a very efficient method for our model-fitting problem. Using Dynesty, I did runs using a range of step values (5, 10, 15, 25, etc) and found that we got equivalent results, give or take, for all of the runs, but obviously the run with 5 steps was roughly twice as fast as 10, x5 as fast as 25, etc.

(I am sure a detailed inspection of the statistical properties of the results would show there are difference, but the log evidence values / parameter estimates did not vary much and no where near the level of systematic uncertanties that enter elsewhere).

Computationally, we are already pushing the limit of our resources, so getting the model-fit to produce a reasonable result fast is the primary concern. If UltraNest's step sampling results were on-par with what we currently get from Dynesty, this would be fine, as its the MPI support that is making me eager to make the swap to UltraNest. And before the crash, they look like they are going at a similar pace to one another, which is promising.

So, I guess my question is, do you think there would be a way to 'hack' a fix? If the error came up could the walk keep going until the proposal is 'good'? Reading the UltraNest docs, this seems to go against the philosophy of making sure the results are robust at the expense of run-time, but statistically question results is better than no results, right? :P

JohannesBuchner commented 3 years ago

Actually, looking back at your error stack trace, I think the issue is elsewhere.

update_region tries to compute a new region. Sometimes this fails for numerical reasons (e.g., cannot invert the sample covariance). This is supposed to be okay, and the old region is just kept.

This is why this line https://github.com/JohannesBuchner/UltraNest/find/master#L1850 is wrapped in a try-catch block, accepting FloatingPointError and LinAlgError.

But in the region.compute_enlargement function here: https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/mlfriends.pyx#L814 There are a few asserts, which check that everything is kosher.

I think you hit the assert f > 0 here.

These should probably not raise AssertErrors, but LinAlgErrors.

But there is potentially another problem here https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/integrator.py#L353 If the code is run with MPI, and an exception is raised, the program does not go into the MPI block. This may freeze the program, because the other processors wait for communication.

Perhaps it is safer for that function to catch any AssertErrors of compute_enlargement and set r and f to NaNs.

Jammy2211 commented 3 years ago

Sounds promising, and like it can be fixed :).

Just FYI, these runs are not using MPI (I haven't got that far yet). Just simple serial runs on my laptop before moving on to the HPC. Your response sounds like you know this, and are just thinking of the conseqeunces of MPI, but just to be clear!

JohannesBuchner commented 3 years ago

Could you try this patch: https://github.com/JohannesBuchner/UltraNest/compare/feature-regionlinalgexceptions Perhaps you can try to retrigger the issue.

Jammy2211 commented 3 years ago

Im now trying to run the code from a GitHub repo clone, as the patch above is more recent than 3.2.0.

I get the following error message:

python imaging/searches/parametric/initialization/UltraNest.py
Traceback (most recent call last):
  File "imaging/searches/parametric/initialization/UltraNest.py", line 95, in <module>
    result = search.fit(model=model, analysis=analysis)
  File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/abstract_search.py", line 310, in fit
    self._fit(model=model, analysis=analysis, log_likelihood_cap=log_likelihood_cap)
  File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/nest/ultranest.py", line 154, in _fit
    import ultranest
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/__init__.py", line 7, in <module>
    from .integrator import NestedSampler, ReactiveNestedSampler, read_file
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 20, in <module>
    from ultranest.mlfriends import MLFriends, AffineLayer, ScalingLayer, find_nearby, WrappingEllipsoid, RobustEllipsoidRegion
ModuleNotFoundError: No module named 'ultranest.mlfriends'

I have run the following command:

python setup.py install

Am I missing another step to build the .pyx files?

JohannesBuchner commented 3 years ago

I think you are running the code from the repo (perhaps it is in your PYTHONPATH). In that case, you need python3 setup.py build_ext --inplace

Jammy2211 commented 3 years ago

That did it thanks! Will let you know how these runs go.

Jammy2211 commented 3 years ago

New error:

Traceback (most recent call last):
  File "imaging/searches/parametric/initialization/UltraNest.py", line 95, in <module>
    result = search.fit(model=model, analysis=analysis)
  File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/abstract_search.py", line 310, in fit
    self._fit(model=model, analysis=analysis, log_likelihood_cap=log_likelihood_cap)
  File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/nest/ultranest.py", line 198, in _fit
    self.sampler.run(
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 2133, in run
    for result in self.run_iter(
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 2336, in run_iter
    region_fresh = self._update_region(
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 1860, in _update_region
    r, f = _update_region_bootstrap(nextregion, nbootstraps, minvol, self.comm if self.use_mpi else None, self.mpi_size)
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 382, in _update_region_bootstrap
    if e is None:
UnboundLocalError: local variable 'e' referenced before assignment
JohannesBuchner commented 3 years ago

Please try again, I updated the branch. Good to see you are reproducing it well :-)

Jammy2211 commented 3 years ago

I no longer get an error, but it would appear the fit stalls one way or another and stops progress (or begins to progress extremely slowly). Below is an extract of the outputs:

DEBUG:ultranest:iteration=1610, ncalls=9715, logz=-183186.59, remainder_fraction=1.3804%, Lmin=-183157.35, Lmax=-183156.76
DEBUG:ultranest:iteration=1613, ncalls=9733, logz=-183186.59, remainder_fraction=1.3238%, Lmin=-183157.35, Lmax=-183156.64
DEBUG:ultranest:iteration=1615, ncalls=9745, logz=-183186.59, remainder_fraction=1.2934%, Lmin=-183157.34, Lmax=-183156.64
DEBUG:ultranest:iteration=1618, ncalls=9763, logz=-183186.59, remainder_fraction=1.2390%, Lmin=-183157.34, Lmax=-183156.64
DEBUG:ultranest:iteration=1620, ncalls=9775, logz=-183186.59, remainder_fraction=1.2017%, Lmin=-183157.32, Lmax=-183156.53
DEBUG:ultranest:iteration=1580, ncalls=9997, logz=-183186.61, remainder_fraction=2.4999%, Lmin=-183157.62, Lmax=-183156.16
DEBUG:ultranest:iteration=2795, ncalls=16860, logz=-183185.50, remainder_fraction=5.8355%, Lmin=-183142.87, Lmax=-183140.64

So, I am guessing we averted the error, but are now in some sort of infinite loop where the method can't update the proposals in a way that improves the sampling?

The initial 5000 ncalls or so run just like they did before the error, so I think this confirms that the behaviour is due to the error, albeit in a way that now surpresses the Exception being called.

I will expirment with different nsteps but seems like this behaviour is general.

Jammy2211 commented 3 years ago

An identical rerun shows the same behaviour, but now it gets stuck at a logz around -216745:

DEBUG:ultranest:iteration=2893, ncalls=17409, logz=-216745.79, remainder_fraction=99.2916%, Lmin=-216676.78, Lmax=-216670.33

So I guess once it hits some sort of issue the sampler gets stuck making good proposals.

JohannesBuchner commented 3 years ago

What does sampler.live_points_healthy say?

Can you dump out what the live points look like (sampler.region.u) and investigate if they are weird (e.g., linearly dependent)?

The efficiency printed to stdout should not go much below 1/nsteps.

You don't have region_filter enabled in the step sampler, do you?

If you look at the live points, perhaps you can pick to and try a linear interpolation between them and plot what transform+likelihood say that the current surface looks like. Some likelihoods have numerical noise, not sure if that is the case here as well.

Jammy2211 commented 3 years ago

Will explore your suggestions above, just a quick update. If I exit the search after 5000 ncalls (to output visualization and update the results) and then resume the run (which works fine when not using a stepsampler) I get the following error:

Traceback (most recent call last):
  File "imaging/searches/parametric/initialization/UltraNest.py", line 95, in <module>
    result = search.fit(model=model, analysis=analysis)
  File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/abstract_search.py", line 310, in fit
    self._fit(model=model, analysis=analysis, log_likelihood_cap=log_likelihood_cap)
  File "/mnt/c/Users/Jammy/Code/PyAuto/PyAutoFit/autofit/non_linear/nest/ultranest.py", line 198, in _fit
    self.sampler.run(
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 2136, in run
    for result in self.run_iter(
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 2339, in run_iter
    region_fresh = self._update_region(
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 1773, in _update_region
    _update_region_bootstrap(self.region, nbootstraps, minvol, self.comm if self.use_mpi else None, self.mpi_size)
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 388, in _update_region_bootstrap
    raise e
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 365, in _update_region_bootstrap
    r, f = region.compute_enlargement(
  File "ultranest/mlfriends.pyx", line 853, in ultranest.mlfriends.MLFriends.compute_enlargement
  File "<__array_function__ internals>", line 5, in inv
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 545, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 88, in _raise_linalgerror_singular
    raise LinAlgError("Singular matrix")
numpy.linalg.LinAlgError: Singular matrix
JohannesBuchner commented 3 years ago

Yes, the first region building doesn't have the same try-catch protection at the moment.

Jammy2211 commented 3 years ago

Ok, so something is going very wrong. Here is what the graphic of the live points looks like once I think its stuck in this loop:

centre_0          : -0.100000000|               -0.064399111  *  -0.064399104                                                                                                                           |+0.100000000
centre_1          : +0.00000000|                                                                                                                                                      +0.09817423  *   |+0.10000000
elliptical_comps_0: +0.00000000|                                                                                                                                   +0.08632405  *  +0.08632411         |+0.10000000
elliptical_comps_1: +0.0000000|           +0.1385928  *  +0.1385930                                                                                                                                   |+1.0000000
einstein_radius   : +0.000003|                                                                                        +1.789529  *  +1.789530                                                        |+2.999997
elliptical_comps_0: +0.0000000|                                                                                                                +0.1489596  *  +0.1489606                              |+0.1999996
elliptical_comps_1: +0.000000|                                                                                                        +0.069198  *  +0.069200                                        |+0.100000
centre_0          :  +0.00000|                            +0.22859  *  +0.22861                                                                                                                      | +1.00000
centre_1          :  -0.10000|                                                         -0.01899  *  -0.01898                                                                                         | +0.10000
elliptical_comps_0:   +0.0000|                                                                                                                 +0.0731  *  +0.0732                                   |  +0.1000
elliptical_comps_1:   -1.0000|                                                              -0.1448  *  -0.1445                                                                                      |  +1.0000
intensity         :    +0.000|                                     +0.274  *  +0.275                                                                                                                 |   +1.000
effective_radius  :    +0.000|             +1.304  *  +1.314                                                                                                                                         |  +10.000
sersic_index      :     +0.50|                      +1.30  **  +1.32                                                                                                                                 |    +5.00

Essentially, all points are relocating to a very narrow volumne of parameter space.

Before the weird change, this is what the distribution of points look like:

centre_0          :     -0.48|                                                            -0.09  * ****                     *                         *  **  +0.23                                   |    +0.48
centre_1          :    +0.000|   *                         **   *** ** *                                 *  +0.216                                                                                   |   +0.475
elliptical_comps_0:      -1.0|                                                    -0.3  *                             ***  *                 * *  +0.4                                               |     +1.0
elliptical_comps_1:      -1.0|                   -0.7  *                                                       *          ****    *          *    **  +0.4                                           |     +1.0
einstein_radius   :  +3.0e-06|     *                                                       *                 *             **   ****** *  *  +1.9e+00                                                | +3.0e+00
elliptical_comps_0:      -0.2|          *                                                                         **                                                         ********          *** **|     +0.2
elliptical_comps_1:      -0.2|            -0.2  *                   ** *                                   *                             *********                            *  +0.1                |     +0.2
centre_0          :      -1.0|                                 -0.5  *                                      *  *         **   ** *****     *****            *  +0.5                                  |     +1.0
centre_1          :      -1.0|                                                               -0.2  **      ****** *       **    ***  +0.2                                                            |     +1.0
elliptical_comps_0:      -1.0|         -0.8  *                                 *            ***                   * * ***********              **                *      *  +0.7                      |     +1.0
elliptical_comps_1:      -1.0|                                            -0.4  *             **   ********   ***    *  ** *  *****  +0.2                                                            |     +1.0
intensity         :    +0.000| *        * * *** ********** * * ***  ** * *      *       * *          *                        *  +0.575                                                              |   +1.000
effective_radius  :      +0.0|       *  ** **     * ** *   *** * **  ** *****  * * *         *   ** *                                    * * * * ** *  +21.3                                         |    +30.0
sersic_index      :      +0.5|  * ****** *********** *        **    **                                                                    *      *    *              ***  +4.2                       |     +5.0

Inspection of the live points confirms they have all gone to nearly identical parameter values:

**-0.06439910960150218**,0.0981742436119151,0.08632410301916939,0.13859289808496217,1.7895292422024234,0.14895996568522524,0.06919889247653405,0.22860720639473817,-0.018982604147621138,0.0731811673505227,-0.14466221406946267,0.27091193109117495,1.316757351895962,1.3216381892831708,-181748.83074813336,4.87352554297685,-181743.9572225904,0.013407300660035554
**-0.06439910948998101**,0.09817424476134377,0.08632409372582933,0.13859290560568494,1.789529760717818,0.1489600398479064,0.06919842616049288,0.2286075817334351,-0.018979280111244237,0.073205608524242,-0.14462487346758082,0.27166517316803207,1.3223810942201535,1.317392364083193,-181748.6251748058,4.863276722281564,-181743.76189808353,0.016467236027840757
**-0.06439910958163868**,0.09817423949921626,0.08632409916221573,0.13859289654727142,1.7895293482497112,0.14895997541108358,0.06919882187257725,0.22860595597436698,-0.01898379379550144,0.07318884633614044,-0.14465690414799035,0.27108737740804345,1.317561980179545,1.323496276690078,-181748.37947281572,4.871132797875134,-181743.50834001784,0.021053665997334266
-0.06439910958497416,0.09817423886452462,0.08632410293142254,0.1385928899429708,1.7895295716246569,0.1489600060866913,0.0691988160017456,0.22860685588931237,-0.018977178670060144,0.07319491512674682,-0.1446632309122282,0.27107607507666215,1.3199135722780464,1.3199808005992157,-181748.30047552608,4.871292917504033,-181743.42918260858,0.02278430658410807
-0.06439910926134053,0.09817423686196158,0.08632409909958223,0.13859289672965816,1.789529538172654,0.14896007517408377,0.06919885296840655,0.22860635908257937,-0.018978588582163277,0.07319686145819135,-0.1446438206926631,0.2710535983560723,1.3191513190634658,1.3179546109555353,-181748.18307232799,4.871587181742363,-181743.31148514623,0.025622610628496332
-0.06439911033573704,0.09817424028203765,0.08632409951277857,0.13859289908077344,1.7895294976978384,0.14895996329886396,0.06919877027838545,0.22860519902372653,-0.018980970783291105,0.07317988024779151,-0.1446666714141247,0.27098896907166814,1.3171535494595326,1.3202207214663955,-181748.18197815598,4.872472910802454,-181743.30950524518,0.025650661515323282
-0.06439910820793321,0.09817425202587972,0.08632411092042633,0.13859295472924235,1.7895297771451393,0.14896030828065848,0.06919835047496598,0.22860609237492285,-0.018981908435017993,0.07320599260773228,-0.14463659403866524,0.2717679978805083,1.322373857715808,1.3155427219194662,-181748.00860936329,4.861887817266632,-181743.146721546,0.030506450530194327
-0.06439910931520619,0.0981742429899657,0.08632409429889384,0.1385928999815098,1.7895298760652651,0.1489599861814822,0.06919862912319524,0.2286070425321011,-0.0189798931669384,0.07320175516834752,-0.14462188257872574,0.2717931725305268,1.3199316918199293,1.3192562810976645,-181745.63782159035,4.861539051085719,-181740.77628253927,0.3265966512047664
-0.06439910937211618,0.0981742453728004,0.08632412642251558,0.13859292474973925,1.7895297098991287,0.1489600742741914,0.0691981328440126,0.22860825224388395,-0.01897755603324009,0.07321188669532465,-0.1446405075508717,0.2716357827345963,1.3190350169327705,1.3144189141096292,-181745.23177312632,4.863687288063371,-181740.36808583824,0.4901808433186678

It would seem that we are at some point suddenly replacing all live points with an extremely narrow set of values, whenever the numerical issue that was raising an exception previously occurs?

Jammy2211 commented 3 years ago

Inspecting the debug.log file reveals the following errors:

Traceback (most recent call last):
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 1863, in _update_region
    r, f = _update_region_bootstrap(nextregion, nbootstraps, minvol, self.comm if self.use_mpi else None, self.mpi_size)
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 388, in _update_region_bootstrap
    raise e
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 365, in _update_region_bootstrap
    r, f = region.compute_enlargement(
  File "ultranest/mlfriends.pyx", line 860, in ultranest.mlfriends.MLFriends.compute_enlargement
numpy.linalg.LinAlgError: Distances are not positive
Traceback (most recent call last):
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 1865, in _update_region
    nextregion.create_ellipsoid(minvol=minvol)
  File "ultranest/mlfriends.pyx", line 1026, in ultranest.mlfriends.MLFriends.create_ellipsoid
FloatingPointError: invalid value encountered in sqrt
Traceback (most recent call last):
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 1863, in _update_region
    r, f = _update_region_bootstrap(nextregion, nbootstraps, minvol, self.comm if self.use_mpi else None, self.mpi_size)
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 388, in _update_region_bootstrap
    raise e
  File "/mnt/c/Users/Jammy/Code/PyAuto/UltraNest/ultranest/integrator.py", line 365, in _update_region_bootstrap
    r, f = region.compute_enlargement(
  File "ultranest/mlfriends.pyx", line 853, in ultranest.mlfriends.MLFriends.compute_enlargement
  File "<__array_function__ internals>", line 5, in inv
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 545, in inv
    ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
  File "/home/jammy/venvs/PyAuto/lib/python3.8/site-packages/numpy/linalg/linalg.py", line 88, in _raise_linalgerror_singular
    raise LinAlgError("Singular matrix")
numpy.linalg.LinAlgError: Singular matrix

Still digging, but my feeling is that the errors above are essentially causing the method to converge on a very narrow range of parameter space (which has a much lower logz than the true solution).

It probably stems from issues with the likelihood function, but Dynesty and other samplers work OK for this problem.

JohannesBuchner commented 3 years ago

Perhaps you can put a few prints in the step sampler code to see what is being proposed, what its likelihood was, and whether it was accepted. Then you should be able to see where it is running around.

The stepsampler can also store a log (log=open('stepsampler.log', 'w')) where it stores acceptance rates and other stuff (logstat and logstat_labels in the code). You can even plot it (stepsampler.plot('stepsamplerstats.pdf')).

It would be great if you can investigate why this happens, and perhaps we can recognise and circumvent such situations.

Jammy2211 commented 3 years ago

Digging further into this but not got any super great insights yet.

The one thing I've noted is that if I plot the active physical points, it is always ends up i a situation where multiple points have the same physical values:

 [-3.08907767e-02 -1.61065791e-03 -3.22974652e-01  3.05938787e-01
   1.83378711e+00 -3.87885334e-02  8.93126697e-03  1.81546880e-01
   1.36110472e-01  4.92397997e-01 -1.52481447e-01  1.92147307e+01
   4.73909575e+00  1.20995652e-02] # This is a live point
 [-9.57800449e-02 -3.83242618e-02 -4.34228190e-01 -3.90178169e-01
   2.24544749e+00  1.45616738e-01 -1.16783854e-01 -1.47338829e-01
   1.81373990e-01 -1.33452392e-01 -2.06928418e-01  5.89996529e-01
   1.26600550e+00  7.43064147e-01]
 [-3.60315319e-03  8.13240141e-02 -3.98797747e-01  4.52679127e-02
   2.95989926e+00  9.04596887e-02  9.61625401e-02 -1.29335419e-01
   6.99180889e-02  2.35738067e-01  8.31830576e-01  7.76290474e+00
   2.44269469e+00  4.81440154e-02]
 **[-3.08907767e-02 -1.61065791e-03 -3.22974652e-01  3.05938787e-01
   1.83378711e+00 -3.87885334e-02  8.93126697e-03  1.81546880e-01
   1.36110472e-01  4.92397997e-01 -1.52481447e-01  1.92147307e+01
   4.73909575e+00  1.20995652e-02]] # This is an identical live point

My guess is that the try / except's are reusing samples one way or another, and that once this happens enough time I end up in a situation where all the live points go to the same physical sample.

JohannesBuchner commented 3 years ago

This implies the stepsampler made n steps and no proposal was accepted. Very strange if that happens with the slice samplers, because the shrink the interval until a new point was accepted.

JohannesBuchner commented 3 years ago

Perhaps put a print in the two branches of this if: https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/stepsampler.py#L595 It should tell you the proposed point (unew, pnew) and its likelihood Lnew, and whether it was accepted or rejected.

Such a print will produce lots of output, but it should be insightful to stare at the part where there are very many rejects (or just before where the Exception happens) and understand what is happening there.

Alternatively, you can also put this print into adjust_accept.

You may also want to print self.scale (and for the slice sampler self.interval). That should tell you about the proposal the stepsampler is considering.