DE freezing during the run?

fzeiser commented 3 years ago

I am confused about the output of a run that I started. I've tried this type of job and it fails repeatably (or seems to fail). I will attach some debugging information, but it might be easiest if we could schedule a meeting @anderkve :pray: ?

I don't see a descriptive error message, but from the GAMBIT log, it seems like there might be a DMI communication problem. If you look a the timestamps, the job starts 11 April, 5 am and is killed 13 April, 5 am. However, the problem occurs already ~6 am on April 11th.

From stdout of process 0:

Starting GAMBIT
----------
Running in MPI-parallel mode with 32 processes
----------
Running with 1 OpenMP threads per MPI process (set by the environment variable OMP_NUM_THREADS).
YAML file: /cluster/projects/nn9464k/progs/gambit_np/submit/../yaml_files/NuclearBit_162Dy_43_1.yaml
Initialising logger...  log_debug_messages = false; log messages tagged as 'Debug' will NOT be logged
Rank 0 resume flag? 0
Resolving dependencies and backend requirements.  Hang tight...
...done!
ScannerBit is waiting for all MPI processes to join the scan...
  All processes ready!
Loading Diver differential evolution plugin for ScannerBit.
Starting Diver run...

stderr:

Starting scan.
slurmstepd: error: *** STEP 2404626.0 ON c33-8 CANCELLED AT 2021-04-13T05:07:27 DUE TO TIME LIMIT ***

default.log_0:

(Sun Apr 11 05:07:21 2021)(-35.4227 [s])(Rank 0)[Default,Backends][Warning][Non-
fatal]:
[...] <-- warning messages from missing backends

[...]

--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.28 [s])(Rank 0)[Default]:
Total lnL: -12086.2
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.4 [s])(Rank 0)[Default,NuclearBit]:
gledeli.interface INFO: This point will get a loglike of -12086.201029010668

--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.42 [s])(Rank 0)[Default]:
Total lnL: -10945.7
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.42 [s])(Rank 0)[Default,Core][Info]:
Received EMERGENCY shutdown message from process with rank 29
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.42 [s])(Rank 0)[Default]:
Shutdown is in progress; emergency=1 (loop=0)
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.42 [s])(Rank 0)[Default]:

--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.53 [s])(Rank 0)[Default]:
Received emergency shutdown command via MPI! Terminating run.
GAMBIT has shutdown due to an error on another process.

--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.53 [s])(Rank 0)[Default,Utilities][Info]:
rank 0: Entered BarrierWithTimeout (with tag 6666)
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.53 [s])(Rank 0)[Default,Utilities][Info]:
Communicator 'COMM_WORLD' received 1 messages with tag 6666. Only the last of these will be readable from the output buffer, the rest were discarded.
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.53 [s])(Rank 0)[Default,Utilities][Info]:
rank 0: Process 5 has entered BarrierWithTimeout (with tag 6666). Resetting timeout.
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.53 [s])(Rank 0)[Default,Utilities][Info]:
Communicator 'COMM_WORLD' received 1 messages with tag 6666. Only the last of these will be readable from the output buffer, the rest were discarded.

[...] <-- same message from all other processes

--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.63 [s])(Rank 0)[Default,Utilities][Info]:
rank 0: Process 22 has entered BarrierWithTimeout (with tag 6666). Resetting timeout.
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.63 [s])(Rank 0)[Default,Utilities][Info]:
Communicator 'COMM_WORLD' received 1 messages with tag 6666. Only the last of these will be readable from the output buffer, the rest were discarded.
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.63 [s])(Rank 0)[Default,Utilities][Info]:
rank 0: Process 26 has entered BarrierWithTimeout (with tag 6666). Resetting timeout.
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:15 2021)(5238.63 [s])(Rank 0)[Default,Utilities][Info]:
rank 0: sleeping... (total timeout = 10000ms; sleeptime = 100ms)
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:16 2021)(5238.73 [s])(Rank 0)[Default,Utilities][Info]:
rank 0: time_waited = 100ms (1%% of time allowed). True time waited is 202ms.
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:16 2021)(5238.73 [s])(Rank 0)[Default,Utilities][Info]:
rank 0: Synchronisation succeeded in BarrierWithTimeout (tag=6666)!
--<>--<>--<>--<>--<>--<>--<>--
(Sun Apr 11 06:35:16 2021)(5238.73 [s])(Rank 0)[Default,Core][Info]:
NO_MORE_MESSAGES code broadcast to all processes
--<>--<>--<>--<>--<>--<>--<>--

fzeiser commented 3 years ago

Out of curiosity I checked whether I could read the file with h5dump. That seems to work. But I can't read it in python:

[fabiobz@login-3.FRAM /cluster/projects/nn9464k/progs/gambit_np/Backends/installed/gledeli/1.0/plotting]$ python3 ../../../../../submit/runs/NuclearBit_162Dy_43_1/samples/results.hdf5
  File "../../../../../submit/runs/NuclearBit_162Dy_43_1/samples/results.hdf5", line 1
SyntaxError: Non-UTF-8 code starting with '\x89' in file ../../../../../submit/runs/NuclearBit_162Dy_43_1/samples/results.hdf5 on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

This error does not appear for another, much shorter, test run I had on the same machine.

anderkve commented 3 years ago

Given this message Received EMERGENCY shutdown message from process with rank 29 there might be a sensible error message in the default.log_29 file.

But a meeting might indeed be the easiest -- I'll send you an email right away :)

fzeiser commented 3 years ago

Well spotted, I should have realized this:

err-2404626-29:

 ESC[00;31;1mFATAL ERRORESC[00m

GAMBIT has exited with fatal exception: Error in function boost::math::erf_inv<double>(double, double): Overflow Error

It might well be from the prior I implemented... So either one goes for a full debug mode, or check in #6 / in d0f653df998

fzeiser commented 3 years ago

So I guess it's here: https://github.com/anderkve/gambit_np/blob/d0f653df998cdc8f280ec72255979940618be1cc/ScannerBit/src/priors/truncatedgaussian.cpp#L70-L74

https://github.com/anderkve/gambit_np/blob/d0f653df998cdc8f280ec72255979940618be1cc/ScannerBit/src/priors/truncatedgaussian.cpp#L103-L106

But this matches also closely the implementation of the gaussian prior :confused:, which I assume works as expected. Could it be rounding errors calculating a and b or so? https://github.com/anderkve/gambit_np/blob/d0f653df998cdc8f280ec72255979940618be1cc/ScannerBit/include/gambit/ScannerBit/priors/gaussian.hpp#L62-L71

fzeiser commented 3 years ago

Seems like r=0; lower_cut = mu -> erf_inv(0) works fine, but erf_inv(1) throws this type of error, which would happen for upper_cut=infty [which is the same as: no upper cut]. Equivalently, for upper_cut = mu and alpha = - infty.

[Note: Mathematically, erf_inv(1) = infty]

fzeiser commented 3 years ago

We could test/add these as special cases if indeed r=[0, 1] and not (0,1). However, he lower and upper cuts were anyhow defined such that I don't see this happening immediately. Potentialy we could check the output with increased verbosity and a very easy likelihood, like spartan?

fzeiser commented 3 years ago

@anderkve: How did you retrieve the hypercube pars? Is it just this one in the yaml file?

KeyValues:

  debug: true  # Set to true for very verbose log files

anderkve commented 3 years ago

Turns out there was an option to print the hypercube parameters to the output, I just remembered the name wrong. It's print_unitcube: true in the KeyValues block. Check spartan.yaml for an example.

To get the parameter point written to the log file (not in unit hypercube form, though) you need to run with the debug: true option you mention.

anderkve commented 3 years ago

(I suspect adding a couple of cout statements directly in the prior function might still be the easiest debugging approach, though.)

fzeiser commented 3 years ago

stupid question, but what cout / scanner_cout ... do I use to get things into the default log?

anderkve commented 3 years ago

In a regular GAMBIT module function you can do logger() << "your message" << EOM;. See here for examples:

https://github.com/anderkve/gambit_np/blob/3183fff753376523730eda1920711f221224a9b5/ExampleBit_A/src/ExampleBit_A.cpp#L134

If you want to get the same message also printed to cout / cerr, you can do logger() << LogTags::repeat_to_cout << "your message" << EOM;, and similarly with LogTags::repeat_to_cerr.

Though, it might be that this logger stuff won't work the same way from within a prior transform function, because it's not a GAMBIT module function in the usual sense. I'd need to look into the details of that.

fzeiser commented 3 years ago

It works :)

fzeiser commented 3 years ago

The problem arises e.g. for following prior transformation:

prior: Truncated Gaussian

mu = 0
sigma = 1
lower_cut = 0
upper_cut = 50

UnitPars ["r"]: 1
lower_cut: 0
upper_cut: 50

So the hypercube includes here [0, 1]!

The calculations lead to [double checked with a python script I wrote, matching the cpp code]

alpha = (lower_cut - mu) / sigma
beta = (upper_cut - mu) / sigma

a = 0.5*(1 + erf(alpha/np.sqrt(2)))
b = 0.5*(1 + erf(beta/np.sqrt(2)))

a ≃ 0.5
b ≃ 1

and further

inner = 2*(r*(b-a) + a) - 1  # = 2b -1 ≃ 1
x = np.sqrt(2) * erfinv(inner)
x = x * sigma + mu

The argument of the inverse error function inner is not exactly one, but up to machine precision. This leads to a problem in the inverse error function, it will go to infty because of the machine precision issue. Maybe I could specify some cutoff for inner + add that for r=1 -> x= upper_limit? I have to double check this last one, but rather sure about it.

fzeiser commented 3 years ago

Actually from some tryouts, I think just adding two if-tests, for r=0and r=1 would be sufficient.

anderkve commented 3 years ago

Cool, thanks for investigating it! The if-tests sounds like a sensible fix.

fzeiser commented 3 years ago

Ok, so I could add these two cases, but then we may get the problem further down the line. For r=1, the transformed value will be x=upper_cut. This makes sense thinking about the inversion of the CDF, right :)? Anyhow, if the upper cut is not defined, it defaults to +infty. So far, so good. But we don't want our parameter going into the likelihood calculations to be infinite :confused: .

Even though I've looked at the code of the gaussian prior, I cannot quite understand where it differs in this important point. There, whenever the unitcube parameter is 0 or 1, you should get +- infty, I assume. [Or from boost::math::erf_inv one gets the error message that I've seen above.]. However, does not seem to be the case as you would otherwise have certainly run into this issue before. :confused:

truncated gaussian https://github.com/anderkve/gambit_np/blob/d0f653df998cdc8f280ec72255979940618be1cc/ScannerBit/src/priors/truncatedgaussian.cpp#L103-L106

gaussian https://github.com/anderkve/gambit_np/blob/d0f653df998cdc8f280ec72255979940618be1cc/ScannerBit/include/gambit/ScannerBit/priors/gaussian.hpp#L62-L71

@anderkve: Any good ideas on this?

fzeiser commented 3 years ago

Of course, in principle it is correct that r=1 gives you x=infty -- just that I never thought that a (truncated) gaussian distribution would actually ever return that value. For the truncated case, in which a real truncation exists, it somewhat better, because one could set e.g. x = upper_cut. So either we might modify gledeli to check that the parameters are finite -- or return already an error for the prior and redraw it, if that's possible.

fzeiser commented 3 years ago

I put in an if test, but still getting the same error. Somehow the comparison operator doesn't behave well:

 double r = unitpars[0]; // input unit cube parameter

 double x;
 if (r==0){x = lower_cut;}
 else if (r==1){x = upper_cut;}
 else{
 // Transformation:
 logger() << "\n\nlower_cut: " << lower_cut << "\nupper_cut: " << upper_cut
          << "\n mu: " << mu << "\nsigma: " << sigma
          << "\nr: " << r << endl;
 x = M_SQRT2*boost::math::erf_inv(2*(r*(b-a) + a) - 1); // output (result)
 x = x * sigma + mu;
 }
 output[myparameter] = x;

leads again to the Error in function boost::math::erf_inv<double>(double, double): Overflow Error error. The output from the logger is:

 lower_cut: 0
upper_cut: 50
 mu: 0
sigma: 1
r: 1
EMERGENCY_SHUTDOWN code broadcast to all processes

Which means that the r==1 case was not recognized. I thought that cout in cpp wouldn't convert numbers close to 1 to be represented by one, so that the logger r: 1 should have lead to r==1 --> True. Hm :disappointed:

fzeiser commented 3 years ago

@anderkve: Do you have a suggestion on this one?

anderkve commented 3 years ago

Hey -- apologies, I'm way behind on my backlog of GitHub notifications. :P

First, the cout function can indeed convert numbers unless you specify the precision as in the example below. (I've made oh so many annoying mistakes due to this... :D )

I reproduced the overflow error with the stand-alone example below. Turns out the overflow error only happens when the difference |1.0 - r| is within the double precision, but not if r is interpreted as being equal to 1.0. A solution seems to be to perform a check using std::numeric_limits<double>::epsilon() from <limits> and nudge r away from the problem region if necessary -- see below

#include <iostream>
#include <cmath>
#include <boost/math/special_functions/erf.hpp>
#include <limits>

int main()
{
  double mu = 0;
  double sigma = 1;
  double lower_cut = 0;
  double upper_cut = 50;

  double alpha = (lower_cut - mu) / sigma;
  double beta = (upper_cut - mu) / sigma;
  double a = 0.5*(1 + erf(alpha/M_SQRT2));
  double b = 0.5*(1 + erf(beta/M_SQRT2));

  std::cout << std::setprecision(20) << " a = " << a << std::endl;
  std::cout << std::setprecision(20) << " b = " << b << std::endl;

  // double r = 0.999999999999999;     // <-- works, interpreted as 0.99999...
  // double r = 0.9999999999999998;    // <-- works, interpreted as 0.99999...
  double r = 0.9999999999999999;    // <-- gives overflow error if not adjusted
  // double r = 0.99999999999999994;   // <-- gives overflow error if not adjusted
  // double r = 0.99999999999999995;   // <-- works, interpreted as 1.0
  // double r = 0.99999999999999999;   // <-- works, interpreted as 1.0

  std::cout << std::setprecision(20) << " r = " << r << std::endl;

  // This fix solves the overflow error
  if (1.0 - r < std::numeric_limits<double>::epsilon())
  {
    r = 1.0 - 2 * std::numeric_limits<double>::epsilon();
    std::cout << "   Overflow fix! Adjusted r value:" << std::endl;
    std::cout << std::setprecision(20) << "   r = " << r << std::endl;
  }

  double x;

  if (r==0) 
  {
    x = lower_cut;
  }
  else if (r==1) 
  {
    x = upper_cut;
  }
  else
  {
    x = M_SQRT2*boost::math::erf_inv(2*(r*(b-a) + a) - 1); // output (result)
    x = x * sigma + mu;
  }

  std::cout << std::setprecision(20) << " x = " << x << std::endl;

  std::cout << std::setprecision(20) << " std::numeric_limits<double>::epsilon() = " << std::numeric_limits<double>::epsilon() << std::endl;

  return 0;
}

anderkve / gambit_np

DE freezing during the run? #10