ExaScience / smurff

Bayesian Factorization with Side Information in C++ with Python wrapper
MIT License

training fails on Linux #137

Closed nbosc closed 3 years ago

nbosc commented 3 years ago

Hi,

I am trying to run a training session with a binary matrix and side info using smurff 0.15.3. I started with a sample of my data on macOS, where it runs smoothly. Because I expect the job to take hours on my whole data set, I would like to use a Linux cluster. The same version of smurff is installed there, but the job ends with a strange error.

On top of that, there are warnings that I don't get on macOS.

Maybe the error is linked to the warnings, but it is hard to identify where the issue is, considering that the same data...

PythonSession {
  Data: {
    Type: ScarceMatrixData [with NAs]
    Component-wise mean: -0.950016
    Component-wise variance: 0.58482
    Noise: Probit Noise with threshold 0
    Size: 12644 [500 x 100] (25.29%)
      Warning: 11 empty cols
  }
  Model: {
    Num-latents: 32
  }
  Priors: {
    0: MacauPrior
     SideInfo: DenseDouble [500, 1030]
     Method: CG Solver with tolerance: 1.00e-06
     BetaPrecision: fixed at 5.00
    1: NormalPrior
  }
  Result: {
    Test data: 12645 [500 x 100] (25.29%)
    Binary classification threshold: 0.00
      2.39% positives in test data
  }
  Config: {
      Iterations: 40 burnin + 100 samples
      Save model: every 5 iteration
      Save prefix: /scratch/tmp5w89u3oj/
      Save extension: .ddm
  }
}
 ====== Initial phase ======
Initial   0/  0: RMSE: nan (1samp: nan)  U:[0.00e+00, 0.00e+00, ] [took: 0.0s, total: 0.0s]
warning: block_cg: could not find a solution in 1000 iterations; residual: [6.45308 5.91201 6.02034 6.34163 6.21799 5.70255 5.94154 6.54877 6.38536 6.58253 6.27891 6.36777 5.90221 6.42484 6.67025 6.09148 6.42674 6.36864 6.14103  6.3748  6.5604 6.41111 6.41783 6.40632 6.43813 6.21122  5.7952 6.19924 6.17669 6.13731 6.18038 6.10409 ].all() > 1e-06
 ====== Sampling (burning phase) ======
Burnin   1/ 40: RMSE: nan (1samp: 2.7011) AUC:nan (1samp: 0.4955)  U:[4.03e+01, 1.61e+01, ] [took: 1.6s, total: 1.6s]
warning: block_cg: could not find a solution in 1000 iterations; residual: [5.20845 5.10647 5.09606 5.16446 5.18835  4.9973 5.09204 5.11095 5.21397 5.19957 5.06096 5.09089  5.0152 5.11945 5.12727 5.03547 5.08997 5.20446 5.13861  5.1579 5.14239 5.12274 5.04472 5.21932 5.21224 5.18471 5.19778 5.19342 5.27777 5.15257 5.28383 5.20315 ].all() > 1e-06
Burnin   2/ 40: RMSE: nan (1samp: 13.9891) AUC:nan (1samp: 0.6181)  U:[1.62e+02, 5.61e+02, ] [took: 1.4s, total: 3.0s]
warning: block_cg: could not find a solution in 1000 iterations; residual: [4.87692 4.82969 4.88019 4.90073  4.8708 4.82129 4.86188 4.88822 4.94737 4.84569 4.90376 4.89433  4.9465 4.94972 4.83701 4.84832  4.8229  4.9841 4.89196 4.93839 4.94808 4.85216 4.89788 4.98703 4.78823 4.91235  4.8317 4.91458 4.80154 4.83779 4.86092 4.88723 ].all() > 1e-06
Burnin   3/ 40: RMSE: nan (1samp: 219347.6467) AUC:nan (1samp: 0.6393)  U:[8.52e+02, 1.81e+06, ] [took: 1.3s, total: 4.3s]
warning: block_cg: could not find a solution in 1000 iterations; residual: [4.97918  4.9903 4.97503 4.95984  4.9819 4.86375 4.89908 4.96771 4.97491 4.96214 4.95147 4.94525 4.95701 4.96889 4.97137 4.97788 4.97502 4.99405 4.98754 4.97425 4.97976 4.96095 4.98221 4.98948 4.85082 4.91915 4.85763 4.87478 4.86864 4.84611 4.94993 4.92186 ].all() > 1e-06
Burnin   4/ 40: RMSE: nan (1samp: 871608.7822) AUC:nan (1samp: 0.6612)  U:[4.58e+03, 1.26e+06, ] [took: 1.4s, total: 5.7s]
warning: block_cg: could not find a solution in 1000 iterations; residual: [5.08872 5.06012 5.06838 5.05406 5.01789 5.02229 5.02794 5.01241 5.10187 5.02456 5.02849 5.06436 5.02885 5.03712 5.01687 5.03718 5.02574 5.07647  5.0552 5.05633 5.03331 5.02659 5.02183 5.03783 4.92932 4.94669 4.99202 5.03898 5.02493 4.94725 4.98231 4.95509 ].all() > 1e-06
terminate called recursively
terminate called after throwing an instance of 'std::runtime_error'
terminate called recursively
/lsf/01/1616414381.3896910: line 8: 55448 Aborted                 python 02_macau_model.py --input_file training_sample_data.pkl

tvandera commented 3 years ago

Hi, instead of using Method: CG Solver with tolerance: 1.00e-06, try using the direct inversion method.

Set direct=True in addSideInfo or in MacauSession
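For readers following along, here is a minimal sketch of what that suggestion could look like in a script. The keyword names (`Ytrain`, `Ytest`, `side_info`, `direct`, `num_latent`, `burnin`, `nsamples`) follow smurff's documented `MacauSession` usage as best recalled and should be treated as assumptions to check against the installed 0.15.3. The helper names `macau_session_kwargs` and `run_macau` are hypothetical, and the probit noise setup from the log is not configured here.

```python
def macau_session_kwargs(ytrain, ytest, side_info, direct=True):
    """Collect keyword arguments for smurff.MacauSession (names are assumptions)."""
    return {
        "Ytrain": ytrain,
        "Ytest": ytest,
        # One entry per mode: side info on the rows, none on the columns,
        # matching "0: MacauPrior / 1: NormalPrior" in the log above.
        "side_info": [side_info, None],
        # direct=True asks for the direct solver instead of the CG solver.
        "direct": direct,
        # Values copied from the session summary in the log.
        "num_latent": 32,
        "burnin": 40,
        "nsamples": 100,
    }


def run_macau(ytrain, ytest, side_info, direct=True):
    """Build and run the session; requires smurff to be installed."""
    import smurff  # imported lazily so the helper above works without smurff

    kwargs = macau_session_kwargs(ytrain, ytest, side_info, direct)
    session = smurff.MacauSession(**kwargs)
    return session.run()
```

If the change took effect, the session summary should report a different `Method:` line under the MacauPrior than `CG Solver with tolerance: 1.00e-06`.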

nbosc commented 3 years ago

It still works on macOS but fails at the same stage on Linux.

Using OpenMP with up to 6 threads.
PythonSession {
  Data: {
    Type: ScarceMatrixData [with NAs]
    Component-wise mean: -0.950016
    Component-wise variance: 0.58482
    Noise: Probit Noise with threshold 0
    Size: 12644 [500 x 100] (25.29%)
      Warning: 11 empty cols
  }
  Model: {
    Num-latents: 32
  }
  Priors: {
    0: MacauPrior
     SideInfo: DenseDouble [500, 1030]
     Method: Cholesky Decomposition
     BetaPrecision: fixed at 5.00
    1: NormalPrior
  }
  Result: {
    Test data: 12645 [500 x 100] (25.29%)
    Binary classification threshold: 0.00
      2.39% positives in test data
  }
  Config: {
      Iterations: 40 burnin + 100 samples
      Save model: every 5 iteration
      Save prefix: /scratch/tmp6q0d7stj/
      Save extension: .ddm
  }
}
 ====== Initial phase ======
Initial   0/  0: RMSE: nan (1samp: nan)  U:[0.00e+00, 0.00e+00, ] [took: 0.0s, total: 0.0s]
 ====== Sampling (burning phase) ======
Burnin   1/ 40: RMSE: nan (1samp: 2.6843) AUC:nan (1samp: 0.4812)  U:[4.00e+01, 1.65e+01, ] [took: 0.2s, total: 0.2s]
Burnin   2/ 40: RMSE: nan (1samp: 3.0714) AUC:nan (1samp: 0.5856)  U:[4.10e+01, 1.86e+02, ] [took: 0.1s, total: 0.2s]
Burnin   3/ 40: RMSE: nan (1samp: 128.8635) AUC:nan (1samp: 0.6219)  U:[4.36e+01, 1.33e+04, ] [took: 0.1s, total: 0.3s]
Burnin   4/ 40: RMSE: nan (1samp: 139.8078) AUC:nan (1samp: 0.4775)  U:[4.51e+01, 1.16e+04, ] [took: 0.1s, total: 0.4s]
Burnin   5/ 40: RMSE: nan (1samp: 182799.3654) AUC:nan (1samp: 0.5110)  U:[4.42e+01, 2.20e+07, ] [took: 0.1s, total: 0.5s]
Burnin   6/ 40: RMSE: nan (1samp: 1846531.0205) AUC:nan (1samp: 0.4930)  U:[4.46e+01, 2.95e+08, ] [took: 0.1s, total: 0.5s]
Burnin   7/ 40: RMSE: nan (1samp: 1363690456.2558) AUC:nan (1samp: 0.5217)  U:[4.50e+01, 3.16e+11, ] [took: 0.1s, total: 0.6s]
terminate called recursively
terminate called recursively
terminate called recursively
/lsf/01/1616425828.3919693: line 8: 124386 Aborted                 python 02_macau_model.py --input_file training_sample_data.pkl
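The bracketed `U:[...]` values in the log above grow by many orders of magnitude before the abort. A small sketch for pulling those values out of a saved log and flagging the first iteration where they blow up is shown here. It assumes only the `Burnin` line format printed above; the function names and the growth factor of 1000 are arbitrary, hypothetical choices, and the interpretation of the `U` values as per-mode magnitudes of the latent matrices is an assumption.

```python
import re

# Matches burn-in status lines such as:
# "Burnin   3/ 40: RMSE: nan (1samp: 128.8635) AUC:nan (1samp: 0.6219)  U:[4.36e+01, 1.33e+04, ] ..."
BURNIN_RE = re.compile(r"^Burnin\s+(?P<it>\d+)/\s*\d+:.*?U:\[(?P<vals>[^\]]*)\]")


def parse_u_values(log_text):
    """Return a list of (iteration, [u0, u1, ...]) for every burn-in line."""
    rows = []
    for line in log_text.splitlines():
        match = BURNIN_RE.match(line.strip())
        if match is None:
            continue
        # The list ends with a trailing comma, so skip empty fragments.
        values = [float(v) for v in match.group("vals").split(",") if v.strip()]
        rows.append((int(match.group("it")), values))
    return rows


def first_blowup(rows, factor=1e3):
    """First iteration where any U value exceeds `factor` times its first parsed value."""
    if not rows:
        return None
    baseline = rows[0][1]
    for iteration, values in rows[1:]:
        if any(v > factor * b for v, b in zip(values, baseline) if b > 0):
            return iteration
    return None
```

Running the same check on the macOS log would show whether the values stay bounded there, which narrows the difference down to the Linux environment rather than the data.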

tvandera commented 3 years ago

Okay, I think we need more info here on what you are trying to do.

Cheers, Tom

nbosc commented 3 years ago

Right. First, I'd like to reiterate that with the same data set and the same version of smurff, but a different OS, this works fine. For your questions:

tvandera commented 3 years ago

Hi, indeed, if it works on macOS, it should also work on Linux. Maybe the easiest way would be for me to reproduce the problem?

nbosc commented 3 years ago

That would be very helpful, thanks. Can you share an email so I can send you the dataset and the script?

nbosc commented 3 years ago

Thanks @tvandera .

Apparently there was something wrong in my conda environment and the problem was solved by creating a new one.
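For anyone hitting a similar "works on one machine, fails on another" problem, one way to spot a broken environment is to dump a small fingerprint on both machines and diff the output. The sketch below uses only the Python standard library; the function name `environment_fingerprint` and the default package list are illustrative assumptions, not something taken from this thread.

```python
import json
import platform
import sys
from importlib import metadata


def environment_fingerprint(packages=("smurff", "numpy", "scipy", "pandas")):
    """Collect interpreter, platform, and package versions for comparison."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly so they show up in a diff.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": versions,
    }


if __name__ == "__main__":
    # Save this output on each machine and compare the two files.
    print(json.dumps(environment_fingerprint(), indent=2, sort_keys=True))
```

This only compares reported versions. It will not catch every kind of broken environment (for example, mismatched native libraries), so recreating the conda environment, as was done here, remains the more reliable fix.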