training fails on Linux

nbosc commented 3 years ago

Hi,

I am trying to run a training session with a binary matrix and side info using smurff 0.15.3. I started with a sample of my data on macOS, where it runs smoothly. Since I expect the job to take hours on my whole data set, I would like to use a Linux cluster. The same version of smurff is installed there, but the job ends with a strange error.

On top of that, there are warnings that I don't get on macOS.

Maybe the error is linked to the warnings, but it is hard to identify where the issue is, considering that the same data...

PythonSession {
  Data: {
    Type: ScarceMatrixData [with NAs]
    Component-wise mean: -0.950016
    Component-wise variance: 0.58482
    Noise: Probit Noise with threshold 0
    Size: 12644 [500 x 100] (25.29%)
      Warning: 11 empty cols
  }
  Model: {
    Num-latents: 32
  }
  Priors: {
    0: MacauPrior
     SideInfo: DenseDouble [500, 1030]
     Method: CG Solver with tolerance: 1.00e-06
     BetaPrecision: fixed at 5.00
    1: NormalPrior
  }
  Result: {
    Test data: 12645 [500 x 100] (25.29%)
    Binary classification threshold: 0.00
      2.39% positives in test data
  }
  Config: {
      Iterations: 40 burnin + 100 samples
      Save model: every 5 iteration
      Save prefix: /scratch/tmp5w89u3oj/
      Save extension: .ddm
  }
}
 ====== Initial phase ======
Initial   0/  0: RMSE: nan (1samp: nan)  U:[0.00e+00, 0.00e+00, ] [took: 0.0s, total: 0.0s]
warning: block_cg: could not find a solution in 1000 iterations; residual: [6.45308 5.91201 6.02034 6.34163 6.21799 5.70255 5.94154 6.54877 6.38536 6.58253 6.27891 6.36777 5.90221 6.42484 6.67025 6.09148 6.42674 6.36864 6.14103  6.3748  6.5604 6.41111 6.41783 6.40632 6.43813 6.21122  5.7952 6.19924 6.17669 6.13731 6.18038 6.10409 ].all() > 1e-06
 ====== Sampling (burning phase) ======
Burnin   1/ 40: RMSE: nan (1samp: 2.7011) AUC:nan (1samp: 0.4955)  U:[4.03e+01, 1.61e+01, ] [took: 1.6s, total: 1.6s]
warning: block_cg: could not find a solution in 1000 iterations; residual: [5.20845 5.10647 5.09606 5.16446 5.18835  4.9973 5.09204 5.11095 5.21397 5.19957 5.06096 5.09089  5.0152 5.11945 5.12727 5.03547 5.08997 5.20446 5.13861  5.1579 5.14239 5.12274 5.04472 5.21932 5.21224 5.18471 5.19778 5.19342 5.27777 5.15257 5.28383 5.20315 ].all() > 1e-06
Burnin   2/ 40: RMSE: nan (1samp: 13.9891) AUC:nan (1samp: 0.6181)  U:[1.62e+02, 5.61e+02, ] [took: 1.4s, total: 3.0s]
warning: block_cg: could not find a solution in 1000 iterations; residual: [4.87692 4.82969 4.88019 4.90073  4.8708 4.82129 4.86188 4.88822 4.94737 4.84569 4.90376 4.89433  4.9465 4.94972 4.83701 4.84832  4.8229  4.9841 4.89196 4.93839 4.94808 4.85216 4.89788 4.98703 4.78823 4.91235  4.8317 4.91458 4.80154 4.83779 4.86092 4.88723 ].all() > 1e-06
Burnin   3/ 40: RMSE: nan (1samp: 219347.6467) AUC:nan (1samp: 0.6393)  U:[8.52e+02, 1.81e+06, ] [took: 1.3s, total: 4.3s]
warning: block_cg: could not find a solution in 1000 iterations; residual: [4.97918  4.9903 4.97503 4.95984  4.9819 4.86375 4.89908 4.96771 4.97491 4.96214 4.95147 4.94525 4.95701 4.96889 4.97137 4.97788 4.97502 4.99405 4.98754 4.97425 4.97976 4.96095 4.98221 4.98948 4.85082 4.91915 4.85763 4.87478 4.86864 4.84611 4.94993 4.92186 ].all() > 1e-06
Burnin   4/ 40: RMSE: nan (1samp: 871608.7822) AUC:nan (1samp: 0.6612)  U:[4.58e+03, 1.26e+06, ] [took: 1.4s, total: 5.7s]
warning: block_cg: could not find a solution in 1000 iterations; residual: [5.08872 5.06012 5.06838 5.05406 5.01789 5.02229 5.02794 5.01241 5.10187 5.02456 5.02849 5.06436 5.02885 5.03712 5.01687 5.03718 5.02574 5.07647  5.0552 5.05633 5.03331 5.02659 5.02183 5.03783 4.92932 4.94669 4.99202 5.03898 5.02493 4.94725 4.98231 4.95509 ].all() > 1e-06
terminate called recursively
terminate called after throwing an instance of 'std::runtime_error'
terminate called recursively
/lsf/01/1616414381.3896910: line 8: 55448 Aborted                 python 02_macau_model.py --input_file training_sample_data.pkl

tvandera commented 3 years ago

Hi, instead of using Method: CG Solver with tolerance: 1.00e-06, try using the direct inversion method.

Set direct=True in addSideInfo or in MacauSession
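
A minimal sketch of what that could look like from Python. The keyword names (`Ytrain`, `Ytest`, `side_info`, `direct`, `num_latent`, `burnin`, `nsamples`) follow the smurff tutorial and are an assumption here, so check them against the installed version. The helpers `macau_kwargs` and `run_macau` are hypothetical, and the numeric values simply mirror the log above.

```python
def macau_kwargs(Ytrain, Ytest, side_info, direct=True):
    """Collect MacauSession keyword arguments (hypothetical helper).

    Argument names follow the smurff tutorial and are an assumption;
    num_latent, burnin and nsamples mirror the values in the log above.
    """
    return dict(
        Ytrain=Ytrain,
        Ytest=Ytest,
        side_info=[side_info, None],  # side info on rows only, none on columns
        direct=direct,                # True: direct solve instead of the CG solver
        num_latent=32,
        burnin=40,
        nsamples=100,
    )


def run_macau(**kwargs):
    """Run the session; smurff is imported lazily so the helper above needs no install."""
    import smurff  # assumed to be installed in the active environment
    return smurff.MacauSession(**kwargs).run()
```

With the training matrix, test matrix and side info already loaded, the call would be `run_macau(**macau_kwargs(Ytrain, Ytest, side_info))`.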

nbosc commented 3 years ago

It still works on macOS, but it fails at the same stage on Linux:

Using OpenMP with up to 6 threads.
PythonSession {
  Data: {
    Type: ScarceMatrixData [with NAs]
    Component-wise mean: -0.950016
    Component-wise variance: 0.58482
    Noise: Probit Noise with threshold 0
    Size: 12644 [500 x 100] (25.29%)
      Warning: 11 empty cols
  }
  Model: {
    Num-latents: 32
  }
  Priors: {
    0: MacauPrior
     SideInfo: DenseDouble [500, 1030]
     Method: Cholesky Decomposition
     BetaPrecision: fixed at 5.00
    1: NormalPrior
  }
  Result: {
    Test data: 12645 [500 x 100] (25.29%)
    Binary classification threshold: 0.00
      2.39% positives in test data
  }
  Config: {
      Iterations: 40 burnin + 100 samples
      Save model: every 5 iteration
      Save prefix: /scratch/tmp6q0d7stj/
      Save extension: .ddm
  }
}
 ====== Initial phase ======
Initial   0/  0: RMSE: nan (1samp: nan)  U:[0.00e+00, 0.00e+00, ] [took: 0.0s, total: 0.0s]
 ====== Sampling (burning phase) ======
Burnin   1/ 40: RMSE: nan (1samp: 2.6843) AUC:nan (1samp: 0.4812)  U:[4.00e+01, 1.65e+01, ] [took: 0.2s, total: 0.2s]
Burnin   2/ 40: RMSE: nan (1samp: 3.0714) AUC:nan (1samp: 0.5856)  U:[4.10e+01, 1.86e+02, ] [took: 0.1s, total: 0.2s]
Burnin   3/ 40: RMSE: nan (1samp: 128.8635) AUC:nan (1samp: 0.6219)  U:[4.36e+01, 1.33e+04, ] [took: 0.1s, total: 0.3s]
Burnin   4/ 40: RMSE: nan (1samp: 139.8078) AUC:nan (1samp: 0.4775)  U:[4.51e+01, 1.16e+04, ] [took: 0.1s, total: 0.4s]
Burnin   5/ 40: RMSE: nan (1samp: 182799.3654) AUC:nan (1samp: 0.5110)  U:[4.42e+01, 2.20e+07, ] [took: 0.1s, total: 0.5s]
Burnin   6/ 40: RMSE: nan (1samp: 1846531.0205) AUC:nan (1samp: 0.4930)  U:[4.46e+01, 2.95e+08, ] [took: 0.1s, total: 0.5s]
Burnin   7/ 40: RMSE: nan (1samp: 1363690456.2558) AUC:nan (1samp: 0.5217)  U:[4.50e+01, 3.16e+11, ] [took: 0.1s, total: 0.6s]
terminate called recursively
terminate called recursively
terminate called recursively
/lsf/01/1616425828.3919693: line 8: 124386 Aborted                 python 02_macau_model.py --input_file training_sample_data.pkl

tvandera commented 3 years ago

Okay, I think we need more info here on what you are trying to do.

Since you are using ProbitNoise, you want to factor a binary matrix, right? What are the two values in your matrix?
You only have 2.39% positives in your test set, which is not a lot. How many positives/negatives are in your train set?
Have you tried without sideinfo?
Have you tried a different noise model, ignoring the fact that your matrix is binary?

Cheers, Tom

nbosc commented 3 years ago

Right. First, I'd like to reiterate that with the same data set and the same version of smurff, but on a different OS, this works fine. To answer your questions:

It is a binary matrix indeed: 1 for active, -1 for inactive, None for missing values. The threshold is 0.
This sample dataset has only 500 rows and 100 columns. Its only purpose is to check that my script is working. It has 24671 inactives and 618 actives. A huge discrepancy, I know, but as I said, it works on macOS...
Without sideInfo it fails as well.
I have not tried a different noise model. I am new to matrix factorisation, and for now I am trying what's in the tutorial.
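
For reference, the class balance quoted above can be checked with a few lines of plain Python. This is only a sketch: `class_balance` is a hypothetical helper, and the toy list is a flat stand-in that reproduces the counts from this thread (618 actives, 24671 inactives, the rest of the 500 x 100 cells missing) rather than the real matrix.

```python
def class_balance(values):
    """Count actives (+1), inactives (-1) and missing (None) entries."""
    actives = sum(1 for v in values if v == 1)
    inactives = sum(1 for v in values if v == -1)
    missing = sum(1 for v in values if v is None)
    observed = actives + inactives
    # Fraction of actives among observed cells; NaN if nothing is observed.
    fraction = actives / observed if observed else float("nan")
    return {
        "actives": actives,
        "inactives": inactives,
        "missing": missing,
        "active_fraction": fraction,
    }


# Toy data: counts from the thread, padded with None up to 500 * 100 cells.
n_cells = 500 * 100
toy = [1] * 618 + [-1] * 24671 + [None] * (n_cells - 618 - 24671)
stats = class_balance(toy)
print(f"{stats['active_fraction']:.2%} of observed cells are active")
```

Running the same count separately on the train and test splits would show whether both contain enough actives.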

tvandera commented 3 years ago

Hi, indeed, if it works on macOS, it should also work on Linux. Maybe the easiest way would be for me to reproduce the problem?

nbosc commented 3 years ago

That would be very helpful, thanks. Can you share an email so I can send you the dataset and the script?

nbosc commented 3 years ago

Thanks @tvandera.

Apparently there was something wrong in my conda environment and the problem was solved by creating a new one.
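
The thread does not say what exactly was wrong with the environment. For anyone hitting something similar, one way to compare a working and a failing machine is to dump the interpreter, platform and package versions on both and diff the output. The sketch below uses only the standard library; `environment_report` is a hypothetical helper and the package list is illustrative.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages):
    """Return interpreter, OS and installed versions for the given package names."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "prefix": sys.prefix,  # shows which environment is active
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # not installed in this environment
    return report


if __name__ == "__main__":
    # Illustrative package list; extend it with whatever the script imports.
    for key, value in environment_report(["smurff", "numpy", "scipy"]).items():
        print(f"{key}: {value}")
```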

ExaScience / smurff

training fails on Linux #137