fabian-sp / GGLasso

A Python package for General Graphical Lasso computation
MIT License
30 stars 14 forks source link

Warnings for non-positive-definite #41

Closed lensory closed 1 year ago

lensory commented 1 year ago

Hello, I am getting the following warnings when trying to compute the inverse covariance matrix (SGL). I'm not sure about the meaning of this warning message and don't know how to solve it. (for example, the EV?) I have already searched in the documentations but did'n find anything related.

SINGLE GRAPHICAL LASSO PROBLEM 
Regularization parameters:
{'lambda1': 0.1, 'mu1': None}
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0031095006544593776)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0025232997414576415)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.002806690357338764)
......

My source code is similar to the example given in the doc. tmp is a data matrix. Note: the feature dimension may be much larger than the number of instances (31), so it may be a very sparse matrix.

from gglasso.problem import glasso_problem
P = glasso_problem(tmp.cov().to_numpy(), 31, reg_params = {'lambda1': 0.1}, latent = False, do_scaling = False)
print(P)

lambda1_range = np.logspace(1, -3, 30)
modelselect_params = {'lambda1_range': lambda1_range}

P.model_selection(modelselect_params = modelselect_params, method = 'eBIC', gamma = 0.1)

# regularization parameters are set to the best ones found during model selection
print(P.reg_params)

Could you please help me check what the problem is?

Thanks.

fabian-sp commented 1 year ago

Hi,

indeed now that I see it, this warning is a bit cryptic. I will try to explain: The solution of SGL should be positive definite. We check this at the end of optimization. If the minimal eigen-value (EV) is negative or zero, the warning will be raised. In your case, as you do not have latent variables, you can ignore the -L in the warning, so it just checks the positive definiteness of Theta, the estimated inverse covariance.

As you see, in your example the minimal eigenvalue is actually negative, but very small (order 1e-3). So this might occur only due to numerical inaccuracies in the optimization. This is also why it proposes to decrease the tolerance (argument tol and rtol in model_selection(); but as it says that it reached the maximum number of iterations, I would doubt that decreasing the tolerance helps here.

It's a bit hard to judge from far, but generally I would say that if you don't see sth odd in the computed inverse covariance, this really might just be due to numerical precision (especially if the dimension is large).

You might also try to set do_scaling = True. This scales your input data to correlations before solving and might lead to better conditioned problems. Note that the output then gets rescaled, but the lambda that is chosen might change. More info is here: https://gglasso.readthedocs.io/en/latest/problem-object.html

Let me know if you have more insights or questions!

Cheers, Fabian

lensory commented 1 year ago

Hi,

Thanks for your reply. I think there are several different possible problems in these situation. First, the numerical inaccruacies due to the large feature dimension (>=2000), could be solved according to your suggestions (tol, rtol, do_scaling). Second, the max iterations, could be improved by increasing the max iteration.

However, when I do the model selection, I found that the final best was not the one with getting to the primal optimal or dual optimal, but the one reached the max iteration. I think this is a little weird because usually reaching the max iteration means that it didn't get the optimal solution and is not a "good" solution.

SINGLE GRAPHICAL LASSO PROBLEM 
Regularization parameters:
{'lambda1': 0.1, 'mu1': None}
ADMM terminated after 1000 iterations with status: max iterations reached.
ADMM terminated after 1000 iterations with status: max iterations reached.
ADMM terminated after 1000 iterations with status: max iterations reached.
ADMM terminated after 1000 iterations with status: max iterations reached.
ADMM terminated after 1000 iterations with status: max iterations reached.
ADMM terminated after 1000 iterations with status: max iterations reached.
ADMM terminated after 1000 iterations with status: max iterations reached.
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0021227830433796523)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0027637757226664506)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0024980472220664284)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0031157289023721405)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.002613514759527239)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0027421185333019473)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0021038561223048974)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0014578519520370265)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0002832045885465038)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.00010602873832936269)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0003077214082750819)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -7.806547028343511e-05)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.00044790781769210624)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.00013656434051542582)
ADMM terminated after 1000 iterations with status: max iterations reached.
ADMM terminated after 1000 iterations with status: max iterations reached.
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -9.168678289833889e-06)
ADMM terminated after 1000 iterations with status: max iterations reached.
ADMM terminated after 1000 iterations with status: max iterations reached.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
ADMM terminated after 1000 iterations with status: primal optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -7.1845810290354825e-06)
ADMM terminated after 1000 iterations with status: primal optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -1.3224303285816912e-05)
ADMM terminated after 1000 iterations with status: primal optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -2.3739348199640197e-05)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0001855918466465662)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0002903861115240558)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.00012710845049182642)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.00013194159056960483)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.00012138628317324545)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -1.0027505990202075e-05)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.00022145151636735277)
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0004746158413973623)
{'lambda1': 0.013257113655901081, 'mu1': 0}
>>> P.solve()
ADMM terminated after 1000 iterations with status: max iterations reached.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -4.317168472991401e-05)

Thanks.

fabian-sp commented 1 year ago

Ok, so I would try two things. first do_scaling=True and solving manually for the lambda that was selected in the model selection. You can do this with https://github.com/fabian-sp/GGLasso/blob/fe1c870ce968ac400a90af9667602ccb6f339f96/gglasso/problem.py#L371

The argument solver_params lets you specify arguments for the solver, eg solver_params={'max_iter' : ...}

Also maybe set verbose=True there to see a bit more. If you still have a negative eigenvalue for a massively large number of iterations, then there really might be sth off in your data.

lensory commented 1 year ago

Thanks for your help. However, sometimes when I running the function model_selection, an key error was reported. Could you please give me some explanation about this error? Also, I think it is better to have a "verbose" switch when doing the model selection, through which we can find the specific parameter setting inducing this error more easily.

KeyError                                  Traceback (most recent call last)
Cell In[36], line 34
     32 lambda1_range = np.logspace(0, -5, 50)
     33 modelselect_params = {'lambda1_range': lambda1_range}
---> 34 P.model_selection(modelselect_params = modelselect_params, method = 'eBIC', gamma = 0.1)
     35 # regularization parameters are set to the best ones found during model selection
     36 print(P.reg_params)

File ~/.conda/envs/dask/lib/python3.10/site-packages/gglasso/problem.py:634, in glasso_problem.model_selection(self, modelselect_params, method, gamma, tol, rtol, store_all)
    632     self.solution._set_solution(Theta = sol['Theta'], L = sol['L'])   
    633 else:
--> 634     self.solution._set_solution(Theta = sol['Theta'])   
    637 self.modelselect_stats = stats.copy()
    640 return

KeyError: 'Theta'
fabian-sp commented 1 year ago

Hi, thanks for the suggestion of having a verbose mode in model_selection. Regarding your error, it is a bit weird. The variable sol is the solution of the Graphical Lasso problem and should always have the key Theta. Can you give the complete code when this happens ?

When, running P.model_selection(), you should also see some messages regarding the ADMM convergence even before this error appears?

lensory commented 1 year ago

Sure. Actually, this code is similar to the example.

from gglasso.problem import glasso_problem
P = glasso_problem(pivot.cov().to_numpy(), 31, reg_params = {'lambda1': 0.1}, latent = False, do_scaling = False)
print(P)
# P.solve(verbose=True)
lambda1_range = np.logspace(0, -3, 30)
modelselect_params = {'lambda1_range': lambda1_range}
P.model_selection(modelselect_params = modelselect_params, method = 'eBIC', gamma = 0.1)
# regularization parameters are set to the best ones found during model selection
print(P.reg_params)

The total logs are as below.

SINGLE GRAPHICAL LASSO PROBLEM 
Regularization parameters:
{'lambda1': 0.1, 'mu1': None}
ADMM terminated after 568 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0037999928915521575)
ADMM terminated after 435 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0035004382608983526)
ADMM terminated after 353 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.003141025851917124)
ADMM terminated after 390 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0030319646290409385)
ADMM terminated after 280 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.003184031354797763)
ADMM terminated after 202 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.003296948531971259)
ADMM terminated after 147 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0031285639676415984)
ADMM terminated after 111 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0026703415791629703)
ADMM terminated after 143 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.00047487423590455194)
ADMM terminated after 201 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.00017262489551763663)
ADMM terminated after 217 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0011639029074073696)
ADMM terminated after 272 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0016862749007426966)
ADMM terminated after 310 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0009212441910373238)
ADMM terminated after 344 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.000742142588398624)
ADMM terminated after 379 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0004950508102477793)
ADMM terminated after 416 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0003889038545128786)
ADMM terminated after 461 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0010440254232681352)
ADMM terminated after 476 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0009994842226706637)
ADMM terminated after 460 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0007213750464897432)
ADMM terminated after 502 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0004710842917620462)
ADMM terminated after 569 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.00031439687770522994)
ADMM terminated after 469 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0010475665132854257)
ADMM terminated after 485 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0005283634864800193)
ADMM terminated after 526 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.00048737810035485616)
ADMM terminated after 431 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0012164678318708276)
ADMM terminated after 412 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0007827924710154866)
ADMM terminated after 405 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0005009679702985216)
ADMM terminated after 400 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0003589746740152163)
ADMM terminated after 311 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.001116698400561832)
ADMM terminated after 285 iterations with status: optimal.
WARNING: Theta (Theta - L resp.) is not positive definite. Solve to higher accuracy! (min EV is -0.0007611137582692696)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[41], line 34
     32 lambda1_range = np.logspace(0, -3, 30)
     33 modelselect_params = {'lambda1_range': lambda1_range}
---> 34 P.model_selection(modelselect_params = modelselect_params, method = 'eBIC', gamma = 0.1)
     35 # regularization parameters are set to the best ones found during model selection
     36 print(P.reg_params)

File ~/.conda/envs/dask/lib/python3.10/site-packages/gglasso/problem.py:634, in glasso_problem.model_selection(self, modelselect_params, method, gamma, tol, rtol, store_all)
    632     self.solution._set_solution(Theta = sol['Theta'], L = sol['L'])   
    633 else:
--> 634     self.solution._set_solution(Theta = sol['Theta'])   
    637 self.modelselect_stats = stats.copy()
    640 return

KeyError: 'Theta'
lensory commented 1 year ago

If you need the pivot table to reproduce this error, please leave me a message about your email address and I will send it to you.

fabian-sp commented 1 year ago

you can send the data to fabian.schaipp@tum.de

But I think I know now what the issue is. It happens because for all lambda values the solution is not positive definite (see the warning), so the ebic might always be nan or inf.

Did you check that your input data is positive semidefinite (it should be as you use .cov())? You could try to add the identity matrix * epsilon to the input covariance matrix, and see if the error persists.

But good to know this, I will see how I can avoid this in a future release.

fabian-sp commented 1 year ago

also, did you try do_scaling= True?

lensory commented 1 year ago

Yes, I tried but the result was not so satisfactory. After applying this setting, the chosen lambda1 seemed always be 1, which means the largest penalization and the resulting matrix would be diagonal. However, in my application, I'm trying to find some relationship between different variables. In this case, I hope the coefficient between different variables in the inverse covariance matrix should be not always zero. By the way, during the model selection process, is there any criteria not requiring the solution to be strictly positive definite. As you can see, the EV value is quite closed to 0. (Also, I used .cov())

fabian-sp commented 1 year ago

You could do the following: if you run .model_selection() with the option store_all = True, you can access the solutions for all lambda-values in P._all_theta.

You could then select a desired estimator, for example based on a target sparsity.

Alternatively, you can always solve the problem for a specified lambda with P.solve(), you can set the lambda with P.set_reg_params({'lambda1': ...}). Even if the smallest EV is negative it will return a solution (and a warning). You could also loop over your lambda range (the model_selection function basically also just loops, and then selects according to the eBIC - one trick is to use the solution of the previous lambda as starting point for the next).

Just for the record, if the input is given as

S = pivot.corr().to_numpy()
S = S + 1e-3*np.eye(S.shape[0]) # add small value to diagonal

then ADMM converges just fine - but as you said it selects a diagonal precision matrix.

In general, as you are in a very high-dimensional setting, it is somewhat expected that a very sparse solution will be selected. You could try to set gamma=0 or method='AIC' - but this typically leads to more false positive nonzero entries. Another way to get more signal in your dataset could be to filter the variables - but I can not confidently give you advice there as i do not know the details of the dataset.

I hope that any of the above was helpful :)