bashtage / linearmodels

Additional linear models including instrumental variable and panel data models that are missing from statsmodels.
https://bashtage.github.io/linearmodels/
University of Illinois/NCSA Open Source License

Pseudo-inverse in covariance.py #619

Closed safrazRampersaud closed 2 months ago

safrazRampersaud commented 3 months ago

Hi Kevin,

Great resource re: PanelOLS, it has come in handy professionally! I run into an issue when malformed matrices are used to compute the covariance (def cov) in covariance.py, which calls inv from numpy.linalg. Naturally, errors or warnings should occur in that case, and they do in your library.

For my use case, I could check the condition number of the matrix before calling the fit method to detect singular behavior. However, I'd prefer that the error not break my application, and I'd also prefer not to compute the condition number. As a substitute, I amended covariance.py to use the pseudo-inverse (pinv) instead of the inverse (inv), both from numpy.linalg. What are your thoughts on supporting this substitution in a future version of the library?
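For reference, the pre-fit check mentioned above could look roughly like the sketch below. It uses only NumPy; the helper name `is_well_conditioned` and the `max_cond` threshold are hypothetical choices for illustration, not part of linearmodels.

```python
import numpy as np


def is_well_conditioned(x: np.ndarray, max_cond: float = 1e10) -> bool:
    """Return True if x has full column rank and a moderate condition number.

    Hypothetical helper; max_cond is an arbitrary illustrative threshold.
    """
    full_rank = np.linalg.matrix_rank(x) == x.shape[1]
    return bool(full_rank and np.linalg.cond(x) < max_cond)


rng = np.random.default_rng(0)
good = rng.standard_normal((50, 3))
# Append an exact copy of the first column so the matrix loses rank.
bad = np.column_stack([good, good[:, 0]])

print(is_well_conditioned(good))
print(is_well_conditioned(bad))
```

Both `matrix_rank` and `cond` rely on a singular value decomposition, so this check has a cost of its own on large regressor matrices.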

Thanks!

bashtage commented 2 months ago

Is your regressor matrix rank deficient? The issue is that the estimates are not well defined if the X'X matrix is rank deficient, and so using another inverse just imposes some arbitrary normalization to get numbers out, even though these don't have any particular meaning or validity.

Do you know where the singularity comes from?
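A small NumPy sketch of the point about estimates not being well defined, assuming a toy design with two identical columns (the data and variable names are made up for illustration): the pseudo-inverse returns one coefficient vector, but other coefficient vectors reproduce exactly the same fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(100)
# Two identical columns make X'X singular.
x = np.column_stack([x1, x1])
y = 2.0 * x1 + 0.1 * rng.standard_normal(100)

# Coefficients selected by the pseudo-inverse (the minimum-norm choice).
beta_pinv = np.linalg.pinv(x.T @ x) @ (x.T @ y)

# Shifting weight between the duplicated columns leaves the fit unchanged.
beta_alt = beta_pinv + np.array([5.0, -5.0])

fitted_pinv = x @ beta_pinv
fitted_alt = x @ beta_alt
print(np.allclose(fitted_pinv, fitted_alt))
```

Since the data cannot distinguish `beta_pinv` from `beta_alt`, the individual coefficients reported by a pinv-based fit reflect the normalization rather than the data.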

safrazRampersaud commented 2 months ago

Thanks for the reply! Yes, we actually expect the singularities to occur, as we have a process that searches a large space to build regressor matrices. Our goal is to focus on the approach within the search space: build regressor matrices for all combinations in the search space (some of which we know will be ill-formed) and calculate the inverse (prior to fitting). If a matrix is ill-formed, we don't want the application to fail; we'd prefer continuity or other error handling. Bad results will get filtered out after the fit through thresholds on p-values, errors, etc., but the point is to keep processing. In my understanding, if inv were switched for pinv, well-formed regressor matrices would produce the same result, and pinv on ill-formed matrices would return results that are obviously not worth considering.

So I'm looking for a way to amend the library so that it gives the user an option to use pinv, so that the application doesn't break while searching. Looking forward to your response.
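The first half of that claim can be checked with plain NumPy. The sketch below uses made-up random data: on a well-conditioned X'X the two functions agree to floating-point tolerance, while on a rank-deficient one pinv returns finite numbers without any error, which may or may not look obviously wrong downstream.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((200, 4))
xtx = x.T @ x

# Well-conditioned case: inv and pinv agree up to floating-point error.
print(np.allclose(np.linalg.inv(xtx), np.linalg.pinv(xtx)))

# Rank-deficient case (an all-zero column): pinv still returns finite values.
x_bad = np.column_stack([x, np.zeros(200)])
xtx_pinv = np.linalg.pinv(x_bad.T @ x_bad)
print(np.isfinite(xtx_pinv).all())
```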

bashtage commented 2 months ago

pinv will continue when the matrix is singular rather than raising. This means that someone who isn't aware could be fitting models that are not identified or that have misleading standard errors. What is wrong with catching the error and moving on in your code?
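A minimal sketch of the catch-and-continue pattern in a search loop. To stay self-contained it uses a hypothetical NumPy stand-in, `fit_ols`, rather than PanelOLS, and it assumes the failure surfaces as `numpy.linalg.LinAlgError`; the exception type actually raised by the library should be checked and caught instead.

```python
import itertools

import numpy as np


def fit_ols(y, x):
    """Hypothetical stand-in for a model fit; raises if X'X is rank deficient."""
    xtx = x.T @ x
    if np.linalg.matrix_rank(xtx) < xtx.shape[0]:
        raise np.linalg.LinAlgError("X'X is rank deficient")
    return np.linalg.inv(xtx) @ (x.T @ y)


rng = np.random.default_rng(2)
n = 100
columns = {
    "a": rng.standard_normal(n),
    "b": rng.standard_normal(n),
}
columns["a_copy"] = columns["a"].copy()  # guarantees one singular candidate
y = columns["a"] + columns["b"] + 0.1 * rng.standard_normal(n)

results, skipped = {}, []
for names in itertools.combinations(columns, 2):
    x = np.column_stack([columns[k] for k in names])
    try:
        results[names] = fit_ols(y, x)
    except np.linalg.LinAlgError:
        # Record the failed candidate and keep searching.
        skipped.append(names)

print(sorted(results))
print(skipped)
```

Keeping a list of skipped candidates makes the singular cases visible after the search instead of letting them pass through as ordinary results.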

safrazRampersaud commented 2 months ago

Thanks for the reply. Catching the error means some downstream code management on our side. It's not too bad for us to refactor; I wanted to take the temperature on avoiding that. We'll more than likely head that route. Good chatting, thanks again ...