[Closed] Db-pckr closed this issue 2 years ago
Update: Tested against the R psych library, same issue. Realized that the problem seems to occur when the correlation matrix is all positive. If even one value in it is negative, the results match. Maybe this helps debug...
Thanks for the follow up! I will look into this very soon.
> Update: Tested against the R psych library, same issue. Realized that the problem seems to occur when the correlation matrix is all positive. If even one value in it is negative, the results match. Maybe this helps debug...
I can't seem to reproduce this issue. For example, using the data provided above (stored in a string variable `data`), here are my results:
```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

df = pd.DataFrame([[float(val) for val in row.split(' | ')]
                   for row in data.strip().split('\n')])
fa = FactorAnalyzer(method='minres',
                    n_factors=1,
                    rotation=None,
                    bounds=(0.005, 1),
                    is_corr_matrix=True).fit(df)
print(fa.loadings_)
```
```
[[0.3879858 ]
 [0.66334567]
 [0.32897377]
 [0.55966426]
 [0.66396016]
 [0.81430826]
 [0.8469053 ]
 [0.63367546]
 [0.44783303]
 [0.69420312]
 [0.58345214]
 [0.6963522 ]]
```
This matches R's `psych` library. Let me know if I'm missing something!
I'm using: pandas 1.2.4, numpy 1.20.2, Python 3.8.10
If you're getting correct results, I would guess it's because of an older numpy version and how it is used internally in factor_analyzer. Thanks!
I encounter the same issue (negative factor loadings):
I'm using: pandas 1.4.3, numpy 1.21.5, Python 3.9.12
Which package versions do you suggest to avoid this, please?
Thanks!
@celip38 please share your data, if possible, so we can try to reproduce the issue.
@desilinguist You can use the data I presented above to test this problem.
Thanks @Db-pckr. I can replicate this on my end too with the latest numpy library.
I poked around a bit and found that `numpy.linalg.eigh()`, which is used for the eigenvalue decomposition, returns an all-negative first eigenvector for this correlation matrix, whereas the more general (but less efficient) `numpy.linalg.eig()` returns an all-positive first eigenvector, viz.

With `eigh()`:
```
array([[-0.17816009],
       [-0.30460323],
       [-0.15106223],
       [-0.25699352],
       [-0.3048854 ],
       [-0.37392409],
       [-0.3888924 ],
       [-0.2909789 ],
       [-0.20564148],
       [-0.31877273],
       [-0.26791673],
       [-0.31975958]])
```
and, with `eig()`:
```
array([[0.17816009],
       [0.30460323],
       [0.15106223],
       [0.25699352],
       [0.3048854 ],
       [0.37392409],
       [0.3888924 ],
       [0.2909789 ],
       [0.20564148],
       [0.31877273],
       [0.26791673],
       [0.31975958]])
```
However, neither is incorrect because, as we know, if $v$ is an eigenvector, then so is $\alpha v$ for any nonzero scalar $\alpha$. It also follows that the signs on factor loadings are essentially meaningless, because all they do is flip the (already arbitrary) interpretation of the latent factor.
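The sign indeterminacy is easy to demonstrate directly. A minimal sketch (with a made-up 3x3 correlation-like matrix, not the data from this issue) showing that `eigh()` and `eig()` agree on eigenvalues, while their unit eigenvectors can differ by at most a sign:

```python
import numpy as np

# Hypothetical symmetric matrix standing in for a correlation matrix.
A = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])

vals_h, vecs_h = np.linalg.eigh(A)   # eigenvalues in ascending order
vals_g, vecs_g = np.linalg.eig(A)    # eigenvalue order unspecified

# Sort eig()'s output to match eigh()'s ascending order.
order = np.argsort(vals_g)
vals_g, vecs_g = vals_g[order], vecs_g[:, order]

# Both routines find the same eigenvalues...
assert np.allclose(vals_h, vals_g)

for i in range(3):
    # ...and both eigenvectors satisfy A @ v = lam * v,
    assert np.allclose(A @ vecs_h[:, i], vals_h[i] * vecs_h[:, i])
    # while the two unit eigenvectors can differ only by a sign,
    # so their dot product is +1 or -1.
    assert np.isclose(abs(vecs_h[:, i] @ vecs_g[:, i]), 1.0)
```

Which sign each routine returns is an implementation detail of the underlying LAPACK driver, not something either function guarantees.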
So, while we could replace `eigh()` with `eig()` to force the results to match what SPSS and R do, I am not convinced that we need to do that, since this is not really a bug.
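For anyone who wants loadings signed the way SPSS/R present them, one common convention is to flip each factor so that its loadings sum to a positive value. A minimal sketch of a helper (hypothetical, not part of factor_analyzer) applied to `fa.loadings_` after fitting:

```python
import numpy as np

def align_loading_signs(loadings):
    """Flip each factor column so its loading sum is positive.

    The sign of a factor is arbitrary, so this changes only the
    presentation, not the model fit or the communalities.
    """
    loadings = np.asarray(loadings, dtype=float)
    signs = np.sign(loadings.sum(axis=0))
    signs[signs == 0] = 1.0  # leave exactly-zero-sum columns alone
    return loadings * signs

# e.g. an all-negative single-factor solution like the one above
neg = np.array([[-0.39], [-0.66], [-0.33]])
print(align_loading_signs(neg))  # same magnitudes, all positive
```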
@jbiggsets any thoughts?
Yes, I would be inclined not to change this, since it doesn't really strike me as a bug and we use `eigh()` pretty consistently throughout. Maybe we can mention it in the documentation?
Adding to the documentation sounds like a good idea. I'll do that!
Thanks a lot!
When fitting from a correlation matrix, the results calculated with factor_analyzer differ from those produced by SPSS: the loadings appear to be multiplied by -1. Communalities are more or less equal.
Example Matrix below (12x12):
```
1.00 | 0.53 | 0.26 | 0.14 | 0.18 | 0.24 | 0.24 | 0.22 | 0.20 | 0.21 | 0.21 | 0.36
0.53 | 1.00 | 0.33 | 0.34 | 0.39 | 0.51 | 0.50 | 0.42 | 0.27 | 0.43 | 0.35 | 0.52
0.26 | 0.33 | 1.00 | 0.22 | 0.28 | 0.24 | 0.27 | 0.28 | 0.09 | 0.16 | 0.03 | 0.18
0.14 | 0.34 | 0.22 | 1.00 | 0.56 | 0.47 | 0.49 | 0.34 | 0.28 | 0.37 | 0.27 | 0.29
0.18 | 0.39 | 0.28 | 0.56 | 1.00 | 0.55 | 0.59 | 0.49 | 0.25 | 0.43 | 0.30 | 0.40
0.24 | 0.51 | 0.24 | 0.47 | 0.55 | 1.00 | 0.80 | 0.55 | 0.30 | 0.51 | 0.49 | 0.55
0.24 | 0.50 | 0.27 | 0.49 | 0.59 | 0.80 | 1.00 | 0.56 | 0.31 | 0.58 | 0.50 | 0.56
0.22 | 0.42 | 0.28 | 0.34 | 0.49 | 0.55 | 0.56 | 1.00 | 0.27 | 0.37 | 0.32 | 0.42
0.20 | 0.27 | 0.09 | 0.28 | 0.25 | 0.30 | 0.31 | 0.27 | 1.00 | 0.55 | 0.28 | 0.29
0.21 | 0.43 | 0.16 | 0.37 | 0.43 | 0.51 | 0.58 | 0.37 | 0.55 | 1.00 | 0.52 | 0.51
0.21 | 0.35 | 0.03 | 0.27 | 0.30 | 0.49 | 0.50 | 0.32 | 0.28 | 0.52 | 1.00 | 0.55
0.36 | 0.52 | 0.18 | 0.29 | 0.40 | 0.55 | 0.56 | 0.42 | 0.29 | 0.51 | 0.55 | 1.00
```
Code:
```python
# fa_df: DataFrame holding the 12x12 correlation matrix above
fa = FactorAnalyzer(method='minres', n_factors=1, rotation=None,
                    is_corr_matrix=True, bounds=(0.005, 1))
fa.fit(fa_df)
print(fa.loadings_)
```
Result:
```
[[-0.3883726 ]
 [-0.66186571]
 [-0.32924641]
 [-0.55939523]
 [-0.66587561]
 [-0.81117504]
 [-0.8457027 ]
 [-0.6317671 ]
 [-0.44712954]
 [-0.69460134]
 [-0.58560195]
 [-0.69916123]]
```
SPSS produces almost the same result except that each value is multiplied by -1 (i.e., the values are positive), while get_communalities() returns mostly the same values (mostly because SPSS rounds values on display).
Any idea what I'm missing, or what the issue is?
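The fact that communalities match even though the loadings differ in sign is expected: a variable's communality is the sum of its squared loadings, so it is invariant to a sign flip of the factor. A minimal check with hypothetical single-factor loadings:

```python
import numpy as np

# Hypothetical loadings matrix (variables x factors).
loadings = np.array([[0.39], [0.66], [0.33]])

def communalities(L):
    # Communality of variable i = sum of squared loadings in row i.
    return (L ** 2).sum(axis=1)

# Flipping the factor's sign leaves every communality unchanged.
assert np.allclose(communalities(loadings), communalities(-loadings))
```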
Thanks