GispoCoding / eis_toolkit

Python library for mineral prospectivity mapping
https://eis-he.eu/
European Union Public License 1.2

[Bug]: `test_pca_with` functions result in inverse results compared to expected #454

Open nialov opened 1 week ago

nialov commented 1 week ago

Describe the bug

In the continuous integration tests of the eis_toolkit feedstock, the test_pca_with_* functions produce results that are inverted relative to the expected values. See:

Important part:

2024-11-11T12:23:31.8627368Z _________________________ test_pca_with_nodata_removal _________________________
2024-11-11T12:23:31.8631833Z 
2024-11-11T12:23:31.8641316Z     @pytest.mark.xfail(sys.platform == "win32", reason="Results deviate on Windows.", raises=AssertionError)
2024-11-11T12:23:31.8641882Z     def test_pca_with_nodata_removal():
2024-11-11T12:23:31.8642301Z         """Test that PCA function gives correct output for input that has specified nodata values and removal strategy."""
2024-11-11T12:23:31.8643016Z         data = np.array([[1, 1], [2, -9999], [3, 3]])
2024-11-11T12:23:31.8643559Z         pca_array, principal_components, explained_variances, explained_variance_ratios = compute_pca(
2024-11-11T12:23:31.8644105Z             data, 2, nodata_handling="remove", nodata=-9999
2024-11-11T12:23:31.8644491Z         )
2024-11-11T12:23:31.8644768Z     
2024-11-11T12:23:31.8645309Z         expected_pca_values = np.array([[-1.414, 0.0], [np.nan, np.nan], [1.414, 0.0]])
2024-11-11T12:23:31.8645911Z         expected_component_values = np.array([[0.707, 0.707], [-0.707, 0.707]])
2024-11-11T12:23:31.8646370Z         expected_explained_variance_ratios_values = [1.0, 0.0]
2024-11-11T12:23:31.8646756Z     
2024-11-11T12:23:31.8647085Z         np.testing.assert_equal(principal_components.size, 4)
2024-11-11T12:23:31.8647533Z         np.testing.assert_equal(explained_variances.size, 2)
2024-11-11T12:23:31.8647982Z         np.testing.assert_equal(explained_variance_ratios.size, 2)
2024-11-11T12:23:31.8648430Z         np.testing.assert_equal(pca_array.shape, DATA.shape)
2024-11-11T12:23:31.8648808Z     
2024-11-11T12:23:31.8649156Z         np.testing.assert_array_almost_equal(pca_array, expected_pca_values, decimal=3)
2024-11-11T12:23:31.8649658Z >       np.testing.assert_array_almost_equal(principal_components, expected_component_values, decimal=3)
2024-11-11T12:23:31.8649988Z 
2024-11-11T12:23:31.8650343Z tests/exploratory_analyses/pca_test.py:158: 
2024-11-11T12:23:31.8651030Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2024-11-11T12:23:31.8651667Z ../_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/lib/python3.9/contextlib.py:79: in inner
2024-11-11T12:23:31.8652420Z     return func(*args, **kwds)
2024-11-11T12:23:31.8652873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2024-11-11T12:23:31.8653130Z 
2024-11-11T12:23:31.8653512Z args = (<function assert_array_almost_equal.<locals>.compare at 0x7ff0a0b30ca0>, array([[ 0.70710678,  0.70710678],
2024-11-11T12:23:31.8671665Z        [ 0.70710678, -0.70710678]]), array([[ 0.707,  0.707],
2024-11-11T12:23:31.8672174Z        [-0.707,  0.707]]))
2024-11-11T12:23:31.8672883Z kwds = {'err_msg': '', 'header': 'Arrays are not almost equal to 3 decimals', 'precision': 3, 'verbose': True}
2024-11-11T12:23:31.8673134Z 
2024-11-11T12:23:31.8673375Z     @wraps(func)
2024-11-11T12:23:31.8673637Z     def inner(*args, **kwds):
2024-11-11T12:23:31.8673892Z         with self._recreate_cm():
2024-11-11T12:23:31.8674169Z >           return func(*args, **kwds)
2024-11-11T12:23:31.8674435Z E           AssertionError: 
2024-11-11T12:23:31.8674722Z E           Arrays are not almost equal to 3 decimals
2024-11-11T12:23:31.8674983Z E           
2024-11-11T12:23:31.8675391Z E           Mismatched elements: 2 / 4 (50%)
2024-11-11T12:23:31.8675695Z E           Max absolute difference: 1.41410678
2024-11-11T12:23:31.8676074Z E           Max relative difference: 2.00015103
2024-11-11T12:23:31.8676412Z E            x: array([[ 0.707,  0.707],
2024-11-11T12:23:31.8676817Z E                  [ 0.707, -0.707]])
2024-11-11T12:23:31.8677165Z E            y: array([[ 0.707,  0.707],
2024-11-11T12:23:31.8677548Z E                  [-0.707,  0.707]])

The expected is:

expected_component_values = np.array([[0.707, 0.707], [-0.707, 0.707]])

but the result is:

array([[ 0.707,  0.707], [0.707,  -0.707]])

Environment details

I am wondering whether there is some sorting in the PCA code that might cause different results, or whether the order of the results matters at all?

chudasama-bijal commented 3 days ago

@nmaarnio could you check whether there is anything wrong with the computation code for this, or redirect this to the source team?

In principle, the components are the unit eigenvectors; together they form the transformation matrix used to compute the coordinates of the data in the rotated 'PC' space. The order matters because each value is a coefficient of the transformation in the corresponding dimension. If the values are otherwise identical, a sign inversion could imply a rotation in the opposite direction, unless all components have reversed signs.

The current test data is perfectly linearly correlated, so the solution for the second component becomes inconsequential because the corresponding eigenvalue is 0. Perhaps the test data could be changed so that it is not of the 'x = y' kind; then the covariance matrix would be nonsingular.
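To make the degeneracy concrete, here is a minimal NumPy sketch. It is not the eis_toolkit implementation: it assumes the nodata row is simply dropped and the remaining rows are mean-centered, and the variable names are only illustrative. It checks that the second component from the test and the one from the CI log both satisfy the eigenvector equation for the zero eigenvalue.

```python
import numpy as np

# Rows that remain after dropping the nodata row [2, -9999] from the test data.
data = np.array([[1.0, 1.0], [3.0, 3.0]])
centered = data - data.mean(axis=0)

# With perfectly correlated columns the covariance matrix is singular.
cov = np.cov(centered, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)  # ascending order

v2_expected = np.array([-0.70710678, 0.70710678])  # second row expected by the test
v2_observed = np.array([0.70710678, -0.70710678])  # second row seen in the CI log

# Both candidates satisfy cov @ v = 0 * v, so neither is "more correct".
residual_expected = np.abs(cov @ v2_expected).max()
residual_observed = np.abs(cov @ v2_observed).max()

# Projecting onto either candidate gives the same (all-zero) second score column.
scores_expected = centered @ v2_expected
scores_observed = centered @ v2_observed
```

Because the second score column is zero either way, the sign of this component has no effect on the transformed data in this particular test.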

chudasama-bijal commented 2 days ago

Both [[ 0.707, 0.707], [0.707, -0.707]] and [[ 0.707, 0.707], [-0.707, 0.707]] are valid solutions for this test data. Which one is returned is a matter of the convention that the functions follow. I suggest that for the test functions test_pca_with_nodata_removal() and test_pca_with_nan_removal() the expected_component_values be changed to expected_component_values = np.array([[0.707, 0.707], [0.707, -0.707]]). That will add consistency to the code and resolve this issue. However, it should be checked with some varied test data that has a unique solution.

nialov commented 2 days ago

I suggest that for the test functions test_pca_with_nodata_removal() and test_pca_with_nan_removal() the expected_component_values be changed to expected_component_values = np.array([[0.707, 0.707], [0.707, -0.707]]). That will add consistency to the code and resolve this issue. However, it should be checked with some varied test data that has a unique solution.

Note that the tests run successfully in the conda environment defined in this repository. So if we change the expected_component_values, the tests here will fail instead, while the tests in the feedstock repository will succeed. Should the convention be strict? Or is it okay/expected for the user to get these values with different signs or in a different order?

If the latter, I can disable the tests in the feedstock and make a pull request here that relaxes the tests to reflect this.

If the former, should the function be changed to better reinforce the order?
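For reference, the "relax the tests" option could look roughly like the sketch below. The helper name assert_components_equal_up_to_sign is hypothetical and not part of eis_toolkit or NumPy; it only assumes that components are stored as rows, as in the test above, and it treats a row and its negation as equivalent.

```python
import numpy as np


def assert_components_equal_up_to_sign(actual, expected, decimal=3):
    """Compare component rows while ignoring a per-row sign flip (hypothetical helper)."""
    actual = np.asarray(actual, dtype=float)
    expected = np.asarray(expected, dtype=float)
    assert actual.shape == expected.shape

    # A negative dot product with the expected row means the row is flipped.
    signs = np.sign(np.sum(actual * expected, axis=1))
    signs[signs == 0] = 1.0  # leave orthogonal rows untouched so they still fail

    np.testing.assert_array_almost_equal(actual * signs[:, np.newaxis], expected, decimal=decimal)


# Values taken from the CI log and from the current test.
observed = np.array([[0.70710678, 0.70710678], [0.70710678, -0.70710678]])
expected = np.array([[0.707, 0.707], [-0.707, 0.707]])
assert_components_equal_up_to_sign(observed, expected)  # passes for either sign
```

This keeps the row order strict and only relaxes the sign, so a genuinely different set of components would still fail the comparison.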

nmaarnio commented 2 days ago

Hi @chudasama-bijal and @nialov. I am busy with other work for some time and I already have many things on my EIS TODO list. Could somebody else check this and/or make the decision? I don't think my opinion should weigh any more than someone else's in this.

nialov commented 1 day ago

@nmaarnio Yes, we can handle it!