GUDHI / gudhi-devel

The GUDHI library is a generic open source C++ library, with a Python interface, for Topological Data Analysis (TDA) and Higher Dimensional Geometry Understanding.
https://gudhi.inria.fr/
MIT License
245 stars 65 forks source link

Implementing sklearn set_output API for BaseEstimator s #1093

Open martinroyer opened 1 week ago

martinroyer commented 1 week ago

Simple example showing how to implement scikit-learn's set_output API and an example test.

This outputs diagrams nicely:

>>> RipsPersistence(homology_dimensions=[0, 2], n_jobs=-2).set_output(transform="pandas").fit_transform(point_clouds)
                                       H0  H2
0  [[0.0, 10.456032537513941], [0.0, inf]]  []
1  [[0.0, 12.925628779692705], [0.0, inf]]  []
2  [[0.0, 11.126843373147965], [0.0, inf]]  []
3  [[0.0, 12.647790894762348], [0.0, inf]]  []
4  [[0.0, 11.340625000427952], [0.0, inf]]  []

Let me add that although this PR is very simple, I am a bit sore that it took so long for me to write for the following reasons:

So we could probably get rid of this dependency if I understood that correctly...

mglisse commented 1 week ago
  • CGAL is currently quite hard to install on some windows : for some reason if installed with conda install cgal-cpp, conda will not also install GMP (conda claims specs are incompatible) so gudhi will not be happy on compilation because it cannot find GMP

On windows, gudhi and cgal normally use mpir instead of gmp (mpir should install files gmp.h, libgmp.*, etc for compatibility). If that doesn't work, it is worth investigating.

  • but all this turned to be actually unnecessary, as the only reason for gudhi.datasets.generators.points to import the _points.cc is to ... generate points on a sphere or a torus, which I reckon can easily be done in python too!

The policy so far has been that CGAL is a dependency for a core part of Gudhi (alpha-complex). And once it is there, we might as well use it, users don't care much where it comes from, but they might (ok, maybe not here) care if it becomes 10x slower, or less reliable.

It could make sense to try and simplify things for people who work on pure Python parts of Gudhi, by reusing a pre-built package (so they wouldn't even need a C++ compiler), but I don't know how hard that would be.

VincentRouvreau commented 6 days ago

Looks good to me, except for one detail: @VincentRouvreau do we need to disable the test if pandas is not installed? (we have things like if(SKLEARN_FOUND) before the add_tests in python/CMakeLists.txt)

Yes this test should not be run if pandas is not installed (as it is an optinal dependency). There 2 ways to do so:

martinroyer commented 6 days ago

Looks good to me, except for one detail: @VincentRouvreau do we need to disable the test if pandas is not installed? (we have things like if(SKLEARN_FOUND) before the add_tests in python/CMakeLists.txt)

Yes this test should not be run if pandas is not installed (as it is an optinal dependency). There 2 ways to do so:

  • Modify your test like that:
def test_set_output():
    try:
        import pandas
        NB_PC = 5
        point_clouds = [points.sphere(n_samples=random.randint(100, 150), ambient_dim=2) for _ in range(NB_PC)]

        rips = RipsPersistence(homology_dimensions=[0, 2], n_jobs=-2)
        diags_pandas = rips.set_output(transform="pandas").fit_transform(point_clouds)
        assert 'H0' == diags_pandas.columns[0]
        assert 'H2' == diags_pandas.columns[1]
        assert len(diags_pandas.index) == NB_PC
    except ImportError:
        pass
  • Separate your test in a dedicated file and add some cmake magic to run it only when pandas is installed (I can help if you choose this option)

First option looks simple and therefore very good to me!

martinroyer commented 6 days ago
  • CGAL is currently quite hard to install on some windows : for some reason if installed with conda install cgal-cpp, conda will not also install GMP (conda claims specs are incompatible) so gudhi will not be happy on compilation because it cannot find GMP

On windows, gudhi and cgal normally use mpir instead of gmp (mpir should install files gmp.h, libgmp.*, etc for compatibility). If that doesn't work, it is worth investigating.

So maybe I was wrong in thinking I needed to install GMP then. The thing is, with cgal-cpp installed via conda, on running the cmake command it fails with:

-- Building for: Visual Studio 17 2022
-- Selecting Windows SDK version 10.0.22621.0 to target Windows 10.0.22631.
-- The C compiler identification is MSVC 19.40.33811.0
-- The CXX compiler identification is MSVC 19.40.33811.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.40.33807/bin/Hostx64/x64/cl.exe - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.40.33807/bin/Hostx64/x64/cl.exe - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- GUDHI version : 3.10.0rc2
CMake Warning (dev) at src/cmake/modules/GUDHI_third_party_libraries.cmake:3 (find_package):
  Policy CMP0167 is not set: The FindBoost module is removed.  Run "cmake
  --help-policy CMP0167" for policy details.  Use the cmake_policy command to
  set the policy and suppress this warning.

Call Stack (most recent call first):
  CMakeLists.txt:20 (include)
This warning is for project developers.  Use -Wno-dev to suppress it.

-- Boost toolset is unknown (compiler MSVC 19.40.33811.0)
-- Boost toolset is unknown (compiler MSVC 19.40.33811.0)
-- Boost toolset is unknown (compiler MSVC 19.40.33811.0)
++ BOOST version 1.84.0. Includes found in C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/include, libraries found in C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/lib
-- Could NOT find GMP (missing: GMP_LIBRARIES GMP_INCLUDE_DIR)
-- Visual Leak Detector (VLD) is not found.
-- Using header-only CGAL
CMake Error at C:/Program Files/CMake/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
  Could NOT find GMP (missing: GMP_LIBRARIES GMP_INCLUDE_DIR)
Call Stack (most recent call first):
  C:/Program Files/CMake/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:603 (_FPHSA_FAILURE_MESSAGE)
  src/cmake/modules/FindGMP.cmake:54 (find_package_handle_standard_args)
  C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/lib/cmake/CGAL/CGAL_SetupGMP.cmake:24 (find_package)
  C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/lib/cmake/CGAL/CGAL_SetupCGALDependencies.cmake:37 (include)
  C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/lib/cmake/CGAL/CGALConfig.cmake:168 (include)
  src/cmake/modules/GUDHI_third_party_libraries.cmake:30 (find_package)
  CMakeLists.txt:20 (include)

-- Configuring incomplete, errors occurred!

Not sure what to do then...

mglisse commented 6 days ago
    except ImportError:
        pass

Maybe print("Missing pandas, skipping set_output test") instead of pass?

mglisse commented 6 days ago

Not sure what to do then...

Hmm, we'll need to investigate that :disappointed: Can you confirm that the package mpir is installed? In C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/include, do you have files mpir.h and gmp.h? What libraries with similar names do you have in lib/ or bin/?

mglisse commented 6 days ago

@VincentRouvreau IIRC you adapted FindGMP.cmake from CGAL? The one in CGAL seems to have had a few patches since then, maybe we need them as well...

There could also be bad interactions where CGAL_SetupGMP.cmake calls find_package(GMP) expecting its own FindGMP.cmake to be used, but ours ends up earlier in the path, so it has to be compatible.

martinroyer commented 6 days ago

Not sure what to do then...

Hmm, we'll need to investigate that 😞 Can you confirm that the package mpir is installed? In C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/include, do you have files mpir.h and gmp.h? What libraries with similar names do you have in lib/ or bin/?

Yep those two are there, mpir.h and gmp.h, and bin/ has gmp.dll, mpir.dll, and lib/ has gmp.lib and mpir.lib.

martinroyer commented 6 days ago

and mpir is installed, mpir 3.0.0 he025d50_1002 conda-forge

mglisse commented 6 days ago

Strange, on windows, I created a fresh miniforge environment, installed cgal-cpp, cmake and git in this env, cloned the gudhi-devel repository, went to a build subdir, ran cmake .., and it did find GMP just fine... Maybe you could run cmake with --debug-find-pkg=GMP so it tells us more? The cmake you are using is the one from conda?

martinroyer commented 6 days ago

Strange, on windows, I created a fresh miniforge environment, installed cgal-cpp, cmake and git in this env, cloned the gudhi-devel repository, went to a build subdir, ran cmake .., and it did find GMP just fine... Maybe you could run cmake with --debug-find-pkg=GMP so it tells us more? The cmake you are using is the one from conda?

Oh good find indeed! I get that find_path considered the following locations:

C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/include/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/mingw-w64/bin/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/mingw-w64/bin/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/usr/bin/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/usr/bin/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/bin/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/bin/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Scripts/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Scripts/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/bin/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/bin/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/condabin/include/gmp.h
C:/Users/martin.royer/AppData/Local/miniconda3/condabin/gmp.h

... but none of these locations is C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/include where the file really is! Any idea how to solve this?

martinroyer commented 6 days ago

I used a cmake I downloaded and installed, let's try with one from conda.

martinroyer commented 6 days ago

I used a cmake I downloaded and installed, let's try with one from conda.

... and that did it for me! -- Found GMP: C:/Users/martin.royer/AppData/Local/miniconda3/envs/gudhi/Library/lib/gmp.lib. I could install and run the .cc files that I couldn't before. Thanks Marc.