JuliaAI / MLJModels.jl

Home of the MLJ model registry and tools for model queries and model code loading
MIT License

Update model registry and "List of Models" to reflect addition of CatBoost models #501

Closed by ablaom 1 year ago

ablaom commented 1 year ago

@azev77 @ericphanson

ablaom commented 1 year ago

Okay, I've run into a problem here.

First, note that the way the registry currently works, all model-providing packages must be imported simultaneously. In hindsight this sounds like a dumb idea, but it hasn't actually caused many problems.

However, ScikitLearn.jl and CatBoost.jl are not playing nicely together:

# in fresh environment:
(jl_AmJisH) pkg> add ScikitLearn CatBoost

julia> using CatBoost

julia> using ScikitLearn
Error processing line 1 of /Users/anthony/anaconda2/envs/py37/lib/python3.7/site-packages/matplotlib-3.4.3-py3.7-nspkg.pth:

Fatal Python error: initsite: Failed to import the site module
Traceback (most recent call last):
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/site.py", line 168, in addpackage
    exec(line)
  File "<string>", line 1, in <module>
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/importlib/util.py", line 14, in <module>
    from contextlib import contextmanager
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/contextlib.py", line 5, in <module>
    from collections import deque
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/collections/__init__.py", line 24, in <module>
    import heapq as _heapq
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/heapq.py", line 587, in <module>
    from _heapq import *
SystemError: initialization of _heapq did not return an extension module

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/site.py", line 579, in <module>
    main()
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/site.py", line 566, in main
    known_paths = addsitepackages(known_paths)
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/site.py", line 349, in addsitepackages
    addsitedir(sitedir, known_paths)
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/site.py", line 207, in addsitedir
    addpackage(sitedir, name, known_paths)
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/site.py", line 178, in addpackage
    import traceback
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/traceback.py", line 3, in <module>
    import collections
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/collections/__init__.py", line 24, in <module>
    import heapq as _heapq
  File "/Users/anthony/anaconda2/envs/py37/lib/python3.7/heapq.py", line 587, in <module>
    from _heapq import *
SystemError: initialization of _heapq did not return an extension module

and Julia exits.

@ericphanson @tylerjthomas9 Any insights here?

ericphanson commented 1 year ago

No 🤔. I don't see any issues in upstream catboost either: https://github.com/catboost/catboost/issues?q=is%3Aissue+scikitlearn+

But I do find this one: https://github.com/JuliaPy/pyjulia/issues/150

tylerjthomas9 commented 1 year ago

https://github.com/cjdoris/PythonCall.jl/issues/220

I think that PyCall.jl (used by ScikitLearn.jl) and PythonCall.jl (used by CatBoost.jl) are calling different Python versions. Here is a (not very pretty) way of fixing this conflict between the two libraries, which rebuilds PyCall against the Python executable that PythonCall is using:

pkg> add ScikitLearn CatBoost PythonCall

julia> using PythonCall, Pkg

julia> ENV["PYTHON"] = PythonCall.C.CTX.exe_path

julia> Pkg.build("PyCall")

julia> using CatBoost

julia> using ScikitLearn
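
A possible alternative, sketched here under assumptions rather than tested in this thread, is to go the other way and point PythonCall at PyCall's interpreter. The PythonCall documentation describes the special value `"@PyCall"` for `JULIA_PYTHONCALL_EXE`, combined with the `"Null"` CondaPkg backend. Whether this option exists in the PythonCall version in use, and whether the `catboost` Python package is already installed in PyCall's Python, are assumptions:

```julia
# Run in a fresh Julia session, before loading any Python-backed package.
# Assumption: the installed PythonCall version supports the "@PyCall" value.
ENV["JULIA_CONDAPKG_BACKEND"] = "Null"    # do not let CondaPkg create its own Python environment
ENV["JULIA_PYTHONCALL_EXE"] = "@PyCall"   # ask PythonCall to reuse the interpreter PyCall was built with

using PyCall       # load PyCall first so its interpreter is the one in use
using PythonCall   # should now attach to the same Python

# Assumption: the Python packages `catboost` and `scikit-learn` must already be
# installed in PyCall's Python, since the "Null" backend installs nothing.
using CatBoost
using ScikitLearn
```

With this configuration Python dependencies are no longer managed automatically, so it trades one maintenance burden for another.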

ablaom commented 1 year ago

@tylerjthomas9 This is a great help and explains the problem. Unfortunately, after playing around for a few hours, I cannot get things to work locally in the context of the model registry process. And this also needs to work in CI, which checks the registry. There may be a way, but I can see this is going to be a high-maintenance hack.

The bigger picture is that MLJ users do want to load multiple models simultaneously for model comparison, but it doesn't seem this can work at present for PyCall / PythonCall models without introducing package management headaches that are beyond the average user.

For now, I will add CatBoost.jl to the list of Third Party Packages to give it some visibility. We can also add it to the List of Models, with an asterisk marking it as unregistered. Happy to hear your thoughts on this.

For the record, OutlierDetectionPython.jl also uses PyCall, and its models are in the MLJ Model Registry. @davnn Do you have any inclination to move towards PythonCall? In discussions I have been having, it seems there is some consensus, and even some time commitment, to move ScikitLearn in this direction.

davnn commented 1 year ago

> For the record, OutlierDetectionPython.jl also uses PyCall, and its models are in the MLJ Model Registry. @davnn Do you have any inclination to move towards PythonCall? In discussions I have been having, it seems there is some consensus, and even some time commitment, to move ScikitLearn in this direction.

I would be happy to switch to PythonCall, but the last time I checked I ran into some trouble, so I have kept PyCall for now. I'll reevaluate in the near future.