arogozhnikov / hep_ml

Machine Learning for High Energy Physics.
https://arogozhnikov.github.io/hep_ml/

Can this be saved to JSON instead of pickle? #86

Closed acampove closed 1 month ago

acampove commented 1 month ago

Hi,

When using this tool I tend to save the weights as a pickle file. However when loading them back I see things like:

  File "/home/acampove/Packages/RK/scripts/src/utils_noroot.py", line 1231, in load_pickle                                                  
    obj=pickle.load(open(path, 'rb'))                                                                                                       
  File "/home/acampove/Packages/micromamba/envs/rk/lib/python3.10/site-packages/dill/_dill.py", line 289, in load                           
    return Unpickler(file, ignore=ignore, **kwds).load()                                                                                    
  File "/home/acampove/Packages/micromamba/envs/rk/lib/python3.10/site-packages/dill/_dill.py", line 444, in load                           
    obj = StockUnpickler.load(self)                                                                                                         
  File "_tree.pyx", line 867, in sklearn.tree._tree.Tree.__setstate__                                                                       
  File "_tree.pyx", line 1573, in sklearn.tree._tree._check_node_ndarray                                                                    
ValueError: node array from the pickle has an incompatible dtype:                                                                           
- expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
- got     : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]

which most likely means that the version used to train the GBReweighter and pickle it is different from the version I am using now. In practice this pickle file is now useless unless I can find the version I used. This is very tedious and dangerous. Is there a way to save the actual information, rather than the object, as text?

arogozhnikov commented 1 month ago

Hi @acampove,

hep_ml has not changed any fields for a while; we rely on sklearn's default serialization (which is pickle).

The differences come from a change of sklearn version, in particular between 1.2 and 1.3; see this issue: https://github.com/scikit-learn/scikit-learn/issues/26798
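
To make the mismatch concrete, here is a minimal sketch in plain Python that compares the two field lists copied from the traceback above. The variable names are only illustrative; nothing here calls sklearn.

```python
# Field names copied from the "expected" and "got" lines of the traceback.
expected = [
    "left_child", "right_child", "feature", "threshold", "impurity",
    "n_node_samples", "weighted_n_node_samples", "missing_go_to_left",
]
got = [
    "left_child", "right_child", "feature", "threshold", "impurity",
    "n_node_samples", "weighted_n_node_samples",
]

# Fields the installed sklearn expects but the pickled node array lacks.
missing = [name for name in expected if name not in got]
print(missing)  # ['missing_go_to_left']
```

So the pickled trees differ from the expected layout by a single extra field, which is enough for the dtype check to reject them.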

It is a bit surprising that sklearn changed the format (they never promise they won't, but they do try to keep this compatibility).

Unfortunately there is no simple way to fix this on the hep_ml side: one would need a persistent format for sklearn trees anyway, and sklearn doesn't provide one.
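
Since the pickle is tied to the library versions it was written with, one possible workaround is to record those versions next to the file. Below is a minimal sketch using only the standard library. The function names and the `.versions.json` sidecar suffix are hypothetical and not part of hep_ml or sklearn; the package names are taken from the pip listing in this thread. This does not make an old pickle loadable. It only tells you which environment to recreate.

```python
import json
import pickle
from importlib import metadata
from pathlib import Path


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def save_with_versions(obj, path, packages=("scikit-learn", "hep-ml", "numpy")):
    """Pickle obj and write a JSON sidecar listing the package versions in use."""
    path = Path(path)
    with path.open("wb") as f:
        pickle.dump(obj, f)
    # Hypothetical naming convention: <pickle file name>.versions.json
    sidecar = path.with_name(path.name + ".versions.json")
    sidecar.write_text(json.dumps(installed_versions(packages), indent=2))
    return sidecar


def version_mismatches(path):
    """Return {package: (saved, current)} for packages whose version changed."""
    path = Path(path)
    sidecar = path.with_name(path.name + ".versions.json")
    saved = json.loads(sidecar.read_text())
    current = installed_versions(saved)
    return {k: (saved[k], current[k]) for k in saved if saved[k] != current[k]}
```

Calling `version_mismatches("weights.pkl")` before unpickling (the file name is just an example) would then report, for instance, that scikit-learn moved from one version to another, instead of failing inside `Tree.__setstate__`.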

acampove commented 1 month ago

Hello @arogozhnikov

Thanks for your reply, I confirm that the problem was with the version of scikit-learn. I just created a virtual environment and tried:

Package         Version
--------------- -------
dill            0.3.9
hep-ml          0.7.2
joblib          1.4.2
numpy           1.26.4
pandas          2.2.3
pip             24.2
python-dateutil 2.9.0
pytz            2024.1
scikit-learn    1.2.2
scipy           1.14.1
setuptools      75.1.0
six             1.16.0
threadpoolctl   3.5.0
tzdata          2024.2
wheel           0.44.0

and it seems to unpickle it. By the way, dill seems to be also needed, but it's not installed as a requirement of hep_ml.

arogozhnikov commented 1 month ago

> dill seems to be also needed

You shouldn't need dill; something in your env overrides pickle with dill (e.g. your traceback above shows that load_pickle somehow calls dill; pickle is the system default and wouldn't fall back to dill).
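
A quick way to check what the name `pickle` refers to in a given script is to print the module's real name and file. This is only a sketch: the helper name is made up, and the guess that something does `import dill as pickle` is an assumption, not something visible in the traceback.

```python
import pickle
import sys


def describe_module(mod):
    """Return the real name and file of a module, whatever alias it was imported under."""
    return mod.__name__, getattr(mod, "__file__", None)


name, location = describe_module(pickle)
print(name, location)

# True if some earlier import already pulled dill into this process.
print("dill imported in this process:", "dill" in sys.modules)
```

If the same check inside utils_noroot.py reports `dill` rather than `pickle`, the override comes from that file's imports and not from hep_ml.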

arogozhnikov commented 1 month ago

anyway, glad you got it working