Closed by juhoinkinen 11 months ago
Patch coverage: 100.00% and no project coverage change.
Comparison is base (40cc2fd) 99.67% compared to head (e037b78) 99.67%.
When updating to scikit-learn v1.3.0 old MLLM models cease to work:
...
File "/home/local/jmminkin/git/Annif/annif/backend/backend.py", line 142, in suggest
self.initialize()
File "/home/local/jmminkin/git/Annif/annif/backend/mllm.py", line 119, in initialize
self._model = self._load_model()
File "/home/local/jmminkin/git/Annif/annif/backend/mllm.py", line 102, in _load_model
return MLLMModel.load(path)
File "/home/local/jmminkin/git/Annif/annif/lexical/mllm.py", line 366, in load
return joblib.load(filename)
File "/home/local/jmminkin/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.8/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 658, in load
obj = _unpickle(fobj, filename, mmap_mode)
File "/home/local/jmminkin/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.8/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 577, in _unpickle
obj = unpickler.load()
File "/usr/lib/python3.8/pickle.py", line 1212, in load
dispatch[key[0]](self)
File "/home/local/jmminkin/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.8/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 402, in load_build
Unpickler.load_build(self)
File "/usr/lib/python3.8/pickle.py", line 1705, in load_build
setstate(state)
File "sklearn/tree/_tree.pyx", line 714, in sklearn.tree._tree.Tree.__setstate__
File "sklearn/tree/_tree.pyx", line 1418, in sklearn.tree._tree._check_node_ndarray
ValueError: node array from the pickle has an incompatible dtype:
- expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
- got : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]
Good catch re: scikit-learn and MLLM. There's not much we can do about it, I'm afraid. I think it's better to update now, before 1.0, instead of postponing the inevitable. But it has to be mentioned in the release notes.
The scikit-learn documentation on Model Persistence has some notes about compatibility between different environments and versions. It mentions the PMML and ONNX formats that could possibly be more durable than serializing sklearn models directly via joblib or pickle, as we currently do in MLLM. But that would be a whole new investigation. I think we should just consider this an unfortunate situation that we couldn't avoid.
Also old stwfsa models won't work with the updated scikit-learn; the error message is the same as for MLLM models.
I'm seeing this new TensorFlow/Keras warning for test_backend_nn_ensemble.py:
UserWarning: You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')`.
It seems there is a new Keras format (`.keras`) available, which is nowadays the recommended one for saving Keras models. We are using the legacy HDF5 (`.h5`) format: https://github.com/NatLibFi/Annif/blob/40cc2fd0325b1c648abf4d436f4a05ee83c9599b/annif/backend/nn_ensemble.py#L100
Kudos, SonarCloud Quality Gate passed!
0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells
No Coverage information
0.0% Duplication
What should we do about this for Annif 1.0?
1. Nothing, just ignore the warning for now
2. Switch to the Keras format in a simplistic way (basically just switching the file extension to `.keras` on the above line)
3. Switch to `.keras` but additionally add fallback code that can load existing `.h5` models as well
4. Same as 3, but additionally add a deprecation warning that the fallback support will be removed in Annif 1.1.
I don't like 1, because it postpones an inevitable problem. Option 2 breaks compatibility with previously trained models, but provides a fresh start. Option 3 adds a few lines of extra code (and tests), while option 4 is more work but also "promises" to remove that extra code in the future.
Option 3 seems best to me. Better promise as little as possible.
I initially implemented option 3 in PR #730, but the fallback code (especially the unit test) got messy and then I realized that it's not much use supporting old pre-1.0 NN ensemble models when MLLM and STWFSA models will break anyway due to changes in scikit-learn.
Updates most of the outdated dependencies, and pins Flask and rdflib versions more tightly.
Exceptions to updating to the most-recent releases:
Also closes #697 and #532.