NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org
Other
190 stars 41 forks source link

Update dependencies v1.0 #726

Closed juhoinkinen closed 11 months ago

juhoinkinen commented 11 months ago

Updates most of the outdated dependencies, and pins Flask and rdflib versions more tightly.

Exceptions to updating to most-recent releases:

Also closes #697 and #532.

codecov[bot] commented 11 months ago

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (40cc2fd) 99.67% compared to head (e037b78) 99.67%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #726 +/- ## ======================================= Coverage 99.67% 99.67% ======================================= Files 89 89 Lines 6397 6397 ======================================= Hits 6376 6376 Misses 21 21 ``` | [Files Changed](https://app.codecov.io/gh/NatLibFi/Annif/pull/726?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi) | Coverage Δ | | |---|---|---| | [annif/backend/ensemble.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/726?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-YW5uaWYvYmFja2VuZC9lbnNlbWJsZS5weQ==) | `100.00% <ø> (ø)` | | | [annif/backend/svc.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/726?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-YW5uaWYvYmFja2VuZC9zdmMucHk=) | `100.00% <100.00%> (ø)` | |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

juhoinkinen commented 11 months ago

When updating to scikit-learn v1.3.0 old MLLM models cease to work:

...

  File "/home/local/jmminkin/git/Annif/annif/backend/backend.py", line 142, in suggest
    self.initialize()
  File "/home/local/jmminkin/git/Annif/annif/backend/mllm.py", line 119, in initialize
    self._model = self._load_model()
  File "/home/local/jmminkin/git/Annif/annif/backend/mllm.py", line 102, in _load_model
    return MLLMModel.load(path)
  File "/home/local/jmminkin/git/Annif/annif/lexical/mllm.py", line 366, in load
    return joblib.load(filename)
  File "/home/local/jmminkin/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.8/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 658, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/local/jmminkin/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.8/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 577, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.8/pickle.py", line 1212, in load
    dispatch[key[0]](self)
  File "/home/local/jmminkin/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.8/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 402, in load_build
    Unpickler.load_build(self)
  File "/usr/lib/python3.8/pickle.py", line 1705, in load_build
    setstate(state)
  File "sklearn/tree/_tree.pyx", line 714, in sklearn.tree._tree.Tree.__setstate__
  File "sklearn/tree/_tree.pyx", line 1418, in sklearn.tree._tree._check_node_ndarray
ValueError: node array from the pickle has an incompatible dtype:
- expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
- got     : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]
osma commented 11 months ago

Good catch re: scikit-learn and MLLM. There's not much we can do about it, I'm afraid. I think it's better to update now, before 1.0, instead of postponing the inevitable. But it has to be mentioned in the release notes.

The scikit-learn documentation on Model Persistence has some notes about compatibility between different environments and versions. It mentions the PMML and ONNX formats that could possibly be more durable than serializing sklearn models directly via joblib or pickle, as we currently do in MLLM. But that would be a whole new investigation. I think we should just consider this an unfortunate situation that we couldn't avoid.

juhoinkinen commented 11 months ago

Good catch re: scikit-learn and MLLM. There's not much we can do about it, I'm afraid. I think it's better to update now, before 1.0, instead of postponing the inevitable. But it has to be mentioned in the release notes.

The scikit-learn documentation on Model Persistence has some notes about compatibility between different environments and versions. It mentions the PMML and ONNX formats that could possibly be more durable than serializing sklearn models directly via joblib or pickle, as we currently do in MLLM. But that would be a whole new investigation. I think we should just consider this an unfortunate situation that we couldn't avoid.

Also old stwfsa models won't work with updated skikit-learn, the error message is the same as for MLLM models.

osma commented 11 months ago

I'm seeing this new TensorFlow/Keras warning for test_backend_nn_ensemble.py:

UserWarning: You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')

It seems there is a new Keras format (.keras) available, which is nowadays the recommended one for saving Keras models. We are using the legacy HDF5 (.h5) format: https://github.com/NatLibFi/Annif/blob/40cc2fd0325b1c648abf4d436f4a05ee83c9599b/annif/backend/nn_ensemble.py#L100

What should we do about this for Annif 1.0?

  1. Nothing, just ignore the warning for now
  2. Switch to the Keras format in a simplistic way (basically just switching the file extension to .keras on the above line)
  3. Switch to .keras but additionally add fallback code that can load existing .h5 models as well
  4. Same as 3, but additionally add a deprecation warning that the fallback support will be removed in Annif 1.1.

I don't like 1, because it postpones an inevitable problem. Option 2 breaks compatibility with previously trained models, but provides a fresh start. Option 3 adds a few lines of extra code (and tests), while option 4 is more work but also "promises" to remove that extra code in the future.

sonarcloud[bot] commented 11 months ago

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

juhoinkinen commented 11 months ago

What should we do about this for Annif 1.0?

1. Nothing, just ignore the warning for now

2. Switch to the Keras format in a simplistic way (basically just switching the file extension to `.keras` on the above line)

3. Switch to `.keras` but additionally add fallback code that can load existing `.h5` models as well

4. Same as 3, but additionally add a deprecation warning that the fallback support will be removed in Annif 1.1.

I don't like 1, because it postpones an inevitable problem. Option 2 breaks compatibility with previously trained models, but provides a fresh start. Option 3 adds a few lines of extra code (and tests), while option 4 is more work but also "promises" to remove that extra code in the future.

Option 3 seems best to me. Better promise as little as possible.

osma commented 11 months ago

Option 3 seems best to me. Better promise as little as possible.

I initially implemented option 3 in PR #730, but the fallback code (especially the unit test) got messy and then I realized that it's not much use supporting old pre-1.0 NN ensemble models when MLLM and STWFSA models will break anyway due to changes in scikit-learn.