Update dependencies v1.0

juhoinkinen commented 11 months ago

Updates most of the outdated dependencies, and pins Flask and rdflib versions more tightly.

Exceptions to updating to most-recent releases:

Flask is pinned to 2.2. instead 2.3. due to the requirement by Connexion v2.14.2
rdflib is pinned 6.3. instead of 7. due to the requirement by stwfsapy
Updating to numpy 1.25.* would require Python 3.9-3.11
Updating to scipy 1.11.* would require Python 3.9-3.11
~Updating to Optuna 3.3.0 (from 2.10.1) is left to wait for the possible implementation changes in Annif (see issue #532)~

Also closes #697 and #532.

codecov[bot] commented 11 months ago

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (40cc2fd) 99.67% compared to head (e037b78) 99.67%.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #726 +/- ## ======================================= Coverage 99.67% 99.67% ======================================= Files 89 89 Lines 6397 6397 ======================================= Hits 6376 6376 Misses 21 21 ``` | [Files Changed](https://app.codecov.io/gh/NatLibFi/Annif/pull/726?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi) | Coverage Δ | | |---|---|---| | [annif/backend/ensemble.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/726?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-YW5uaWYvYmFja2VuZC9lbnNlbWJsZS5weQ==) | `100.00% <ø> (ø)` | | | [annif/backend/svc.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/726?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-YW5uaWYvYmFja2VuZC9zdmMucHk=) | `100.00% <100.00%> (ø)` | |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

juhoinkinen commented 11 months ago

When updating to scikit-learn v1.3.0 old MLLM models cease to work:

...

  File "/home/local/jmminkin/git/Annif/annif/backend/backend.py", line 142, in suggest
    self.initialize()
  File "/home/local/jmminkin/git/Annif/annif/backend/mllm.py", line 119, in initialize
    self._model = self._load_model()
  File "/home/local/jmminkin/git/Annif/annif/backend/mllm.py", line 102, in _load_model
    return MLLMModel.load(path)
  File "/home/local/jmminkin/git/Annif/annif/lexical/mllm.py", line 366, in load
    return joblib.load(filename)
  File "/home/local/jmminkin/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.8/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 658, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/local/jmminkin/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.8/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 577, in _unpickle
    obj = unpickler.load()
  File "/usr/lib/python3.8/pickle.py", line 1212, in load
    dispatch[key[0]](self)
  File "/home/local/jmminkin/.cache/pypoetry/virtualenvs/annif-ul-EXdhi-py3.8/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 402, in load_build
    Unpickler.load_build(self)
  File "/usr/lib/python3.8/pickle.py", line 1705, in load_build
    setstate(state)
  File "sklearn/tree/_tree.pyx", line 714, in sklearn.tree._tree.Tree.__setstate__
  File "sklearn/tree/_tree.pyx", line 1418, in sklearn.tree._tree._check_node_ndarray
ValueError: node array from the pickle has an incompatible dtype:
- expected: {'names': ['left_child', 'right_child', 'feature', 'threshold', 'impurity', 'n_node_samples', 'weighted_n_node_samples', 'missing_go_to_left'], 'formats': ['<i8', '<i8', '<i8', '<f8', '<f8', '<i8', '<f8', 'u1'], 'offsets': [0, 8, 16, 24, 32, 40, 48, 56], 'itemsize': 64}
- got     : [('left_child', '<i8'), ('right_child', '<i8'), ('feature', '<i8'), ('threshold', '<f8'), ('impurity', '<f8'), ('n_node_samples', '<i8'), ('weighted_n_node_samples', '<f8')]

osma commented 11 months ago

Good catch re: scikit-learn and MLLM. There's not much we can do about it, I'm afraid. I think it's better to update now, before 1.0, instead of postponing the inevitable. But it has to be mentioned in the release notes.

The scikit-learn documentation on Model Persistence has some notes about compatibility between different environments and versions. It mentions the PMML and ONNX formats that could possibly be more durable than serializing sklearn models directly via joblib or pickle, as we currently do in MLLM. But that would be a whole new investigation. I think we should just consider this an unfortunate situation that we couldn't avoid.

juhoinkinen commented 11 months ago

Good catch re: scikit-learn and MLLM. There's not much we can do about it, I'm afraid. I think it's better to update now, before 1.0, instead of postponing the inevitable. But it has to be mentioned in the release notes.

The scikit-learn documentation on Model Persistence has some notes about compatibility between different environments and versions. It mentions the PMML and ONNX formats that could possibly be more durable than serializing sklearn models directly via joblib or pickle, as we currently do in MLLM. But that would be a whole new investigation. I think we should just consider this an unfortunate situation that we couldn't avoid.

Also old stwfsa models won't work with updated skikit-learn, the error message is the same as for MLLM models.

osma commented 11 months ago

I'm seeing this new TensorFlow/Keras warning for test_backend_nn_ensemble.py:

UserWarning: You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')

It seems there is a new Keras format (.keras) available, which is nowadays the recommended one for saving Keras models. We are using the legacy HDF5 (.h5) format: https://github.com/NatLibFi/Annif/blob/40cc2fd0325b1c648abf4d436f4a05ee83c9599b/annif/backend/nn_ensemble.py#L100

What should we do about this for Annif 1.0?

Nothing, just ignore the warning for now
Switch to the Keras format in a simplistic way (basically just switching the file extension to .keras on the above line)
Switch to .keras but additionally add fallback code that can load existing .h5 models as well
Same as 3, but additionally add a deprecation warning that the fallback support will be removed in Annif 1.1.

I don't like 1, because it postpones an inevitable problem. Option 2 breaks compatibility with previously trained models, but provides a fresh start. Option 3 adds a few lines of extra code (and tests), while option 4 is more work but also "promises" to remove that extra code in the future.

sonarcloud[bot] commented 11 months ago

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
0 Code Smells

No Coverage information
0.0% Duplication

juhoinkinen commented 11 months ago

What should we do about this for Annif 1.0?
1. Nothing, just ignore the warning for now

2. Switch to the Keras format in a simplistic way (basically just switching the file extension to `.keras` on the above line)

3. Switch to `.keras` but additionally add fallback code that can load existing `.h5` models as well

4. Same as 3, but additionally add a deprecation warning that the fallback support will be removed in Annif 1.1.
I don't like 1, because it postpones an inevitable problem. Option 2 breaks compatibility with previously trained models, but provides a fresh start. Option 3 adds a few lines of extra code (and tests), while option 4 is more work but also "promises" to remove that extra code in the future.

Option 3 seems best to me. Better promise as little as possible.

osma commented 11 months ago

Option 3 seems best to me. Better promise as little as possible.

I initially implemented option 3 in PR #730, but the fallback code (especially the unit test) got messy and then I realized that it's not much use supporting old pre-1.0 NN ensemble models when MLLM and STWFSA models will break anyway due to changes in scikit-learn.

NatLibFi / Annif

Update dependencies v1.0 #726

Codecov Report