NatLibFi / Annif

Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums.
https://annif.org

Fix train state and modification time for unfinished project training #722

Closed juhoinkinen closed 11 months ago

juhoinkinen commented 11 months ago

If the initial training of a project is interrupted after any files have been created in the project's data directory, the train state and modification time are reported incorrectly: the project appears to be (fully) trained when it is not, and it has a modification time. Likewise, when retraining of a project is interrupted, the modification time is falsely updated.

This problem is likely to occur more often if the --prepare-only option is implemented for the train command.

This PR makes the methods that query the train state and modification time ignore files in the project's datadir that match the patterns *-train*, tmp-* and vectorizer. The tmp- prefix is now added to all temporary files, because some backends use a tempfile for the model file during training, and that file can remain after unfinished training, e.g. stwfsa_predictor1mz8z4im.zip.
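As a rough illustration of the idea (not the code in this PR), the check could look something like the sketch below. The three patterns are the ones listed above; the function names and the use of `fnmatch` are assumptions made for the example.

```python
import fnmatch
import os
from datetime import datetime, timezone

# Patterns named in the PR description. The helper names below are
# hypothetical and not taken from the Annif code base.
IGNORE_PATTERNS = ("*-train*", "tmp-*", "vectorizer")


def model_files(datadir):
    """Return paths of files in datadir that match no ignore pattern."""
    if not os.path.isdir(datadir):
        return []
    return [
        os.path.join(datadir, name)
        for name in sorted(os.listdir(datadir))
        if not any(fnmatch.fnmatch(name, pat) for pat in IGNORE_PATTERNS)
    ]


def is_trained(datadir):
    """A project counts as trained only if some non-ignored file exists."""
    return bool(model_files(datadir))


def modification_time(datadir):
    """Latest mtime among non-ignored files, or None if there are none."""
    mtimes = [os.path.getmtime(path) for path in model_files(datadir)]
    if not mtimes:
        return None
    return datetime.fromtimestamp(max(mtimes), tz=timezone.utc)
```

With this approach, a datadir that holds only train files, tmp- files or a vectorizer file is reported as untrained and has no modification time.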

The train and temp files should definitely be ignored, but the vectorizer file case is not so clear:

Instead of using global ignore patterns, this functionality could use the actual model file names/patterns of each backend, but the name of the field storing them varies a bit (MODEL_FILE, INDEX_FILE, MODEL_FILE_PREFIX).
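A minimal sketch of that per-backend alternative follows. It assumes the three fields are string class attributes, that MODEL_FILE and INDEX_FILE hold exact file names, and that MODEL_FILE_PREFIX holds a file name prefix. The function name, the example classes and their values are hypothetical.

```python
import os

# Attribute names from the PR description; how they are interpreted here
# (exact name vs. prefix) is an assumption for this sketch.
EXACT_NAME_ATTRS = ("MODEL_FILE", "INDEX_FILE")
PREFIX_ATTR = "MODEL_FILE_PREFIX"


def backend_model_files(backend_cls, datadir):
    """List file names in datadir that the backend declares as model files."""
    exact = set()
    for attr in EXACT_NAME_ATTRS:
        value = getattr(backend_cls, attr, None)
        if isinstance(value, str):
            exact.add(value)
    prefix = getattr(backend_cls, PREFIX_ATTR, None)
    if not isinstance(prefix, str):
        prefix = None
    if not os.path.isdir(datadir):
        return []
    return [
        name
        for name in sorted(os.listdir(datadir))
        if name in exact or (prefix is not None and name.startswith(prefix))
    ]


# Hypothetical backends, only for illustrating the two naming styles.
class ExampleSingleFileBackend:
    MODEL_FILE = "example-model"


class ExamplePrefixBackend:
    MODEL_FILE_PREFIX = "example-model-"
```

The train state would then be "trained" only when this list is non-empty, so stray train, temp and vectorizer files need no special handling.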

codecov[bot] commented 11 months ago

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (02f1533) 99.67% compared to head (54f4136) 99.67%.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main     #722   +/-   ##
=======================================
  Coverage   99.67%   99.67%
=======================================
  Files          89       89
  Lines        6380     6397   +17
=======================================
+ Hits         6359     6376   +17
  Misses         21       21
```

| [Files Changed](https://app.codecov.io/gh/NatLibFi/Annif/pull/722?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi) | Coverage Δ | |
|---|---|---|
| [annif/backend/backend.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/722?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-YW5uaWYvYmFja2VuZC9iYWNrZW5kLnB5) | `100.00% <100.00%> (ø)` | |
| [annif/util.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/722?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-YW5uaWYvdXRpbC5weQ==) | `98.48% <100.00%> (+0.02%)` | :arrow_up: |
| [tests/test\_project.py](https://app.codecov.io/gh/NatLibFi/Annif/pull/722?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=NatLibFi#diff-dGVzdHMvdGVzdF9wcm9qZWN0LnB5) | `100.00% <100.00%> (ø)` | |


sonarcloud[bot] commented 11 months ago

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
1 Code Smell (rating A)

No Coverage information
0.0% Duplication

juhoinkinen commented 11 months ago

To make the state of projects more coherent, the first step of training could be to remove the old data files. That would settle the question about the modification time of the vectorizer file: it would be clear that the vectorizer file does not affect the modification time or train state of the project. (Updating the vectorizer file without updating the rest of the model could also break the project, either silently or completely, because of a wrong index.)

This would also keep the project's datadir clean of any temp files hanging around.

However, this would also mean less graceful behavior: if retraining a working project fails, the project is gone, whereas now the old project could remain working (depending on which step fails).
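The trade-off can be made concrete with a small sketch. It is only an illustration of the "remove old data files first" idea. The names `clear_datadir` and `train_from_scratch` and the `train_fn` callback are hypothetical and not part of Annif.

```python
import os
import shutil


def clear_datadir(datadir):
    """Remove everything inside datadir but keep the directory itself."""
    if not os.path.isdir(datadir):
        return
    # Collect entries first so the directory is not modified while scanning.
    with os.scandir(datadir) as it:
        entries = list(it)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.remove(entry.path)


def train_from_scratch(datadir, train_fn):
    """Wipe old data files, then run the (hypothetical) training callback."""
    clear_datadir(datadir)
    os.makedirs(datadir, exist_ok=True)
    # If train_fn raises here, the previous model is already gone.
    train_fn(datadir)
```

If `train_fn` fails, the datadir is left empty or partially filled, which is exactly the loss of the old working project described above.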