Closed juhoinkinen closed 11 months ago
Patch coverage: 100.00% and no project coverage change.
Comparison is base (02f1533) 99.67% compared to head (54f4136) 99.67%.
Kudos, SonarCloud Quality Gate passed!
0 Bugs
0 Vulnerabilities
0 Security Hotspots
1 Code Smell
No Coverage information
0.0% Duplication
To make the state of projects more coherent, the first step of training could be to remove old data files. That would avoid the problem with the modification time of the vectorizer file (updating it without updating the rest of the model could also break the project, either silently or completely, because of a wrong index): it would be clear that the vectorizer file does not affect the modification time or train state of the project.
This would also keep the project's datadir clean of any temp files hanging around.
However, this would also mean less graceful behavior: if retraining a working project fails, the project is gone, whereas now the old project could remain working (depending on which step fails).
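The trade-off described above can be illustrated with a minimal sketch. It assumes a hypothetical helper `retrain_clean_first` and a hypothetical `train_fn` callback; neither name comes from the project's code, and the real training flow may differ.

```python
import os
import shutil


def retrain_clean_first(datadir, train_fn):
    """Hypothetical helper: empty the data directory, then train into it.

    If train_fn raises, the old model files are already gone, which is
    the less graceful failure mode mentioned in the comment above.
    """
    if os.path.isdir(datadir):
        shutil.rmtree(datadir)  # drops old model files and leftover temp files
    os.makedirs(datadir)
    train_fn(datadir)  # assumed to write the new model files into datadir
```

With this approach a failed retraining leaves an empty data directory, so the project would be reported as untrained rather than stale.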
If the initial training of a project does not finish after files have already been created in the project's data directory, the train state and modification time are reported incorrectly: the project appears to be fully trained when it is not, and it has a modification time. Likewise, when retraining a project is interrupted, the modification time is falsely updated.
This problem may occur more often if the `--prepare-only` option is implemented for the train command.
This PR makes the methods that query the train state and modification time ignore files in the project's datadir matching the patterns `*-train*`, `tmp-*` and `vectorizer`. The `tmp-` prefix is added to all temporary files, because some backends use a tempfile for the model file during training, and that file can remain after unfinished training, e.g. `stwfsa_predictor1mz8z4im.zip`.
The train and temp files should definitely be ignored, but the vectorizer file case is not so clear.
Instead of using global ignore patterns, this functionality could use the actual model file names/patterns per backend, but the field storing them varies a bit (`MODEL_FILE`, `INDEX_FILE`, `MODEL_FILE_PREFIX`).
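The global ignore-pattern approach can be sketched roughly as follows. This is not the PR's actual code: the function names `relevant_files`, `is_trained` and `modification_time` and the constant `IGNORE_PATTERNS` are assumptions made for illustration. Only the three patterns themselves are taken from the description.

```python
import fnmatch
import os
from datetime import datetime, timezone

# Patterns named in the PR description, matched against bare file names.
IGNORE_PATTERNS = ("*-train*", "tmp-*", "vectorizer")


def relevant_files(datadir):
    """Return paths of regular files in datadir that match no ignore pattern."""
    if not os.path.isdir(datadir):
        return []
    paths = []
    for name in sorted(os.listdir(datadir)):
        path = os.path.join(datadir, name)
        if not os.path.isfile(path):
            continue
        if any(fnmatch.fnmatch(name, pat) for pat in IGNORE_PATTERNS):
            continue  # train data, temp file or vectorizer: not a model file
        paths.append(path)
    return paths


def is_trained(datadir):
    """Treat the project as trained only if some non-ignored file exists."""
    return bool(relevant_files(datadir))


def modification_time(datadir):
    """Return the newest mtime among non-ignored files, or None if there are none."""
    paths = relevant_files(datadir)
    if not paths:
        return None
    latest = max(os.path.getmtime(p) for p in paths)
    return datetime.fromtimestamp(latest, tz=timezone.utc)
```

In this sketch a leftover tempfile such as `stwfsa_predictor1mz8z4im.zip` would still count as a model file, which is why the description adds the `tmp-` prefix to all temporary files. A per-backend variant would replace `IGNORE_PATTERNS` with an allow-list built from each backend's model file names.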