coqui-ai / STT

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
https://coqui.ai
Mozilla Public License 2.0

Improvement: `NotFoundError`: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for `best_dev_checkpoint` #2338

Open wasertech opened 1 year ago

wasertech commented 1 year ago

I'm trying to optimize my LM, but lm_optimizer.py throws NotFoundError and reports that the environment has CuDNN disabled:

Checkpoint loading failed due to missing tensors, retrying with --load_cudnn true - You should specify this flag whenever loading a checkpoint that was created with --train_cudnn true in an environment that has CuDNN disabled.

I want to use my GPU --'

FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.

Related?
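The warning looks unrelated to the failure: the traceback ends in checkpoint loading, while the FutureWarning only concerns how the search ranges are sampled. Following the warning's own advice, the two calls in lm_optimize.py could be switched to suggest_float, roughly as sketched here. `FakeTrial` and `suggest_lm_params` are hypothetical names used only so the snippet runs without Optuna installed; in the real code, `trial` would be the `optuna.trial.Trial` object passed to the objective.

```python
import random


class FakeTrial:
    """Hypothetical stand-in that mimics Trial.suggest_float(name, low, high)."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self.params = {}

    def suggest_float(self, name, low, high):
        # Sample uniformly from [low, high] and remember the value by name.
        value = self._rng.uniform(low, high)
        self.params[name] = value
        return value


def suggest_lm_params(trial, lm_alpha_max, lm_beta_max):
    """Sample lm_alpha and lm_beta with suggest_float instead of suggest_uniform."""
    lm_alpha = trial.suggest_float("lm_alpha", 0, lm_alpha_max)
    lm_beta = trial.suggest_float("lm_beta", 0, lm_beta_max)
    return lm_alpha, lm_beta


if __name__ == "__main__":
    # Same upper bounds as --lm_alpha_max 2 --lm_beta_max 4 in the command line.
    print(suggest_lm_params(FakeTrial(), 2, 4))
```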

NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/checkpoints/best_dev-221133

I have a bad feeling about this one.

+ python -u /home/trainer/lm_optimizer.py --show_progressbar true --train_cudnn true --alphabet_config_path /mnt/models/fr/alphabet.txt --scorer_path /mnt/lm/fr/kenlm.scorer --feature_cache /mnt/sources/fr/feature_cache --test_files /mnt/extracted/fr/data/Assistant/train_test.csv --test_batch_size 64 --n_hidden 2048 --lm_alpha_max 2 --lm_beta_max 4 --n_trials 50 --checkpoint_dir /transfer-checkpoint
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
[I 2023-01-22 23:18:04,503] A new study created in memory with name: no-name-0f421b63-297c-468c-b30d-8aa59857a843
/home/trainer/stt/training/coqui_stt_training/util/lm_optimize.py:30: FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.
  Config.lm_alpha = trial.suggest_uniform("lm_alpha", 0, Config.lm_alpha_max)
/home/trainer/stt/training/coqui_stt_training/util/lm_optimize.py:31: FutureWarning: suggest_uniform has been deprecated in v3.0.0. This feature will be removed in v6.0.0. See https://github.com/optuna/optuna/releases/tag/v3.0.0. Use :func:`~optuna.trial.Trial.suggest_float` instead.
  Config.lm_beta = trial.suggest_uniform("lm_beta", 0, Config.lm_beta_max)
I Loading best validating checkpoint from /mnt/checkpoints/best_dev-221133
W Checkpoint loading failed due to missing tensors, retrying with --load_cudnn true - You should specify this flag whenever loading a checkpoint that was created with --train_cudnn true in an environment that has CuDNN disabled.
[W 2023-01-22 23:18:05,201] Trial 0 failed with parameters: {'lm_alpha': 0.26985826312830485, 'lm_beta': 1.3371065634850314} because of the following error: NotFoundError().
Traceback (most recent call last):
  File "/home/trainer/stt/training/coqui_stt_training/util/checkpoints.py", line 121, in _load_checkpoint
    return _load_checkpoint_impl(
  File "/home/trainer/stt/training/coqui_stt_training/util/checkpoints.py", line 21, in _load_checkpoint_impl
    ckpt = tfv1.train.load_checkpoint(checkpoint_path)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/checkpoint_utils.py", line 66, in load_checkpoint
    return pywrap_tensorflow.NewCheckpointReader(filename)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 873, in NewCheckpointReader
    return CheckpointReader(compat.as_bytes(filepattern))
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 885, in __init__
    this = _pywrap_tensorflow_internal.new_CheckpointReader(filename)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for /mnt/checkpoints/best_dev-221133

To Reproduce Steps to reproduce the behavior: Full logs

Expected behavior A study should start on the GPU for 50 trials.

Environment (please complete the following information): Docker

Additional context Built using the Training Wizard for STT

wasertech commented 1 year ago

I think /mnt/checkpoints/best_dev-221133 doesn't exist, but I can't seem to find where it comes from... The checkpoint file is in /transfer-checkpoint.

wasertech commented 1 year ago

Yes, it's /transfer-checkpoint/best_dev_checkpoint, which points to /mnt/checkpoints/best_dev-221133:

# /transfer-checkpoint/best_dev_checkpoint
model_checkpoint_path: "/mnt/checkpoints/best_dev-221133"
all_model_checkpoint_paths: "/mnt/checkpoints/best_dev-221133"
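A quick way to confirm this kind of stale pointer before starting a study is to read the state file and check whether any file exists for the recorded prefix. The sketch below uses only the standard library; `read_checkpoint_target`, `checkpoint_files_exist` and `diagnose` are hypothetical helper names, and it assumes the usual TensorFlow layout where a checkpoint prefix is backed by files such as `<prefix>.index` and `<prefix>.data-*`.

```python
import glob
import os
import re

# Matches a line such as: model_checkpoint_path: "/mnt/checkpoints/best_dev-221133"
_PATH_LINE = re.compile(r'^\s*model_checkpoint_path:\s*"(.*)"\s*$')


def read_checkpoint_target(state_file):
    """Return the model_checkpoint_path recorded in a state file, or None."""
    with open(state_file, encoding="utf-8") as handle:
        for line in handle:
            match = _PATH_LINE.match(line)
            if match:
                return match.group(1)
    return None


def checkpoint_files_exist(checkpoint_prefix):
    """True if at least one file named <prefix>.<something> exists."""
    return bool(glob.glob(glob.escape(checkpoint_prefix) + ".*"))


def diagnose(checkpoint_dir, state_name="best_dev_checkpoint"):
    """Describe whether the state file in checkpoint_dir points at real files."""
    state_file = os.path.join(checkpoint_dir, state_name)
    if not os.path.isfile(state_file):
        return "missing: no state file at {}".format(state_file)
    target = read_checkpoint_target(state_file)
    if target is None:
        return "invalid: no model_checkpoint_path in {}".format(state_file)
    if not os.path.isabs(target):
        # Assumption: relative entries are resolved against the checkpoint dir.
        target = os.path.join(checkpoint_dir, target)
    if checkpoint_files_exist(target):
        return "ok: {} -> {}".format(state_file, target)
    return "stale: {} -> {} (no matching files)".format(state_file, target)


if __name__ == "__main__":
    print(diagnose("/transfer-checkpoint"))
```

If the result is "stale", the checkpoint files were probably copied or mounted somewhere other than the absolute path recorded at training time.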

lm_optimizer should probably catch tensorflow.python.framework.errors_impl.NotFoundError here: https://github.com/coqui-ai/STT/blob/a694187be4817870e53f5e14b24e16b57dfaa581/training/coqui_stt_training/util/lm_optimize.py#L39

Or directly when computing results, in main: https://github.com/coqui-ai/STT/blob/a694187be4817870e53f5e14b24e16b57dfaa581/training/coqui_stt_training/util/lm_optimize.py#L86-L93

Something like:

import sys
...
from tensorflow.python.framework.errors_impl import NotFoundError
...
try:
    results = compute_lm_optimization()
    print(
        "Best params: lm_alpha={} and lm_beta={} with WER={}".format(
            results.get("lm_alpha"),
            results.get("lm_beta"),
            results.get("wer"),
        )
    )
except NotFoundError:
    # Both paths are hard-coded here for illustration only.
    print(
        "Your checkpoint /transfer-checkpoint/best_dev_checkpoint points to a missing checkpoint file /mnt/checkpoints/best_dev-221133\n"
        "Make sure you give a valid --checkpoint_dir path."
    )
    sys.exit(1)

Note: the two paths are hard-coded above. I still need to find the variables holding /transfer-checkpoint/best_dev_checkpoint and /mnt/checkpoints/best_dev-221133 (filename and checkpoint_path?).
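Until those variables are identified, one option is to avoid naming the missing checkpoint explicitly: the text of the NotFoundError in the traceback already contains it, and the state file path can be rebuilt from --checkpoint_dir. A minimal sketch of that idea follows. `run_with_checkpoint_hint` is a hypothetical helper, and the exception type is passed in as a parameter so the snippet runs without TensorFlow.

```python
import os


def run_with_checkpoint_hint(compute, checkpoint_dir, not_found_errors,
                             state_name="best_dev_checkpoint"):
    """Run compute(); on a not-found error, return (None, message).

    not_found_errors is an exception class or a tuple of classes to catch.
    """
    try:
        return compute(), None
    except not_found_errors as error:
        # Rebuild the state file path from the directory given on the CLI.
        state_file = os.path.join(checkpoint_dir, state_name)
        message = (
            "Checkpoint state file {} refers to a checkpoint that could not "
            "be loaded: {}\n"
            "Make sure you give a valid --checkpoint_dir path."
        ).format(state_file, error)
        return None, message


if __name__ == "__main__":
    def failing_study():
        # Stand-in for compute_lm_optimization() hitting a stale pointer.
        raise FileNotFoundError("/mnt/checkpoints/best_dev-221133")

    results, message = run_with_checkpoint_hint(
        failing_study, "/transfer-checkpoint", FileNotFoundError
    )
    if message is not None:
        print(message)
```

In lm_optimize.py, the caller would presumably pass TensorFlow's NotFoundError as `not_found_errors` and call sys.exit(1) after printing the message.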