facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

command issue - fairseq-hydra-train continue pretraining from checkpoint #5127

Open flckv opened 1 year ago

flckv commented 1 year ago

🐛 Bug: fairseq-hydra-train command to continue pretraining wav2vec

How do I specify the last checkpoint from a previous pretraining run with the fairseq-hydra-train command? @vineelpratap @androstj

I tried --continue-once, --restore-file ../checkpoint_last.pt, and leaving the flag out entirely, but none of them worked: pretraining started over from epoch one and the log said that no checkpoint was found.

To Reproduce

Steps to reproduce the behavior:

I followed the steps in the doc: https://github.com/facebookresearch/fairseq/blob/main/examples/wav2vec/README.md:

fairseq-hydra-train \
    task.data=/path/to/data \
    --config-dir /path/to/fairseq-py/examples/wav2vec/config/pretraining \
    --config-name wav2vec2_base_librispeech

adjusted to:

    srun fairseq-hydra-train \
        distributed_training.distributed_world_size=4 \
        +optimization.update_freq='[16]' \
        --restore-file /home/flck/outputs/2023-05-22/07-19-53/checkpoints/checkpoint_last.pt \
        --config-dir /home/flck/fairseq/examples/wav2vec/config/pretraining \
        --config-name wav2vec2_base_librispeech

I tried to continue pretraining wav2vec with the command suggested there, but received this error:

    fairseq-hydra-train: error: unrecognized arguments: --restore-file /home/user/outputs/2023-05-22/07-19-53/checkpoints/checkpoint_last.pt
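For context, fairseq-hydra-train is the Hydra entry point to fairseq: apart from --config-dir and --config-name, it expects settings as Hydra key=value overrides rather than argparse-style flags, which is presumably why the dashed --restore-file is rejected as an unrecognized argument. A minimal sketch of the override form, with placeholder paths:

    # Dashed flags other than --config-dir / --config-name are not parsed;
    # the equivalent option is passed as a dot-path override into the checkpoint group.
    fairseq-hydra-train \
        checkpoint.restore_file=/path/to/checkpoint_last.pt \
        --config-dir /path/to/fairseq-py/examples/wav2vec/config/pretraining \
        --config-name wav2vec2_base_librispeech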

Expected behavior

In the first 12 hours, pretraining reached around epoch 16. I wanted to continue the pretraining from the last saved checkpoint for another 12 hours.

Environment

    #SBATCH --job-name=ol                        # Job name
    #SBATCH --output=/home/flck/output_.%A.txt   # Standard output and error log
    #SBATCH --nodes=2                            # Number of nodes
    #SBATCH --ntasks=4                           # Total number of tasks
    #SBATCH --mem=128G                           # Total RAM to be used
    #SBATCH --cpus-per-task=4                    # Number of CPU cores per task
    #SBATCH --gres=gpu:4                         # Number of GPUs (per node)
    #SBATCH -p gpu                               # Use the gpu partition
    #SBATCH --time=12:00:00                      # Time needed for the experiment
    #SBATCH --qos=gpu-8                          # QOS enabling the use of up to 8 GPUs

Additional context: hydra_train.log file:

    [2023-05-22 07:20:12,939][fairseq_cli.train][INFO] - task: AudioPretrainingTask
    [2023-05-22 07:20:12,939][fairseq_cli.train][INFO] - model: Wav2Vec2Model
    [2023-05-22 07:20:12,939][fairseq_cli.train][INFO] - criterion: Wav2vecCriterion
    [2023-05-22 07:20:12,941][fairseq_cli.train][INFO] - num. shared model params: 95,044,608 (num. trained: 95,044,608)
    [2023-05-22 07:20:12,943][fairseq_cli.train][INFO] - num. expert model params: 0 (num. trained: 0)
    [2023-05-22 07:20:12,947][fairseq.data.audio.raw_audio_dataset][INFO] - loaded 1240, skipped 26 samples
    [2023-05-22 07:20:16,081][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:2 to store for rank: 0
    [2023-05-22 07:20:16,081][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:2 with 4 nodes.
    [2023-05-22 07:20:16,082][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.1.0.bias
    [2023-05-22 07:20:16,082][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.2.0.bias
    [2023-05-22 07:20:16,082][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.3.0.bias
    [2023-05-22 07:20:16,082][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.4.0.bias
    [2023-05-22 07:20:16,082][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.5.0.bias
    [2023-05-22 07:20:16,082][fairseq.trainer][INFO] - detected shared parameter: feature_extractor.conv_layers.0.0.bias <- feature_extractor.conv_layers.6.0.bias
    [2023-05-22 07:20:18,068][fairseq.utils][INFO] - CUDA enviroments for all 4 workers
    [2023-05-22 07:20:18,068][fairseq.utils][INFO] - rank 0: capabilities = 8.0 ; total memory = 39.586 GB ; name = NVIDIA A100-SXM4-40GB
    [2023-05-22 07:20:18,068][fairseq.utils][INFO] - rank 1: capabilities = 8.0 ; total memory = 39.586 GB ; name = NVIDIA A100-SXM4-40GB
    [2023-05-22 07:20:18,068][fairseq.utils][INFO] - rank 2: capabilities = 8.0 ; total memory = 39.586 GB ; name = NVIDIA A100-SXM4-40GB
    [2023-05-22 07:20:18,068][fairseq.utils][INFO] - rank 3: capabilities = 8.0 ; total memory = 39.586 GB ; name = NVIDIA A100-SXM4-40GB
    [2023-05-22 07:20:18,068][fairseq.utils][INFO] - CUDA enviroments for all 4 workers
    [2023-05-22 07:20:18,069][fairseq_cli.train][INFO] - training on 4 devices (GPUs/TPUs)
    [2023-05-22 07:20:18,069][fairseq_cli.train][INFO] - max tokens per device = 1400000 and max sentences per device = None
    [2023-05-22 07:20:18,070][fairseq.trainer][INFO] - Preparing to load checkpoint checkpoints/checkpoint_last.pt
    [2023-05-22 07:20:18,071][fairseq.trainer][INFO] - No existing checkpoint found checkpoints/checkpoint_last.pt
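A plausible reading of the last two log lines: Hydra creates a fresh timestamped output directory for every run (hence the /home/flck/outputs/2023-05-22/07-19-53/ path above), and the default relative checkpoint directory checkpoints/ is resolved inside it, so checkpoints/checkpoint_last.pt does not exist in the new run and training starts from scratch. Assuming that is what happened here, fixing the checkpoint directory to an absolute path keeps successive runs reading and writing the same checkpoints; a sketch using the path from this issue:

    # checkpoint.save_dir is the fairseq option controlling where checkpoints are written;
    # an absolute path avoids depending on Hydra's per-run working directory.
    fairseq-hydra-train \
        checkpoint.save_dir=/home/flck/outputs/2023-05-22/07-19-53/checkpoints \
        --config-dir /home/flck/fairseq/examples/wav2vec/config/pretraining \
        --config-name wav2vec2_base_librispeech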

orena1 commented 1 year ago

Hi @flckv, it needs to be checkpoint.restore_file=../checkpoint_last.pt and not --restore-file ../checkpoint_last.pt. Good luck!
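Combined with the command from the issue, the resume attempt would then look something like this (the restore path points at the earlier run's checkpoint directory; adjust it to wherever checkpoint_last.pt actually lives):

    srun fairseq-hydra-train \
        distributed_training.distributed_world_size=4 \
        +optimization.update_freq='[16]' \
        checkpoint.restore_file=/home/flck/outputs/2023-05-22/07-19-53/checkpoints/checkpoint_last.pt \
        --config-dir /home/flck/fairseq/examples/wav2vec/config/pretraining \
        --config-name wav2vec2_base_librispeech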