I have trained a MegaMolBART from scratch with the following script:
```bash
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 8
#SBATCH --time=4:00:00
#SBATCH --partition PARTITION_NAME
#SBATCH --job-name megamolbart_train

set -x

docker load < /gpfs/fs1/jzhai/clara_discovery/MegaMolBART/megamolbart_V100.tar

MEGAMOLBART_CONT_I=nvcr.io#nvidia/clara/megamolbart:0.2.0_temp
MEGAMOLBART_CONT=nvcr.io/nvidia/clara/megamolbart:0.2.0_temp
DATA_PATH="/xxx/clara_discovery/init_data"
WANDB_API_KEY="xxxx"
MEGAMOLBART_HOMEPATH="/xxx/clara_discovery/MegaMolBART"
WANDB_OFFLINE="FALSE" # Set to FALSE to upload to WandB during training

MOUNTS="$DATA_PATH:/data,${DATA_PATH}/result:/result,${MEGAMOLBART_HOMEPATH}:/workspace/nemo_chem"

docker push $MEGAMOLBART_CONT
enroot import -o test.sqsh dockerd://${MEGAMOLBART_CONT}

srun \
  --output slurm-%j-%n.out \
  --error error-%j-%n.out \
  --container-image ./test.sqsh \
  --container-mounts ${MOUNTS} \
  --container-workdir /workspace/nemo_chem/examples/chem \
  --export WANDB_API_KEY="${WANDB_API_KEY}" \
  python megamolbart_pretrain.py \
    --config-path=conf \
    --config-name=megamolbart_pretrain_xsmall_span_aug \
    ++trainer.num_nodes=${SLURM_JOB_NUM_NODES} \
    ++trainer.gpus=${SLURM_NTASKS_PER_NODE} \
    exp_manager.wandb_logger_kwargs.offline=${WANDB_OFFLINE}
```
Training from scratch works. However, when I try to continue from a checkpoint, re-executing the script above leads to the following error:
```
[NeMo W 2022-11-28 18:18:51 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:611: UserWarning: Checkpoint directory /result/nemo_experiments/MegaMolBART/xsmall_span_aug_promethous_pretraining/checkpoints exists and is not empty.
      rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")

Error executing job with overrides: ['++trainer.num_nodes=1', 'exp_manager.wandb_logger_kwargs.offline=FALSE']
Traceback (most recent call last):
  File "megamolbart_pretrain.py", line 110, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1174, in _run
    self._call_setup_hook()  # allow user to setup lightning_module in accelerator environment
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_setup_hook
    self._call_lightning_module_hook("setup", stage=fn)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 803, in setup
    self.setup_training_data(self._cfg.data)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 816, in setup_training_data
    self._train_dl = self.build_pretraining_data_loader(self._train_ds, consumed_samples)
  File "/workspace/nemo_chem/nemo_chem/models/megamolbart/megamolbart_model.py", line 97, in build_pretraining_data_loader
    dataloader = super().build_pretraining_data_loader(dataset=dataset, consumed_samples=consumed_samples)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 754, in build_pretraining_data_loader
    batch_sampler = MegatronPretrainingBatchSampler(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/megatron/megatron_batch_samplers.py", line 79, in __init__
    raise RuntimeError("no samples left to consume: {}, {}".format(consumed_samples, total_samples))
RuntimeError: no samples left to consume: 4123392, 294549
```
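For context, the guard that raises can be sketched roughly as below (a simplified stand-in for the check in NeMo's `megatron_batch_samplers.py`, not the actual class): resuming restores `consumed_samples` from the checkpoint, and the sampler refuses to start when that count already meets or exceeds the size of the rebuilt training dataset, which here reports far fewer samples (294,549) than were consumed in the previous run (4,123,392).

```python
def check_sampler_state(consumed_samples: int, total_samples: int) -> None:
    """Simplified mirror of the guard that raises in MegatronPretrainingBatchSampler.

    `consumed_samples` comes from the restored checkpoint state;
    `total_samples` is the length of the freshly built training dataset.
    """
    if consumed_samples >= total_samples:
        raise RuntimeError(
            "no samples left to consume: {}, {}".format(consumed_samples, total_samples)
        )

# Values taken from the traceback above.
try:
    check_sampler_state(4123392, 294549)
except RuntimeError as e:
    print(e)  # no samples left to consume: 4123392, 294549
```

This suggests the mismatch between the restored sample counter and the dataset size is what blocks the resume, rather than the checkpoint files themselves.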