NVIDIA / MegaMolBART

A deep learning model for small molecule drug discovery and cheminformatics based on SMILES

Fine-tune based on a checkpoint #6

Open linmuchuiyang opened 1 year ago

linmuchuiyang commented 1 year ago

I have trained a MegaMolBART model from scratch with the following script:

```bash
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 8
#SBATCH --time=4:00:00
#SBATCH --partition PARTITION_NAME
#SBATCH --job-name megamolbart_train

set -x

docker load < /gpfs/fs1/jzhai/clara_discovery/MegaMolBART/megamolbart_V100.tar

MEGAMOLBART_CONT_I=nvcr.io#nvidia/clara/megamolbart:0.2.0_temp
MEGAMOLBART_CONT=nvcr.io/nvidia/clara/megamolbart:0.2.0_temp
DATA_PATH="/xxx/clara_discovery/init_data"
WANDB_API_KEY="xxxx"
MEGAMOLBART_HOMEPATH="/xxx/clara_discovery/MegaMolBART"
WANDB_OFFLINE="FALSE"  # Set to FALSE to upload to WandB during training

MOUNTS="$DATA_PATH:/data,${DATA_PATH}/result:/result,${MEGAMOLBART_HOMEPATH}:/workspace/nemo_chem"

docker push $MEGAMOLBART_CONT

enroot import -o test.sqsh dockerd://${MEGAMOLBART_CONT}

srun \
    --output slurm-%j-%n.out \
    --error error-%j-%n.out \
    --container-image ./test.sqsh \
    --container-mounts ${MOUNTS} \
    --container-workdir /workspace/nemo_chem/examples/chem \
    --export WANDB_API_KEY="${WANDB_API_KEY}" \
    python megamolbart_pretrain.py \
        --config-path=conf \
        --config-name=megamolbart_pretrain_xsmall_span_aug \
        ++trainer.num_nodes=${SLURM_JOB_NUM_NODES} \
        ++trainer.gpus=${SLURM_NTASKS_PER_NODE} \
        exp_manager.wandb_logger_kwargs.offline=${WANDB_OFFLINE}
```

This works for training. However, when I want to continue from a checkpoint, re-executing the script above leads to the following error:

```
[NeMo W 2022-11-28 18:18:51 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:611: UserWarning: Checkpoint directory /result/nemo_experiments/MegaMolBART/xsmall_span_aug_promethous_pretraining/checkpoints exists and is not empty.
      rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")

Error executing job with overrides: ['++trainer.num_nodes=1', 'exp_manager.wandb_logger_kwargs.offline=FALSE']
Traceback (most recent call last):
  File "megamolbart_pretrain.py", line 110, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1174, in _run
    self._call_setup_hook()  # allow user to setup lightning_module in accelerator environment
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_setup_hook
    self._call_lightning_module_hook("setup", stage=fn)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1595, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 803, in setup
    self.setup_training_data(self._cfg.data)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 816, in setup_training_data
    self._train_dl = self.build_pretraining_data_loader(self._train_ds, consumed_samples)
  File "/workspace/nemo_chem/nemo_chem/models/megamolbart/megamolbart_model.py", line 97, in build_pretraining_data_loader
    dataloader = super().build_pretraining_data_loader(dataset=dataset, consumed_samples=consumed_samples)
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/models/language_modeling/megatron_lm_encoder_decoder_model.py", line 754, in build_pretraining_data_loader
    batch_sampler = MegatronPretrainingBatchSampler(
  File "/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/megatron/megatron_batch_samplers.py", line 79, in __init__
    raise RuntimeError("no samples left to consume: {}, {}".format(consumed_samples, total_samples))
RuntimeError: no samples left to consume: 4123392, 294549
```
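For context on the final error: judging from the traceback, `MegatronPretrainingBatchSampler` aborts in its constructor when the consumed-sample count restored for the run already meets or exceeds the dataset's total sample count. Below is a minimal Python sketch of that guard; the function name `check_remaining_samples` is my own illustrative stand-in, not the NeMo source, which performs this check inside `MegatronPretrainingBatchSampler.__init__`:

```python
# Minimal sketch (not the NeMo source) of the guard that raises the error
# above, paraphrased from megatron_batch_samplers.py in the traceback.
def check_remaining_samples(consumed_samples: int, total_samples: int) -> None:
    # The batch sampler refuses to start when everything has been consumed.
    if consumed_samples >= total_samples:
        raise RuntimeError(
            "no samples left to consume: {}, {}".format(consumed_samples, total_samples)
        )

# With the values from the log, the resumed run claims 4123392 samples
# already consumed against a dataset of only 294549, so it raises at once.
try:
    check_remaining_samples(4123392, 294549)
except RuntimeError as err:
    print(err)  # -> no samples left to consume: 4123392, 294549
```

In other words, the checkpoint carries a `consumed_samples` counter far larger than the training set being loaded, so the sampler concludes there is nothing left to train on.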