SchNetPack - Deep Neural Networks for Atomistic Systems

Problem with running model training on GPU #671

Closed SoFarN closed 1 week ago

SoFarN commented 1 week ago

Hi, I recently started working with SchNetPack. I am trying to run the first example, which trains a model on QM9, on a GPU, but I get errors. This is the main part of my submission file:

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-socket=2
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=4-00:00:00
#SBATCH --distribution=cyclic:cyclic
source ~/anaconda3/etc/profile.d/conda.sh
conda activate schnet-env
pwd; hostname; date
spktrain experiment=qm9_atomwise
date

The SLURM error file contains:

/blue/mingjieliu/so.farajinafchi/anaconda3/envs/schnet-env/lib/python3.10/site-packages/hydra/_internal/config_loader_impl.py:216: UserWarning: provider=hydra.searchpath in main, path=/blue/mingjieliu/so.farajinafchi/practice_MLIP/schnetpack/spk_workdir/test_GPU/configs is not available.
  warnings.warn(
[rank: 0] Seed set to 306449896
/blue/mingjieliu/so.farajinafchi/anaconda3/envs/schnet-env/lib/python3.10/site-packages/pytorch_lightning/utilities/parsing.py:208: Attribute 'model' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['model'])`.
Error executing job with overrides: ['experiment=qm9_atomwise']
Error in call to target 'pytorch_lightning.trainer.trainer.Trainer':
RuntimeError('You set `--ntasks=4` in your SLURM bash script, but this variable is not supported. HINT: Use `--ntasks-per-node=4` instead.')
full_key: trainer

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
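As the error message suggests, the complete stack trace can be obtained by exporting that variable before calling spktrain; a minimal sketch:

export HYDRA_FULL_ERROR=1
spktrain experiment=qm9_atomwise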

An empty data folder and a run folder with cli.log and config.yaml files are created.

**cat runs/69eb6452-9d1a-11ef-ba75-5cff35fba093/cli.log** 
[2024-11-07 10:10:24,376][schnetpack.cli][INFO] - Running on host: c0800a-s11.ufhpc
[2024-11-07 10:10:24,387][schnetpack.cli][INFO] - Seed randomly with <306449896>
[2024-11-07 10:10:24,445][schnetpack.cli][INFO] - Instantiating datamodule <schnetpack.datasets.QM9>
[2024-11-07 10:10:24,679][schnetpack.cli][INFO] - Instantiating model <schnetpack.model.NeuralNetworkPotential>
[2024-11-07 10:10:26,985][schnetpack.cli][INFO] - Instantiating task <schnetpack.AtomisticTask>
[2024-11-07 10:10:27,521][schnetpack.cli][INFO] - Instantiating callback <schnetpack.train.ModelCheckpoint>
[2024-11-07 10:10:27,622][schnetpack.cli][INFO] - Instantiating callback <pytorch_lightning.callbacks.EarlyStopping>
[2024-11-07 10:10:27,623][schnetpack.cli][INFO] - Instantiating callback <pytorch_lightning.callbacks.LearningRateMonitor>
[2024-11-07 10:10:27,623][schnetpack.cli][INFO] - Instantiating callback <schnetpack.train.ExponentialMovingAverage>
[2024-11-07 10:10:27,624][schnetpack.cli][INFO] - Instantiating logger <pytorch_lightning.loggers.tensorboard.TensorBoardLogger>
[2024-11-07 10:10:27,626][schnetpack.cli][INFO] - Instantiating trainer <pytorch_lightning.Trainer>

Thank you for your help.

stefaanhessmann commented 1 week ago

Hi @SoFarN,

At first sight, your submission script looks fine, apart from the warning about --ntasks vs. --ntasks-per-node. I need some further information to understand what is going on: What version of SchNetPack are you using? Did you make any changes to the config files or code? Could you upload the config.yaml file that is saved to your model directory? Does the training work if you only use CPUs? A quick way to check the versions and to try a CPU-only run is sketched below.
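For example (the trainer.accelerator=cpu override is an assumption based on the Hydra config layout of recent SchNetPack versions, where the trainer config targets pytorch_lightning.Trainer; adjust it if your configs differ):

# print the installed SchNetPack and PyTorch Lightning versions
python -c "import schnetpack, pytorch_lightning; print(schnetpack.__version__, pytorch_lightning.__version__)"
# CPU-only test run (the override is an assumption and may need adjusting)
spktrain experiment=qm9_atomwise trainer.accelerator=cpu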

SoFarN commented 1 week ago

Hi,

Thanks, it works after changing the submission file as suggested. Now the training runs on both CPU and GPU.
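For reference, a minimal sketch of the adjusted #SBATCH header, assuming the only change needed was the one hinted at by the Lightning error (replacing --ntasks with --ntasks-per-node) and that the GPU request (e.g. a --gres or partition line) lives elsewhere in the full script:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-socket=2
#SBATCH --mem-per-cpu=5gb
#SBATCH --time=4-00:00:00
#SBATCH --distribution=cyclic:cyclic
source ~/anaconda3/etc/profile.d/conda.sh
conda activate schnet-env
spktrain experiment=qm9_atomwise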