atomistic-machine-learning / schnetpack

SchNetPack - Deep Neural Networks for Atomistic Systems

New schnetpack is very slow on training #433

Closed sssaennn closed 1 year ago

sssaennn commented 2 years ago

Training the SchNet model on the QM9 dataset with SchNetPack 1.0 took only about 100 seconds per epoch, but with SchNetPack 2.0 it takes over 300 seconds. I'm pretty sure the training runs on the GPU, so I used the PyTorch profiler to check what happens during training.
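For context, a minimal sketch of this kind of profiling with torch.profiler (the actual run used SchNetPack's Lightning training loop; the model, loader, loss_fn names and the (inputs, targets) batch format below are generic placeholders, not SchNetPack's API):

```python
from torch.profiler import profile, ProfilerActivity

def run_profiled_steps(model, loader, optimizer, loss_fn, device="cuda", n_steps=20):
    """Profile a handful of training steps on both CPU and GPU."""
    model.to(device)
    model.train()
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for step, (inputs, targets) in enumerate(loader):
            if step >= n_steps:
                break
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
    # Sorting by total CPU time shows whether the bottleneck sits on the CPU side.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```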

Look at the validation step:

[profiler screenshot] In the validation step, GPU utilization is at a normal level.

But now look at the training step:

[profiler screenshot] During the training step, GPU utilization is low or even zero, and almost all of the time is spent on the CPU.

Can you check this problem? Thanks.

ktschuett commented 2 years ago

This is a weird one. We have had similar problems occur randomly. At first I thought it was a problem with our cluster, since it does not happen on my local computer. For @NiklasGebauer, it only happened after he set up a new virtual environment. It might therefore be caused by a certain package (or combination of packages), but I am not entirely sure. Could you post your Python environment (e.g. conda list or pip list)?

sssaennn commented 2 years ago

This is my environment, exported with conda env export > environment.yml: environment.txt

sssaennn commented 2 years ago

You may be right. When I run another lightning-hydra model (its structure is adapted from the new SchNetPack) in the same environment, I don't have this problem at all. The Python packages used for that model are a bit different from those of the new SchNetPack, but I'm not sure which packages are causing the problem.

Is there any tool that lists all packages used in the current program?

ktschuett commented 2 years ago

On my side, one epoch of a QM9 run takes about 60 seconds on a P100. I will try the same with your environment and report back later.

sssaennn commented 2 years ago

On my side, one epoch of a QM9 run takes about 60 seconds on a P100. I will try the same with your environment and report back later.

Could you give me your environment file (like environment.txt)? I'd like to test whether things work normally when I use your environment, thanks.

ktschuett commented 2 years ago

Is there any tool that lists all packages used in the current program?

I am not aware, but that would indeed be useful.

sssaennn commented 2 years ago

Is there any tool that lists all packages used in the current program?

I am not aware, but that would indeed be useful.

I found the pipreqs package to do this, via https://stackoverflow.com/questions/35796968/get-all-modules-packages-used-by-a-python-project
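For reference, a typical pipreqs invocation looks like this (the project path is a placeholder):

```bash
pip install pipreqs
pipreqs /path/to/your/project --print   # print the inferred requirements to stdout
pipreqs /path/to/your/project --force   # or (over)write requirements.txt in that folder
```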

ktschuett commented 2 years ago

Is there any tool that lists all packages used in the current program?

I am not aware, but that would indeed be useful.

I found the pipreqs package to do this, via https://stackoverflow.com/questions/35796968/get-all-modules-packages-used-by-a-python-project

Nice! I am trying to export my environment, but conda is giving version errors...

sssaennn commented 2 years ago
ase==3.22.1
auto_mix_prep==0.2.0
dirsync==2.2.4
fasteners==0.17.3
h5py==3.7.0
matscipy==0.7.0
scipy==1.7.3
tensorboardX==2.5.1
tqdm==4.64.0

The list contains all the packages that are used by the new SchNetPack but not by my lightning-hydra model, including the versions I installed. I think you can check whether any of these packages are causing the problem.

ktschuett commented 2 years ago

OK, I don't know what the matter is with the conda export, but apparently it is a common bug that the export fails when something is wrong in the package requirements. Here are my versions of the packages:

ase==3.22.0
auto_mix_prep -> not installed (I also don't know why it would be used by schnetpack)
dirsync==2.2.5
fasteners==0.16.3
h5py==3.0.0
matscipy==0.7.0+145.gaa33010
scipy==1.5.4
tensorboardx==2.1
tqdm==4.64.0

I'll try your environment now.

sssaennn commented 2 years ago

Well, I tried using your environment, except for matscipy because I couldn't find the 0.7.0+145.gaa33010 version, but the problem wasn't solved. I guess I need your complete environment (including pip and conda) so that I can get a normal run working first.

By the way, I can give you my complete environment:

This file is from pip freeze > requirements.txt: requirements.txt

And this file is from pipreqs run at the schnetpack2.0 path: requirements.txt

And this file is the output of my job. I've masked a few private details, but you can see that epoch 0 took over 3 minutes to reach just over 50%.

zonezone12 commented 2 years ago

I ran SchNetPack with the command spktrain experiment=qm9 model/representation=schnet trainer=debug_trainer and it took about 6 min per epoch.

My versions of the packages:
ase 3.22.1
dirsync 2.2.5
fasteners 0.17.3
h5py 3.7.0
hydra-colorlog 1.2.0
hydra-core 1.2.0
matscipy 0.7.0+145.gaa33010
numpy 1.23.2
python 3.8.13
pytorch-lightning 1.6.5
PyYAML 6.0
schnetpack 1.0.0.dev0
scipy 1.9.1
tensorboard 2.9.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
torch 1.12.1
torchmetrics 0.9.3
tqdm 4.64.1

ktschuett commented 2 years ago

I think I found the issue! In the output file, I see that the flag trainer.detect_anomaly is set to true in the config. This should be false! This might have been the default in a previous version, but currently it is false by default. Could you retry and report whether that was the problem?
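(For reference, a standard Hydra override for this flag on the command line would look roughly as follows; the rest of the command mirrors the QM9 example used elsewhere in this thread.)

```bash
# Disable anomaly detection for the run via a Hydra command-line override
spktrain experiment=qm9 model/representation=schnet trainer.detect_anomaly=False
```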

ktschuett commented 2 years ago

and it took about 6min per epoch.

Same issue here: The debug trainer uses detect_anomaly to identify NaNs, which seems to be quite slow.

sssaennn commented 2 years ago

I think I found the issue! In the output file, I see that the flag trainer.detect_anomaly is set to true in the config. This should be false! This might have been the default in a previous version, but currently it is false by default. Could you retry and report whether that was the problem?

Please see this log file. After setting detect_anomaly to False, the training speed does improve, but only from about 6 minutes to about 4.5 minutes per epoch. I think the main cause of this problem lies somewhere other than detect_anomaly. Anyway, thanks for reporting this finding, it is useful.

With detect_anomaly=False: [profiler screenshots]

ktschuett commented 2 years ago

From the log file it seems your validation is also quite slow. Since you're running on a cluster, I have two more ideas:

  1. How many CPUs did you assign to the job? Since we now have a batch layout that is more efficient for large systems, we can't use something like the SimpleEnvironmentProvider anymore, so the preprocessing might be a bit more CPU-intensive. Try 2-4 CPU cores (real cores, not hyperthreads) and 4 train and validation workers.
  2. If your data is stored on a slow shared drive, you might want to copy it to a fast local directory on the node. SchNetPack can take care of that: when you set data.data_workdir=/your/local/directory, the data is automatically copied there. (See the sketch below.)

Let me know if this helps.
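A rough sketch of those two suggestions as command-line overrides. data.data_workdir is the key named above; data.num_workers and data.num_val_workers are assumed from SchNetPack 2.0's data module and may be named differently in your config tree:

```bash
# More DataLoader workers plus a fast local working copy of the dataset
spktrain experiment=qm9 model/representation=schnet \
    data.num_workers=4 \
    data.num_val_workers=4 \
    data.data_workdir=/your/local/directory
```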

sssaennn commented 2 years ago

From the log file it seems your validation is also quite slow. Since you're running on a cluster, I have two more ideas:

  1. How many CPUs did you assign to the job? Since we now have a batch layout that is more efficient for large systems, we can't use something like the SimpleEnvironmentProvider anymore, so the preprocessing might be a bit more CPU-intensive. Try 2-4 CPU cores (real cores, not hyperthreads) and 4 train and validation workers.
  2. If your data is stored on a slow shared drive, you might want to copy it to a fast local directory on the node. SchNetPack can take care of that: when you set data.data_workdir=/your/local/directory, the data is automatically copied there.

Let me know if this helps.

Thank you for your advice.

My computing platform is one of the top 60 supercomputers, so I don't think it's a hardware problem.

For point 1, each node has 8 GPUs and 36 CPUs, and the CPU model is Intel® Xeon® Gold 6154. I use 4 CPU cores to run my job.

For point 2, I use the Hyper File System (HFS) to store my data and I think it should not be a slow shared drive.

In my environment, SchNetPack 1.0 without question works fine, while SchNetPack 2.0 still takes 1.5 times longer than SchNetPack 1.0 to complete one epoch, even with detect_anomaly=False. I have also tried setting the number of validation workers to 4, but an epoch still takes over 4 minutes.

Also, if I run another lightning-hydra model with the same configuration and environment, it works fine:

Edited: With detect_anomaly=False, this model runs about 7 times faster than with detect_anomaly=True!

Edited 2: SchNetPack 1.0, SchNetPack 2.0, and this model all ran on the same cluster.

[profiler screenshots]

I still think the problem is caused by some function or package version.

sssaennn commented 2 years ago

I have an idea: a lightning-hydra version of SchNetPack 1.0. Without changing any of the model structures and functions used in SchNetPack 1.0 (e.g. without using scatter_add, _atoms_collate_fn, and the other new functions), just use the lightning-hydra template to replace the necessary parts, and then run the training once in my environment. Regardless of whether the result is slower than or the same as the original SchNetPack 1.0, the problem could be narrowed down and we would be able to figure out whether it comes from lightning-hydra or from the new functions/structures. What do you think? Would this be too time-consuming? Or do you think it would be unhelpful?

ktschuett commented 2 years ago

Besides the fact that this would be a lot of work, I don't think it would help, because the SchNetPack 2 code runs fast on our computer cluster and you have already found that Lightning/Hydra works fine on your side with a different model. Therefore there must be some environment/hardware dependency or conflict.

I finally got around to setting up and exporting a conda environment similar to the one you provided. You also have to install schnetpack with pip install git+https://github.com/atomistic-machine-learning/schnetpack.git, since the conda YAML does not seem to cover the installation via git.

Calling this with spktrain experiment=qm9 data.num_train=110000 data.num_val=1000 data.batch_size=100 data.datapath=/home/kschuett/data/qm9.db data.data_workdir=/tmp/kristof/spktest model/representation=schnet yields 16 it/s in training, which comes to approximately 1 minute per epoch including validation.

Perhaps you can try again with this? You could also send me your trace file; maybe there is some hint about which op is causing the slowdown that I can't see in the picture.
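Gathered into one snippet, the two commands from this comment (the paths are the examples given above and need to be adapted):

```bash
# Install SchNetPack directly from GitHub
pip install git+https://github.com/atomistic-machine-learning/schnetpack.git

# Training run used for the timing above (adapt datapath and data_workdir to your system)
spktrain experiment=qm9 data.num_train=110000 data.num_val=1000 data.batch_size=100 \
    data.datapath=/home/kschuett/data/qm9.db data.data_workdir=/tmp/kristof/spktest \
    model/representation=schnet
```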

sssaennn commented 2 years ago

I will try your suggestion, thanks!

sssaennn commented 2 years ago

I think I've found the problem. Have you tried using Python 3.8? I have been using a Python 3.8 environment, and when I switched to Python 3.9 everything became much more normal; at least now I can finish one epoch in 2 minutes. I hope you can help me verify this. I found this out from the environment file you provided, thanks for your help.

And if you can confirm that the problem is with the Python version, I think the requirements should be updated accordingly.
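For anyone hitting the same thing, switching the Python version generally means creating a fresh environment; a minimal sketch (the environment name is arbitrary, and the install line is the one from the comment above):

```bash
# Create and activate a clean Python 3.9 environment, then install SchNetPack from GitHub
conda create -n spk2-py39 python=3.9
conda activate spk2-py39
pip install git+https://github.com/atomistic-machine-learning/schnetpack.git
```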

ktschuett commented 1 year ago

I tried a few more things, and the Python version does not make a difference for me. However, I got a warning from PyTorch 1.10 / CUDA 10.2 that the A100 is not properly supported, but only with Python 3.8 (?). I am not sure how things behave with your V100, but perhaps you also want to try updating to the newest PyTorch / CUDA.
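A quick way to check which PyTorch build, CUDA version, and GPU a job actually sees (standard torch attributes):

```bash
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"
```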

I'll close for now, but we'll keep an eye on this.