Closed sssaennn closed 1 year ago
This is a weird thing. We have had similar problems occurring randomly. I thought at first it was a problem with our cluster, since it does not happen on my local computer. For @NiklasGebauer, it only happened after he set up a new virtual environment. Therefore, it might be caused by a certain package (or combination of packages), but I am not entirely sure. Could you post your python environment (e.g. `conda list` or `pip list`)?
This is my environment, exported with `conda env export > environment.yml`: environment.txt
You may be right. When I run another lightning-hydra model (with its structure adapted from the new Schnetpack) in the same environment, I don't have this problem at all, and the python packages used by that model differ a bit from the new Schnetpack's, but I'm not sure which packages are causing the problem.
Is there any tool that lists all packages used in the current program?
On my side, one epoch of QM9 runs takes about 60 seconds on a P100. I will try the same with your environment and report back later.
Could you give me your environment file (like environment.txt)? I'd like to test whether it works normally when I use your environment, thanks.
Is there any tool that lists all packages used in the current program?
I am not aware, but that would indeed be useful.
I found the pipreqs package, which does this, via https://stackoverflow.com/questions/35796968/get-all-modules-packages-used-by-a-python-project
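Neither `pipreqs` nor `pip freeze` inspects the running program itself: one scans the source tree for imports, the other dumps the whole environment. As a rough standard-library sketch (not SchNetPack-specific, and only a proxy for "packages used"), one can cross-reference the modules a program has actually imported against the installed distributions:

```python
import sys
import importlib.metadata as md

def used_distributions():
    """Rough proxy for 'packages used by the current program':
    installed distributions whose top-level modules are imported."""
    imported = {name.split(".")[0] for name in sys.modules}
    used = set()
    for dist in md.distributions():
        # top_level.txt lists the import names a distribution provides;
        # it is optional metadata, so fall back to an empty string.
        top_level = (dist.read_text("top_level.txt") or "").split()
        if any(mod in imported for mod in top_level):
            name = dist.metadata["Name"]
            if name:
                used.add(name)
    return sorted(used)
```

Calling `used_distributions()` at the end of a training run lists only what was actually imported; it misses lazy imports that never executed.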
Nice! I am trying to export my environment, but conda is giving version errors...
ase==3.22.1
auto_mix_prep==0.2.0
dirsync==2.2.4
fasteners==0.17.3
h5py==3.7.0
matscipy==0.7.0
scipy==1.7.3
tensorboardX==2.5.1
tqdm==4.64.0
This list contains all the packages that are used in the new Schnetpack but not in my lightning-hydra model, including the versions I installed. I think you can check whether these packages are causing the problem.
Ok, I don't know what the matter is with the conda export, but apparently it is a common bug that the export fails when there is something wrong in the package requirements. Here are my versions of the packages:
ase==3.22.0
auto_mix_prep -> not installed (also don't know why it should be used by schnetpack)
dirsync==2.2.5
fasteners==0.16.3
h5py==3.0.0
matscipy==0.7.0+145.gaa33010
scipy==1.5.4
tensorboardx==2.1
tqdm==4.64.0
I'll try your environment now.
Well, I tried to use your environment except matscipy because I couldn't find the 0.7.0+145.gaa33010 version, but the problem wasn't solved. I guess I need your complete environment (including pip and conda) so that I can run a normal job first.
By the way, I can give my complete environment to you:
this file is from `pip freeze > requirements.txt`: requirements.txt
and this file is from `pipreqs` at the schnetpack2.0 path: requirements.txt
And this file is the output of my job. I've masked a few private messages, but you can see that Epoch 0 took over 3 minutes to reach just over 50%.
I have run schnetpack with the command `spktrain experiment=qm9 model/representation=schnet trainer=debug_trainer` and it took about 6 min per epoch.
My versions of the packages:
ase 3.22.1
dirsync 2.2.5
fasteners 0.17.3
h5py 3.7.0
hydra-colorlog 1.2.0
hydra-core 1.2.0
matscipy 0.7.0+145.gaa33010
numpy 1.23.2
python 3.8.13
pytorch-lightning 1.6.5
PyYAML 6.0
schnetpack 1.0.0.dev0
scipy 1.9.1
tensorboard 2.9.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
torch 1.12.1
torchmetrics 0.9.3
tqdm 4.64.1
I think I found the issue! In the output file, I see that the flag `trainer.detect_anomaly` is set to `true` in the config. This should be `false`! This might have been the default in a previous version, but currently it is `false` by default. Could you retry and report whether that was the problem?
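If the flag is the culprit, it can also be overridden from the command line for a single run, without editing any config files. This is a sketch assuming the Hydra config exposes the Lightning Trainer flag as `trainer.detect_anomaly` (as the log above suggests):

```shell
# Disable anomaly detection for this run only; all other settings
# still come from the qm9 experiment config.
spktrain experiment=qm9 model/representation=schnet trainer.detect_anomaly=false
```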
and it took about 6min per epoch.
Same issue here: the debug trainer uses `detect_anomaly` to identify NaNs, which seems to be quite slow.
Please see this log file. After setting `detect_anomaly` to `False`, the training speed does improve, but only from about 6 minutes to 4.5 minutes per epoch. I think the main cause of this problem lies somewhere other than `detect_anomaly`. Anyway, thanks for reporting this finding, it is useful.
With `detect_anomaly=False`:
From the log file it seems your validation is also quite slow. Since you're running on a cluster, I have two more ideas:
- How many CPUs did you assign to the job? Since we now have a batch layout that is more efficient for large systems, we can't use something like the SimpleEnvironmentProvider anymore. Therefore the preprocessing might be a bit more CPU-intensive. Try 2-4 CPU cores (real, not hyperthreads) and 4 train and validation workers.
- If your data is stored on a slow shared drive, you might want to copy it to a fast local directory on the node. SchNetPack can take care of that: when you set `data.data_workdir=/your/local/directory`, the data is automatically copied there.

Let me know if this helps.
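Both suggestions can be combined in a single invocation. This is a sketch under the assumption that the data module exposes `num_workers` / `num_val_workers` options (names taken from SchNetPack's AtomsDataModule; check the printed config if they differ on your version), and `/tmp/spk_workdir` is just a placeholder path:

```shell
# Sketch: 4 dataloader workers for train and validation, plus a fast
# node-local working directory for the dataset copy.
spktrain experiment=qm9 model/representation=schnet \
    data.num_workers=4 data.num_val_workers=4 \
    data.data_workdir=/tmp/spk_workdir
```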
Thank you for your advice.
My computing platform is one of the top 60 supercomputers, so I don't think it's a hardware problem.
For point 1, each node has 8 GPUs and 36 CPUs, and the CPU model is Intel® Xeon® Gold 6154. I use 4 CPU cores to run my job.
For point 2, I use the Hyper File System (HFS) to store my data, and I don't think it counts as a slow shared drive.
For my environment, Schnetpack 1.0 no doubt works fine, while Schnetpack 2.0 still takes 1.5 times longer than Schnetpack 1.0 to complete one epoch when `detect_anomaly=False`.
Edited: Both Schnetpack 1.0 and Schnetpack 2.0 ran on the same cluster.
Edited2: I have tried setting validation workers = 4, but one epoch still takes over 4 minutes.
Also, if I run another lightning-hydra model with the same configuration and environment, it works fine.
Edited: In the case of `detect_anomaly=False`, that model runs 7 times faster than with `detect_anomaly=True`!
Edited2: Schnetpack 1.0, Schnetpack 2.0 and this model all ran on the same cluster.
I still think the problem is caused by some function or version in the package(s).
I have an idea: a lightning-hydra version of Schnetpack 1.0.
Without changing any of the model structures and functions used in Schnetpack 1.0 (e.g. without using `scatter_add`, `_atoms_collate_fn` and other new functions), just use the lightning-hydra template to replace the necessary parts, and then try to run the training once with my environment.
Regardless of whether the results are slower than or the same as the original Schnetpack 1.0, the problem can be narrowed down, and we would be able to figure out whether it comes from lightning-hydra or from those new functions/structures. What do you think? Would this be too time-consuming? Or do you think it would be unhelpful?
Besides the fact that this would be a lot of work, I don't think this would help because the SchNetPack 2 code works fast on our computer cluster and you already found that Lightning/Hydra works fine on your side with a different model. Therefore there must be some environment/hardware dependency or conflict.
I finally got around to setting up and exporting a conda environment similar to the one you provided. You also have to install schnetpack with `pip install git+https://github.com/atomistic-machine-learning/schnetpack.git`, since installation via git does not seem to be supported by the conda YAML.
Calling this with:
`spktrain experiment=qm9 data.num_train=110000 data.num_val=1000 data.batch_size=100 data.datapath=/home/kschuett/data/qm9.db data.data_workdir=/tmp/kristof/spktest model/representation=schnet`
yields 16 it/s in training, leading to approximately 1 min for an epoch including validation.
Perhaps you can try again with this? You could also send me your trace file, maybe there is some hint what op is causing the slowdown that I can't see on the picture.
I will try your suggestion, thanks!
I think I've found the problem. Have you tried using Python 3.8? I have been using a Python 3.8 environment, and when I switched to Python 3.9 everything became much more normal; at least now I can finish one epoch in 2 minutes. I hope you can help me verify this part. I found this out from the environment file you provided, thanks for your help.
And if you are sure the problem is with the Python version, I think you need to change the requirements accordingly.
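If the Python version really is the deciding factor, a small startup guard could make the requirement explicit. A minimal sketch (the 3.9 threshold is just what this thread observed, not an official SchNetPack requirement):

```python
import sys

def python_version_ok(version=None, minimum=(3, 9)):
    """Return True when the interpreter meets the minimum version
    observed to avoid the slowdown reported in this thread."""
    if version is None:
        version = sys.version_info
    # Compare (major, minor) tuples lexicographically.
    return (version[0], version[1]) >= minimum

if not python_version_ok():
    print("warning: Python < 3.9 detected; training may be slow")
```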
I tried some more things, and the Python version does not make a difference for me. However, I got a warning from pytorch 1.10 / CUDA 10.2 that the A100 is not properly supported, but only with Python 3.8 (?). I am not sure how things work with your V100, but perhaps you might also want to try updating to the newest pytorch / CUDA.
I'll close for now, but we'll keep an eye on this.
Training the schnet model on the QM9 dataset with Schnetpack 1.0 took only 100 seconds per epoch, but Schnetpack 2.0 takes over 300 seconds, and I'm pretty sure the training is done on the GPU. So I used the PyTorch profiler to check what happened during training.
Look at the validation step:
In the validation step, GPU utilization is at a normal value.
But now look at the training step:
During the training step, GPU utilization is low or even 0, and almost all the time is spent on the CPU.
Can you check this problem? Thanks.
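The PyTorch profiler screenshots above already point at CPU time. As a library-agnostic cross-check, the standard library's cProfile can show which Python-side functions (e.g. neighbor-list preprocessing or collate functions) dominate a step. A minimal sketch, not SchNetPack-specific:

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, **kwargs):
    """Run fn under cProfile and return (result, report), where report
    lists the five biggest cumulative-time consumers."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = fn(*args, **kwargs)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return result, buffer.getvalue()
```

Wrapping a single training step, e.g. `profile_call(training_step, batch)` (where `training_step` and `batch` are placeholders for your own objects), would show whether collation or neighbor-list code eats the time.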