ICAMS / python-ace


Performance of the package #77

Open mingwang-zhong opened 2 months ago

mingwang-zhong commented 2 months ago

Dear authors,

I have several questions regarding the performance of this package.

  1. I have trained the ACE potential for the Al-Si alloy with 1551 structures (16 atoms each), using 20% for training. The process is fast, with 2s taken for one iteration. However, the process stops at Loss=0.035, which is too high. I tried another training by reading the best intermediate potential, but this training stops immediately. How can I continue the training process to reduce the loss?

  2. It took 4000 iterations to reach Loss=0.035, suggesting that the convergence is slow. In another question #45, you suggested using a large kappa. In my case, even with kappa=0.95, the loss only reaches 0.0053 after 4000 iterations. Do you have any other suggestions to improve the convergence? I have run the high-entropy-alloy and ethanol examples provided in your package, which give loss=0.066 for the HEA (15,000 iterations) and loss=0.0005 for ethanol (56,000 iterations). I am wondering how small the loss values can become if the scripts run as long as possible, and how good these ACE potentials are.

  3. I received a message soon after I launched the training process: "Calling GradientTape.gradient on a persistent tape inside its context is significantly less efficient than calling it outside the context (it causes the gradient ops to be recorded on the tape, leading to increased CPU and memory usage). Only call GradientTape.gradient inside the context if you actually want to trace the gradient in order to compute higher order derivatives". It seems that something is happening to make the training process slow. How can I resolve it?

  4. I tried the training process on a CPU (24 cores) and a GPU (NVIDIA V100). The speed on the GPU is slower than on the CPU. Is this expected, or could there be an issue with my setup? I got no errors or warnings during compilation.

Thanks! Mingwang

yury-lysogorskiy commented 2 months ago

Dear authors,

I have several questions regarding the performance of this package.

  1. I have trained the ACE potential for the Al-Si alloy with 1551 structures (16 atoms each), using 20% for training. The process is fast, with 2s taken for one iteration. However, the process stops at Loss=0.035, which is too high.

Ignore the loss; it depends on the dataset and many other settings. Compare the energy and force metrics instead.

I tried another training by reading the best intermediate potential, but this training stops immediately. How can I continue the training process to reduce the loss?

https://pacemaker.readthedocs.io/en/latest/pacemaker/faq/#optimization_stops_too_early_due_to_too_small_updates_but_i_want_to_run_it_longer

  2. It took 4000 iterations to reach Loss=0.035, suggesting that the convergence is slow. In another question, fitting convergence slow #45, you suggested using a large kappa. In my case, even with kappa=0.95, the loss only reaches 0.0053 after 4000 iterations. Do you have any other suggestions to improve the convergence? I have run the high-entropy-alloy and ethanol examples provided in your package, which give loss=0.066 for the HEA (15,000 iterations) and loss=0.0005 for ethanol (56,000 iterations). I am wondering how small the loss values can become if the scripts run as long as possible, and how good these ACE potentials are.

Ignore the loss; focus on the energy and force metrics. How large are they for you?

  3. I received a message soon after I launched the training process: "Calling GradientTape.gradient on a persistent tape inside its context is significantly less efficient than calling it outside the context (it causes the gradient ops to be recorded on the tape, leading to increased CPU and memory usage). Only call GradientTape.gradient inside the context if you actually want to trace the gradient in order to compute higher order derivatives". It seems that something is happening to make the training process slow. How can I resolve it?

This is a normal operational message, because a second derivative is taken (F = -dE/dr, and then dF/d(weights) for the minimization algorithm). Just ignore it.
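For illustration, here is a minimal TensorFlow sketch (not pacemaker's actual code) of the pattern behind this message: the forces are one derivative of the energy, and the optimizer then needs the derivative of the force loss with respect to the weights, so .gradient() is intentionally called inside the persistent tape context.

import tensorflow as tf

w = tf.Variable([1.5])                  # stand-in for potential weights
r = tf.constant([[0.9], [1.1]])         # stand-in for atomic positions

with tf.GradientTape(persistent=True) as tape:
    tape.watch(r)
    energy = tf.reduce_sum(w * r ** 2)  # toy energy model E(r; w)
    forces = -tape.gradient(energy, r)  # F = -dE/dr (this call triggers the warning)
    loss = tf.reduce_sum(forces ** 2)   # toy force loss
grad_w = tape.gradient(loss, w)         # d(force loss)/d(weights), a second derivative
print(grad_w)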

  4. I tried the training process on a CPU (24 cores) and a GPU (NVIDIA V100). The speed on the GPU is slower than on the CPU. Is this expected, or could there be an issue with my setup? I got no errors or warnings during compilation.

Are you sure TF actually used the GPU? Add the --verbose-tf flag; do you see something like "V100 32768Mb ..."?
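Independently of --verbose-tf, a quick generic TensorFlow check (not a pacemaker command), run in the same job environment, shows whether TF sees a GPU at all:

import tensorflow as tf

print("TF built with CUDA:", tf.test.is_built_with_cuda())
print("GPUs visible to TF:", tf.config.list_physical_devices('GPU'))  # empty list means CPU-only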

mingwang-zhong commented 2 months ago

Ignore the loss; it depends on the dataset and many other settings. Compare the energy and force metrics instead.

The RMSE of the energy is 225 meV/atom and the RMSE of the force is 103 meV/A. The pair plots for the test data are also attached (test_EF-pairplots).

https://pacemaker.readthedocs.io/en/latest/pacemaker/faq/#optimization_stops_too_early_due_to_too_small_updates_but_i_want_to_run_it_longer

The default gtol value is 1e-8, so I do not understand why the fitting still stops too early.

Ignore the loss; focus on the energy and force metrics. How large are they for you?

Please see the first response.

This is a normal operational message, because a second derivative is taken (F = -dE/dr, and then dF/d(weights) for the minimization algorithm). Just ignore it.

Thanks for your explanation.

Are you sure TF actually used the GPU? Add the --verbose-tf flag; do you see something like "V100 32768Mb ..."?

When I installed this package, I requested a GPU node. When I run the fitting script, I also request a GPU node, so I assume TF used the GPU. The GPU I used was always a V100-SXM2, which indeed has 32 GB of memory. Here is the benchmark for 100 iterations:

CPU: 275 seconds
GPU: 277 seconds
GPU (with --verbose-tf enabled): 293 seconds

Thanks!

mingwang-zhong commented 2 months ago

Are you sure TF actually used the GPU? Add the --verbose-tf flag; do you see something like "V100 32768Mb ..."?

I see what you mean. After adding the --verbose-tf flag when running the script, I do not find any output mentioning "GPU" or "V100", so TF must not be using the GPU. How can I enable the GPU then?

yury-lysogorskiy commented 2 months ago

1) RMSE of force = 103 meV/A does not look bad. Which weighting scheme do you use, uniform or energy-based? If the latter, then check the Energy_low and Forces_low metrics (both MAE and RMSE). General rule: if the errors are distributed normally, then RMSE/MAE is approximately 2; if RMSE >> MAE, then there are certain outliers that cannot be fitted. Check the whole table printed by pacemaker, not just one number.

2) The energy fit looks bad. Do you have kappa very close to 1? Upfit with kappa=0.3.

3) "The default gtol value is 1e-8, so I do not understand why the fitting still stops too early." If you read the output of pacemaker (i.e. log.txt) carefully, you may find the final message from the optimizer, which can give a hint.

4) --verbose-tf does not influence the performance in any way; it just adds a few extra info lines from TF that help to understand the problem. Do you see any message related to CUDA and/or drivers? You can attach the complete output with --verbose-tf so I can check it.

5) "Then how could I enable GPU?" Did you load any module with CUDA drivers (sometimes this is needed)? Did you install TensorFlow yourself, or was it already available as a module? Check the documentation for your cluster.
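As an aside, here is a small numpy sketch of the RMSE/MAE outlier diagnostic mentioned in point 1 (the numbers are synthetic; pacemaker already prints these metrics in its table):

import numpy as np

def rmse_mae(pred, ref):
    # Return (RMSE, MAE, RMSE/MAE) for equally shaped arrays.
    err = np.asarray(pred) - np.asarray(ref)
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    return rmse, mae, rmse / mae

rng = np.random.default_rng(0)
errors = rng.normal(0.0, 0.01, 1000)   # well-behaved residuals
errors[:5] += 0.5                      # a few outliers inflate RMSE far more than MAE
print(rmse_mae(errors, np.zeros_like(errors)))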

mingwang-zhong commented 2 months ago

  1. These are for kappa = auto = 0.13. RMSE/MAE for Energy_low and Forces_low are ~1.2-1.5. I have also run fittings for kappa = 0.5 and 0.95, yielding similar RMSE/MAE values and pair plots. The log files are attached.

log_kappa0.13.txt log_kappa0.5.txt log_kappa0.95.txt

  2. Please see the attached files; kappa = 0.5 produces similar RMSE/MAE.
  3. log.txt shows Optimization result(success=False and Last relative TRAIN loss change -1.83e-05/iter. It seems that the fitting halts because the loss change is sufficiently small. However, reducing the min_relative_train_loss_per_iter value in input.yaml in another fitting does not help. I would expect the program to stop upon reaching the given number of iterations or based on certain criteria that measure the quality of the fit, such as the loss value. However, I could not find a clear message indicating this in log.txt. Could you provide more explicit details on the stopping criteria?
  4. I could not find any messages related to CUDA. Please see the attached log_GPU.txt.

log_GPU.txt

  5. I did load the CUDA drivers, using the command module load cuda/11.1. I installed TensorFlow myself, using commands almost identical to those in the documentation. See the following:
module load anaconda3/2022.05
module load cmake/3.27.9
module load gcc/11.1.0
module load cuda/11.1

conda create -n ace python=3.9 -y
source activate ace
pip install tensorflow[and-cuda]

wget https://github.com/ICAMS/TensorPotential/archive/refs/heads/main.zip
unzip main.zip
cd TensorPotential-main/
pip install --upgrade .
cd ..

wget https://github.com/ICAMS/python-ace/archive/refs/heads/master.zip
unzip master.zip
cd python-ace-master
pip install --upgrade .

Thanks, Mingwang

yury-lysogorskiy commented 2 months ago

a) I have a strong suspicion that something is wrong with your energy data; the fit looks random. How did you obtain/prepare energy_corrected? Please describe the full procedure. Also, at the end of the fit you should get a "results/" folder with figures. Could you show them?

b) There is a clear message: Optimization result(success=False, status=1, message=Maximum number of iterations has been exceeded., nfev=2005, njev=2005), i.e. it reached the 2000-iteration limit. The last metrics are also shown for exactly the 2000th iteration. No stop due to a small loss change occurred.

c) Unfortunately, log_GPU.txt does not contain this technical info; it is usually redirected to STDOUT. If you run it with a queuing system, the file looks like "job.out" or "slurm-1234xxxx.out", etc. It should contain the same info as log.txt plus additional output from the underlying libraries (CUDA, TF, etc.).

mingwang-zhong commented 2 months ago

a). The data might be a factor affecting the fit quality. Our goal is to obtain an interatomic potential for an Al-based alloy using DFT data. At this point, we use data from LAMMPS simulations of the Al-Si alloy based on the AEAM potential (research paper and LAMMPS package).

The LAMMPS data preparation follows this procedure: I ran 51 simulations with 16 atoms each, varying the Si composition from 0 to 1 with increments of 0.02. For compositions below 0.8, the initial structure was FCC; for values above 0.8, diamond. The temperature in each simulation, fixed within an NPT ensemble, ranged from 600 K to 1800 K with increments of 40 K. These simulations ran for 1e5 steps per temperature, and force data were collected every 1e5 steps.

Here is the fitting result for kappa=0.5: report.zip

b). Another issue arises. The input maxiter is 2000, but the fitting process stops at 4000 iterations. This means that under certain conditions, for example after the first 2000 iterations, the fitting procedure can continue automatically. Similarly, in the high-entropy-alloy example, the input maxiter is 1500, but the fitting process runs for more than 1500*10 iterations. I would like to know which parameter in input.yaml controls the total number of iterations.

c). I found some clues in error.txt. It turns out TensorFlow could not locate some libraries, as shown below. It seems I need to contact the cluster staff to resolve this. BTW, how fast is the GPU compared with a multi-core CPU?

2024-09-12 20:10:22.216516: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /shared/centos7/cuda/11.1/lib64:/shared/centos7/cuda/11.1/targets/x86_64-linux/lib:/shared/centos7/gcc/11.1.0/lib64:/shared/centos7/gcc/11.1.0/libexec/gcc/x86_64-pc-linux-gnu/11.1.0/:/shared/centos7/gcc/11.1.0/lib/gcc/x86_64-pc-linux-gnu/11.1.0/:/shared/centos7/anaconda3/2022.05/lib
2024-09-12 20:10:22.217174: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

Thanks, Mingwang

mingwang-zhong commented 2 months ago

a). To calculate energy_corrected, I first ran LAMMPS simulations with a single Al atom and a single Si atom. These simulations gave a single-atom energy of -1.2857599 eV for Al and -0.0011080677 eV for Si. I then added reference_energy: { Al: -1.2857599, Si: -0.0011080677} to input.yaml.
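For clarity, here is a small sketch of the arithmetic this implies (the per-atom subtraction is an assumption about how pacemaker uses reference_energy to build energy_corrected; the total energy below is made up):

reference_energy = {"Al": -1.2857599, "Si": -0.0011080677}   # eV per isolated atom, from the thread

def corrected_energy(total_energy, species):
    # E_corrected = E_total - sum over atoms of the per-species reference energy (eV)
    return total_energy - sum(reference_energy[s] for s in species)

species = ["Al"] * 12 + ["Si"] * 4        # hypothetical 16-atom cell
print(corrected_energy(-55.0, species))   # -55.0 eV is a made-up total energy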

yury-lysogorskiy commented 2 months ago

a). The data might be a factor affecting the fit quality. Our goal is to obtain the interatomic potential of Al-based alloy using DFT data. At this point, we use the data from LAMMPS simulations of the Al-Si alloy based on the AEAM potential (research paper and LAMMPS package ).

Are you sure that this pair potential is correctly implemented, i.e. that it has force-consistent energies? Check it with numerical differentiation; see the sketch below.
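One possible way to do such a check, sketched here with hypothetical energy(pos) and forces(pos) callables standing in for single-point AEAM evaluations (e.g. via LAMMPS):

import numpy as np

def check_force_consistency(energy, forces, pos, eps=1.0e-4):
    # Compare analytic forces with -dE/dr from central finite differences.
    # pos is an (N, 3) array of positions; energy() returns eV, forces() eV/A.
    f_analytic = np.asarray(forces(pos))
    f_numeric = np.zeros_like(pos, dtype=float)
    for i in range(pos.shape[0]):
        for k in range(3):
            plus, minus = pos.copy(), pos.copy()
            plus[i, k] += eps
            minus[i, k] -= eps
            f_numeric[i, k] = -(energy(plus) - energy(minus)) / (2.0 * eps)
    # The maximum deviation should be small if energies and forces are consistent.
    return np.max(np.abs(f_analytic - f_numeric))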

The LAMMPS data preparation follows this procedure: I ran 51 simulation with 16 atoms each, varying Si composition from 0 to 1 with increments of 0.02. For compositions below 0.8, the initial structure was FCC, and for value above 0.8, diamond. The temperature in each simulation, fixed within an NPT ensemble, ranged from 600 K to 1800 K, with increments of 40 K. These simulations ran for 1e5 steps per temperature, and force data were collected every 1e5 steps.

Here is the fitting result for kappa=0.5: report.zip

b). Another issue arises. The input maxiter is 2000, but the fitting process stops at 4000 iterations. This means that under certain conditions, for example after the first 2000 iterations, the fitting procedure can continue automatically. Similarly, in the high-entropy-alloy example, the input maxiter is 1500, but the fitting process runs for more than 1500*10 iterations. I would like to know which parameter in input.yaml controls the total number of iterations.

maxiter applies to EACH ladder step. If you want to fit faster (but potentially less accurately), do NOT use ladder fitting.
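(For instance, with maxiter=1500 and roughly ten ladder steps, the optimizer can run on the order of 1500*10 = 15,000 iterations in total, consistent with what was observed in the HEA example.)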

c). I found some clues in error.txt. It turns out TensorFlow could not locate some libraries, as shown below. It seems I need to contact the cluster staff to resolve this. BTW, how fast is the GPU compared with a multi-core CPU?

It could be 5-10x faster, depending on the GPU and CPU.

2024-09-12 20:10:22.216516: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /shared/centos7/cuda/11.1/lib64:/shared/centos7/cuda/11.1/targets/x86_64-linux/lib:/shared/centos7/gcc/11.1.0/lib64:/shared/centos7/gcc/11.1.0/libexec/gcc/x86_64-pc-linux-gnu/11.1.0/:/shared/centos7/gcc/11.1.0/lib/gcc/x86_64-pc-linux-gnu/11.1.0/:/shared/centos7/anaconda3/2022.05/lib
2024-09-12 20:10:22.217174: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1850] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

Thanks, Mingwang

mingwang-zhong commented 2 months ago

Are you sure that this pair potential is correctly implemented, i.e. that it has force-consistent energies? Check it with numerical differentiation.

I have calculated the curves of energy vs. lattice constant for this AEAM potential, as shown below. They may not be perfect, but they look smooth, so the force should be consistent with the energy.

(attached figure: energy_vs_a)

maxiter applies to EACH ladder step. If you want to fit faster (but potentially less accurately), do NOT use ladder fitting.

So now there are two schemes. One is without ladder fitting, which is fast but less accurate. The other is with ladder fitting, which is slow but more accurate. However, in my case where the ladder is used, the fitting is not accurate because it terminates early...

It could be 5-10x faster, depending on the GPU and CPU.

That's a huge increase!

Thanks, Mingwang

yury-lysogorskiy commented 2 months ago

Why not use real DFT data after all? You will need it anyway.

mingwang-zhong commented 2 months ago

Thank you for your explanation.

[image not displayed]

However, the pair plots for the test set show poor results: [test_EF-pairplots, image not displayed]

We chose to test with LAMMPS data first because I have experience using LAMMPS but am not familiar with VASP or Quantum Espresso. We are also considering running DFT calculations next.

Thanks, Mingwang

mingwang-zhong commented 2 months ago

Not sure why the previous images are not shown.

Training set: train_EF-pairplots.png

Test set: test_EF-pairplots.png