Closed: Vibsteamer closed this issue 1 year ago
Here are the training time reports using the debugging code from @iProzd.
Data load time during training matters a lot, and this debugging code gives longer training time for 2) (~16 s vs ~24 s per 2000 batches, 18 h vs 27 h to finish the task).
Note that disp_freq
for 1) is 100 times larger than that for 2).
for set 1) containing ~50000 systems:
DEEPMD INFO initialize model from scratch
DEEPMD INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 40000, decay_rate 0.944061, final lr will be 1.00e-08
DEEPMD INFO start train time: 0.3218533992767334 s
DEEPMD INFO batch 200000 training time 2173.19 s, testing time 0.01 s, data load time: 2826.80 s
...
DEEPMD INFO batch 400000 training time 2177.85 s, testing time 0.01 s, data load time: 2817.95 s
for set 2) containing ~17 systems:
DEEPMD INFO batch 2000 training time 24.62 s, testing time 0.01 s, data load time: 2.21 s
DEEPMD INFO batch 4000 training time 23.85 s, testing time 0.01 s, data load time: 0.26 s
DEEPMD INFO batch 6000 training time 23.94 s, testing time 0.01 s, data load time: 0.26 s
DEEPMD INFO batch 8000 training time 23.90 s, testing time 0.01 s, data load time: 0.26 s
DEEPMD INFO batch 10000 training time 23.91 s, testing time 0.01 s, data load time: 0.26 s
...
DEEPMD INFO finished training
DEEPMD INFO wall time: 99125.249 s
If I/O matters, could you try the HDF5 format?
The issue is that the time recorded by deepmd-kit does not cover all the important parts in a training step...
This new result used modified code from my environment, which already covers all the time components in a training step. "data load time" is the newly added part, measuring the I/O time spent in each step before training.
It seems that when the data contain many systems, the I/O takes a lot of time... Try my recently modified prefetching schedule again on this complex I/O scheme?
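For reference, here is a minimal sketch of the kind of per-step timing I understand this debugging code to add; it is not the actual patch, and get_batch / run_training_step / disp_freq below are placeholders.

```python
# Minimal sketch (not the actual patch): time the per-step data fetch separately
# from the forward/backward pass and report both every disp_freq steps.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("timing-sketch")

def get_batch():                  # placeholder for the real data pipeline
    time.sleep(0.001)             # stands in for reading coord.npy, energy.npy, ...
    return None

def run_training_step(batch):     # placeholder for sess.run(train_op, feed_dict=...)
    time.sleep(0.002)

disp_freq, num_steps = 100, 300
data_load_time = train_time = 0.0
for step in range(1, num_steps + 1):
    t0 = time.perf_counter()
    batch = get_batch()
    t1 = time.perf_counter()
    run_training_step(batch)
    t2 = time.perf_counter()
    data_load_time += t1 - t0
    train_time += t2 - t1
    if step % disp_freq == 0:
        log.info("batch %d training time %.2f s, data load time: %.2f s",
                 step, train_time, data_load_time)
        data_load_time = train_time = 0.0
```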
What is the set size of the systems? Do you have a significantly smaller set size in your case (1) than in (2)?
Yes, as described in the summary, the averaged set sizes are 1~2 for 1) and ~5000 for 2):
1) ~80000 frames in ~50000 systems, the task takes 52 hours; 2) ~80000 frames in ~17 systems takes 18 hours (type_mixed is used to collect the data).
When a set is completely used, deepmd-kit will load a new set from disk, even in the case that there is only one set in the system: https://github.com/deepmodeling/deepmd-kit/blob/89d0d2339b22954f07765202f50e34eb87ec4080/deepmd/utils/data.py#L236-L237 I guess this introduces the overhead in data loading.
@iProzd This overhead should also happen in the memory-friendly data model, shouldn't it?
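Roughly paraphrased (illustrative only; the class and method names below are not the actual source behind the link above), the behaviour looks like this, so with ~50000 systems of only 1-2 frames each almost every batch pays the reload cost:

```python
# Illustrative paraphrase only, not the real deepmd/utils/data.py implementation.
class TinySystemData:
    def __init__(self, set_size):
        self.set_size = set_size      # only 1-2 frames per set in case 1)
        self.iterator = set_size      # forces a load on the first call

    def _load_next_set_from_disk(self):
        # In the real code this re-reads coord.npy, energy.npy, force.npy, ...
        # even when the system contains a single set.
        self.iterator = 0

    def get_batch(self, batch_size):
        if self.iterator + batch_size > self.set_size:
            self._load_next_set_from_disk()    # disk I/O on (almost) every batch
        start = self.iterator
        self.iterator += batch_size
        return (start, self.iterator)          # stands in for the sliced arrays
```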
Do you run on a supercomputer?
It's on the cloud; the data and working_dir were sent together and should be on the same GPU machine.
Could you benchmark the performance of reading a file on your machine?
python -m timeit 'f=open("coord.npy");f.read();f.close()'
.npy files can't be .read() this way (the binary content fails to decode as text):
Traceback (most recent call last):
File "<timeit-src>", line 6, in inner
f=open("energy.npy");f.read();f.close()
File "/opt/deepmd-kit-2.1.5/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
for npy without read/close:
python -m timeit 'f=open("coord.npy")'
20000 loops, best of 5: 14.5 usec per loop
for raw without read/close:
python -m timeit 'f=open("coord.raw")'
20000 loops, best of 5: 14.3 usec per loop
for raw with read and close:
python -m timeit 'f=open("coord.raw");f.read();f.close()'
20000 loops, best of 5: 17.7 usec per loop
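Since .npy is a binary format, a perhaps closer proxy for the per-set reload cost would be to time np.load itself (just a suggestion, not a result from this thread):
python -m timeit -s 'import numpy as np' 'np.load("coord.npy")'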
It seems that I/O is not too expensive... (I got 200us on a supercomputer)
We may need to do profiling.
Update: worse and more varied reading-1-frame-coord.raw performance is observed, from 300 us to 800 us, on different machines (several invocations of the same cloud machine type).
Please ignore the previous result (it was on a debugging machine instead of a production machine):
20000 loops, best of 5: 17.7 usec per loop
This does make sense. You can consider converting all data into a single HDF5 file, which is opened only once.
system=dpdata.LabeledSystem("A/B/data1", fmt="deepmd/npy")
system.to_deepmd_hdf5("data.hdf5#/A/B/data1")
I see you are using DP-GEN. I plan to support the HDF5 format in the DP-GEN (https://github.com/deepmodeling/dpgen/issues/617) but haven't done it yet.
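In case it helps, a hedged sketch of looping that conversion over many systems (the directory layout and names here are hypothetical): every deepmd/npy system becomes a group inside one HDF5 file, which is then opened only once during training.

```python
# Hypothetical batch conversion: pack each deepmd/npy system found under
# data_root into a single HDF5 file, one group per system.
import os
import dpdata

data_root = "all_systems"   # placeholder: directory holding the ~50000 systems
out_file = "data.hdf5"

for name in sorted(os.listdir(data_root)):
    sys_dir = os.path.join(data_root, name)
    if not os.path.isdir(sys_dir):
        continue
    system = dpdata.LabeledSystem(sys_dir, fmt="deepmd/npy")
    system.to_deepmd_hdf5(f"{out_file}#/{name}")
```

If I read the HDF5 support correctly, such a system can then be referenced in the training input as "data.hdf5#/<name>".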
Thank you for your help and suggestion; happy to know about the coming features. @iProzd FYI.
Happy rabbit year~
Could you try https://github.com/deepmodeling/dpgen/pull/1119?
It takes 5 minutes to convert the 80000 frames in 50000 systems into a 275M HDF5 file.
The effect on training time is being tested.
The HDF5 file seems to worsen the total-time problem.
Original npy in 50000 systems, with data statistics before training not skipped:
DEEPMD INFO batch 398000 training time 15.25 s, testing time 0.00 s
...
DEEPMD INFO wall time: 8920.850 s
HDF5, with data statistics before training not skipped:
DEEPMD INFO batch 398000 training time 18.62 s, testing time 0.02 s
...
DEEPMD INFO wall time: 19959.018 s
Please hold the issue, @wanghan-iapcm.
The HDF5 test with load-data time for each training batch will be updated later.
HDF5, with data load time being printed (this version of the code also reports longer training time):
DEEPMD INFO batch 102000 training time 20.83 s, testing time 0.00 s, data load time: 77.26 s
...
DEEPMD INFO wall time: 20418.568 s
Do you use the same machine to compare them? I don't see why one has more training time with the same data.
this version of the code also reports longer training time
Can you submit a PR for this code?
It's the same cloud machine type. If there is a difference, users would encounter the same difference in their production environment.
It's from @iProzd.
You may run a short training with cProfile and provide the detailed profiling files in two cases: (1) on the slow machine; (2) on a faster machine. Then we can compare the difference between the two.
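One way to collect such a profile (assuming dp is installed as a normal Python entry point on PATH; file names are placeholders):
python -m cProfile -o train.prof $(which dp) train input.json
python -c "import pstats; pstats.Stats('train.prof').sort_stats('cumtime').print_stats(30)"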
Bug summary
Summary: effectively the same sets (~80000 frames); the same other params in the input; single GPU.
1) ~80000 frames in ~50000 systems, the task takes 52 hours; 2) ~80000 frames in ~17 systems takes 18 hours (type_mixed is used to collect the data)
DeePMD-kit Version
DeePMD-kit v2.1.5
TensorFlow Version
2.9.0
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
This was discussed with @iProzd previously, and the sets of 1) have been sent to him; it was said that I/O should not influence the training time after data statistics.
It took ~4 hours before the training actually started (data statistics, until lcurve.out starts to be written). The "training time" in the logs of both cases is effectively the same; note that disp_freq is 100 times larger for 1).
training time for 1) train_origin.log
training time for 2) train_typeSel.log
Steps to Reproduce
dp train
Further Information, Files, and Links
No response