Closed: Vibsteamer closed this issue 1 year ago
Here are the training time reports using the debugging code from @iProzd.
Data load time during training matters a lot, and this debugging code gives longer training time for 2) (~16 s vs ~24 s per 2000 batches, 18 h vs 27 h to finish the task).
Note that disp_freq
for 1) is 100 times larger than that for 2).
for set 1) containing ~50000 systems:
DEEPMD INFO initialize model from scratch
DEEPMD INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 40000, decay_rate 0.944061, final lr will be 1.00e-08
DEEPMD INFO start train time: 0.3218533992767334 s
DEEPMD INFO batch 200000 training time 2173.19 s, testing time 0.01 s, data load time: 2826.80 s
...
DEEPMD INFO batch 400000 training time 2177.85 s, testing time 0.01 s, data load time: 2817.95 s
for set 2) containing ~17 systems:
DEEPMD INFO batch 2000 training time 24.62 s, testing time 0.01 s, data load time: 2.21 s
DEEPMD INFO batch 4000 training time 23.85 s, testing time 0.01 s, data load time: 0.26 s
DEEPMD INFO batch 6000 training time 23.94 s, testing time 0.01 s, data load time: 0.26 s
DEEPMD INFO batch 8000 training time 23.90 s, testing time 0.01 s, data load time: 0.26 s
DEEPMD INFO batch 10000 training time 23.91 s, testing time 0.01 s, data load time: 0.26 s
...
DEEPMD INFO finished training
DEEPMD INFO wall time: 99125.249 s
If I/O matters, could you try the HDF5 format?
The issue is that the time recorded by deepmd-kit does not cover all the important parts in a training step...
This new result used modified code from my environment, which already covers all the time components in a training step. "data load time" is the newly added part, measuring the I/O time spent in each step before training.
It seems that when the data contain many systems, the I/O takes a lot of time... Try my recently modified prefetching schedule again on this complex I/O scheme?
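For reference, here is a minimal sketch of the kind of per-step timing I understand this debugging code to add; it is not the actual patch, and get_batch / run_training_step / disp_freq below are placeholders.

```python
# Minimal sketch (not the actual patch): time the per-step data fetch separately
# from the forward/backward pass and report both every disp_freq steps.
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("timing-sketch")

def get_batch():                  # placeholder for the real data pipeline
    time.sleep(0.001)             # stands in for reading coord.npy, energy.npy, ...
    return None

def run_training_step(batch):     # placeholder for sess.run(train_op, feed_dict=...)
    time.sleep(0.002)

disp_freq, num_steps = 100, 300
data_load_time = train_time = 0.0
for step in range(1, num_steps + 1):
    t0 = time.perf_counter()
    batch = get_batch()
    t1 = time.perf_counter()
    run_training_step(batch)
    t2 = time.perf_counter()
    data_load_time += t1 - t0
    train_time += t2 - t1
    if step % disp_freq == 0:
        log.info("batch %d training time %.2f s, data load time: %.2f s",
                 step, train_time, data_load_time)
        data_load_time = train_time = 0.0
```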
What is the set size of the systems? Do you have a significantly smaller set size in your case (1) than in (2)?
Yes, as described in the summary, the averaged set sizes are 1~2 for 1) and ~5000 for 2):
1) ~80000 frames in ~50000 systems, the task takes 52 hours; 2) ~80000 frames in ~17 systems takes 18 hours (type_mixed is used to collect the data).
When a set is completely used, deepmd-kit will load a new set from disk, even in the case that there is only one set in the system: https://github.com/deepmodeling/deepmd-kit/blob/89d0d2339b22954f07765202f50e34eb87ec4080/deepmd/utils/data.py#L236-L237 I guess this introduces the overhead in data loading.
@iProzd This overhead should also happen in the memory-friendly data model, shouldn't it?
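Roughly paraphrased (illustrative only; the class and method names below are not the actual source behind the link above), the behaviour looks like this, so with ~50000 systems of only 1-2 frames each almost every batch pays the reload cost:

```python
# Illustrative paraphrase only, not the real deepmd/utils/data.py implementation.
class TinySystemData:
    def __init__(self, set_size):
        self.set_size = set_size      # only 1-2 frames per set in case 1)
        self.iterator = set_size      # forces a load on the first call

    def _load_next_set_from_disk(self):
        # In the real code this re-reads coord.npy, energy.npy, force.npy, ...
        # even when the system contains a single set.
        self.iterator = 0

    def get_batch(self, batch_size):
        if self.iterator + batch_size > self.set_size:
            self._load_next_set_from_disk()    # disk I/O on (almost) every batch
        start = self.iterator
        self.iterator += batch_size
        return (start, self.iterator)          # stands in for the sliced arrays
```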
Do you run on a supercomputer?
It's on the cloud; the data and working_dir were sent together and should be on the same GPU machine.
Could you benchmark the performance of reading a file on your machine?
python -m timeit 'f=open("coord.npy");f.read();f.close()'
.npy files can't be .read() this way (the binary content fails to decode as text):
Traceback (most recent call last):
File "<timeit-src>", line 6, in inner
f=open("energy.npy");f.read();f.close()
File "/opt/deepmd-kit-2.1.5/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 0: invalid start byte
for npy without read/close:
python -m timeit 'f=open("coord.npy")'
20000 loops, best of 5: 14.5 usec per loop
for raw without read/close:
python -m timeit 'f=open("coord.raw")'
20000 loops, best of 5: 14.3 usec per loop
for raw with read and close:
python -m timeit 'f=open("coord.raw");f.read();f.close()'
20000 loops, best of 5: 17.7 usec per loop
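Since .npy is a binary format, a perhaps closer proxy for the per-set reload cost would be to time np.load itself (just a suggestion, not a result from this thread):
python -m timeit -s 'import numpy as np' 'np.load("coord.npy")'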
It seems that I/O is not too expensive... (I got 200us on a supercomputer)
We may need to do profiling.
Update: worse and more varied reading-1-frame-coord.raw performance is observed, from 300 us to 800 us, on different machines (several invocations of the same cloud machine type).
Please ignore the previous result (it was on a debugging machine instead of a production machine):
20000 loops, best of 5: 17.7 usec per loop
This does make sense. You can consider converting all data into a single HDF5 file, which is opened only once.
system=dpdata.LabeledSystem("A/B/data1", fmt="deepmd/npy")
system.to_deepmd_hdf5("data.hdf5#/A/B/data1")
I see you are using DP-GEN. I plan to support the HDF5 format in the DP-GEN (https://github.com/deepmodeling/dpgen/issues/617) but haven't done it yet.
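In case it helps, a hedged sketch of looping that conversion over many systems (the directory layout and names here are hypothetical): every deepmd/npy system becomes a group inside one HDF5 file, which is then opened only once during training.

```python
# Hypothetical batch conversion: pack each deepmd/npy system found under
# data_root into a single HDF5 file, one group per system.
import os
import dpdata

data_root = "all_systems"   # placeholder: directory holding the ~50000 systems
out_file = "data.hdf5"

for name in sorted(os.listdir(data_root)):
    sys_dir = os.path.join(data_root, name)
    if not os.path.isdir(sys_dir):
        continue
    system = dpdata.LabeledSystem(sys_dir, fmt="deepmd/npy")
    system.to_deepmd_hdf5(f"{out_file}#/{name}")
```

If I read the HDF5 support correctly, such a system can then be referenced in the training input as "data.hdf5#/<name>".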
Thank you for your help and suggestion; happy to know about the coming features. @iProzd FYI.
Happy rabbit year~
Could you try https://github.com/deepmodeling/dpgen/pull/1119?
It takes 5 minutes to convert the 80000 frames in 50000 systems into a 275M HDF5 file.
The effect on training time is being tested.
The HDF5 file seems to worsen the total-time problem.
Original npy in 50000 systems, with data statistics before training not skipped:
DEEPMD INFO batch 398000 training time 15.25 s, testing time 0.00 s
...
DEEPMD INFO wall time: 8920.850 s
HDF5, with data statistics before training not skipped:
DEEPMD INFO batch 398000 training time 18.62 s, testing time 0.02 s
...
DEEPMD INFO wall time: 19959.018 s
Please hold the issue, @wanghan-iapcm.
The HDF5 test with load-data time for each training batch will be updated later.
HDF5, with data load time being printed (this version of the code also reports longer training time):
DEEPMD INFO batch 102000 training time 20.83 s, testing time 0.00 s, data load time: 77.26 s
...
DEEPMD INFO wall time: 20418.568 s
Do you use the same machine to compare them? I don't see why one has more training time with the same data.
this version of the code also reports longer training time
Can you submit a PR for this code?
It's the same cloud machine type. If there is a difference, users would encounter the same difference in their production environment.
It's from @iProzd.
You may run a short training with cProfile and provide the detailed profiling files in two cases: (1) on the slow machine; (2) on a faster machine. Then we can compare the difference between the two.
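One way to collect such a profile (assuming dp is installed as a normal Python entry point on PATH; file names are placeholders):
python -m cProfile -o train.prof $(which dp) train input.json
python -c "import pstats; pstats.Stats('train.prof').sort_stats('cumtime').print_stats(30)"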
Bug summary
Summary: effectively the same sets (~80000 frames); the same other params in the input; single GPU.
1) ~80000 frames in ~50000 systems, the task takes 52 hours; 2) ~80000 frames in ~17 systems takes 18 hours (type_mixed is used to collect the data)
DeePMD-kit Version
DeePMD-kit v2.1.5
TensorFlow Version
2.9.0
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
This was discussed with @iProzd previously, and the sets of 1) have been sent to him; it was said that I/O should not influence the training time after data statistics.
It took ~4 hours before the training actually started (data statistics, until lcurve.out starts to be written). The "training time" in the logs of both cases is effectively the same; note that disp_freq is 100 times larger for 1).
training time for 1) train_origin.log
training time for 2) train_typeSel.log
Steps to Reproduce
dp train
Further Information, Files, and Links
No response