deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0
1.43k stars, 494 forks

[BUG] Inconsistency between the batch_size specifications in input.json for tf and pt backends #3770

Closed: Yi-FanLi closed this issue 3 months ago

Yi-FanLi commented 3 months ago

Bug summary

The TensorFlow backend allows "batch_size" to be specified as a list. However, the PyTorch backend does not appear to accept this. Should the two backends behave consistently here?

DeePMD-kit Version

3.0.0a0

Backend and its version

PyTorch v2.0.0.post200-gc263bd43e8e

How did you download the software?

docker

Input Files, Running Commands, Error Log, etc.

The relevant part of input.json:

    "training_data": {
        "systems": [
            "O64H128"
        ],
        "batch_size": [
            1
        ]
    },
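For comparison, the DeePMD-kit documentation (as we understand it) describes `batch_size` for the TensorFlow backend as an integer applied to every system, a list with one entry per system in `systems`, or the string `"auto"` / `"auto:N"`, which picks the smallest batch size whose total atom count reaches 32 (or N). The sketch below normalizes those forms into one positive integer per system, the shape a per-system PyTorch `DataLoader` would need. It is not DeePMD-kit code: `normalize_batch_size` and its `natoms` argument are hypothetical, other string forms are simply rejected, and the `auto` rule is an assumption based on the documented behavior.

```python
import math
from typing import List, Sequence, Union

BatchSizeSpec = Union[int, str, Sequence[int]]


def normalize_batch_size(batch_size: BatchSizeSpec, natoms: Sequence[int]) -> List[int]:
    """Return one positive integer batch size per system (hypothetical helper).

    batch_size: an int, a list with one int per system, or "auto" / "auto:N".
    natoms: number of atoms per frame for each system (assumed to be known).
    """
    nsys = len(natoms)
    if isinstance(batch_size, bool):
        raise TypeError("batch_size must not be a bool")
    if isinstance(batch_size, int):
        # A single integer is shared by every system.
        sizes = [batch_size] * nsys
    elif isinstance(batch_size, str):
        # Assumed rule: "auto" targets 32 atoms per batch, "auto:N" targets N.
        if batch_size == "auto":
            target = 32
        elif batch_size.startswith("auto:"):
            target = int(batch_size.split(":", 1)[1])
        else:
            raise ValueError(f"unsupported batch_size string: {batch_size!r}")
        # Smallest batch size whose total atom count reaches the target.
        sizes = [math.ceil(target / n) for n in natoms]
    else:
        # A list must provide exactly one entry per system, in order.
        sizes = [int(b) for b in batch_size]
        if len(sizes) != nsys:
            raise ValueError(
                f"batch_size has {len(sizes)} entries but there are {nsys} systems"
            )
    if any(s <= 0 for s in sizes):
        raise ValueError(f"batch sizes must be positive, got {sizes}")
    return sizes


# The fragment above, assuming O64H128 holds 64 + 128 = 192 atoms per frame.
print(normalize_batch_size([1], natoms=[192]))       # [1]
print(normalize_batch_size("auto", natoms=[192]))    # [1]
print(normalize_batch_size("auto:512", natoms=[192]))  # [3]
```

Converting to a plain list of integers up front keeps the loader code simple: each per-system loader only ever sees an `int`, regardless of which form the user wrote in input.json.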

Error log:

    [2024-05-11 04:57:30,934] DEEPMD INFO --------------------------------------------------------------------------
    /opt/deepmd-kit/lib/python3.11/site-packages/deepmd/utils/compat.py:362: UserWarning: The argument training->numb_test has been deprecated since v2.0.0. Use training->validation_data->batch_size instead.
      warnings.warn(
    Traceback (most recent call last):
      File "/opt/deepmd-kit/bin/dp", line 10, in <module>
        sys.exit(main())
                 ^^^^^^
      File "/opt/deepmd-kit/lib/python3.11/site-packages/deepmd/main.py", line 805, in main
        deepmd_main(args)
      File "/opt/deepmd-kit/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
        return f(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^
      File "/opt/deepmd-kit/lib/python3.11/site-packages/deepmd/pt/entrypoints/main.py", line 308, in main
        train(FLAGS)
      File "/opt/deepmd-kit/lib/python3.11/site-packages/deepmd/pt/entrypoints/main.py", line 270, in train
        trainer = get_trainer(
                  ^^^^^^^^^^^^
      File "/opt/deepmd-kit/lib/python3.11/site-packages/deepmd/pt/entrypoints/main.py", line 166, in get_trainer
        ) = prepare_trainer_input_single(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/deepmd-kit/lib/python3.11/site-packages/deepmd/pt/entrypoints/main.py", line 149, in prepare_trainer_input_single
        train_data_single = DpLoaderSet(
                            ^^^^^^^^^^^^
      File "/opt/deepmd-kit/lib/python3.11/site-packages/deepmd/pt/utils/dataloader.py", line 116, in __init__
        system_dataloader = DataLoader(
                            ^^^^^^^^^^^
      File "/opt/deepmd-kit/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 357, in __init__
        batch_sampler = BatchSampler(sampler, batch_size, drop_last)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/deepmd-kit/lib/python3.11/site-packages/torch/utils/data/sampler.py", line 232, in __init__
        raise ValueError("batch_size should be a positive integer value, "
    ValueError: batch_size should be a positive integer value, but got batch_size=[1]
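The traceback ends in PyTorch's `BatchSampler`, reached from `DpLoaderSet` in `deepmd/pt/utils/dataloader.py`, which appears to forward the configured `batch_size` unchanged to each per-system `DataLoader`. The snippet below is an illustrative pure-Python approximation of that positive-integer check (`check_batch_size` is a hypothetical name, not a PyTorch function). It reproduces the error message for `[1]` and shows that selecting the list element for each system, one possible fix on the loader side, yields a valid integer.

```python
from typing import Any


def check_batch_size(batch_size: Any) -> int:
    # Approximation of the validation in torch.utils.data.BatchSampler:
    # only a positive int (and not a bool) is accepted.
    if isinstance(batch_size, bool) or not isinstance(batch_size, int) or batch_size <= 0:
        raise ValueError(
            "batch_size should be a positive integer value, "
            f"but got batch_size={batch_size}"
        )
    return batch_size


systems = ["O64H128"]
batch_size = [1]  # as in the input.json fragment above

# Passing the whole list reproduces the error in the log.
try:
    check_batch_size(batch_size)
except ValueError as err:
    print(err)  # batch_size should be a positive integer value, but got batch_size=[1]

# Hypothetical fix: pick the entry for each system before building its loader.
for i, system in enumerate(systems):
    bs = batch_size[i] if isinstance(batch_size, list) else batch_size
    print(system, check_batch_size(bs))  # O64H128 1
```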

Steps to Reproduce

See the attached tarball.

Further Information, Files, and Links

issue_batch_size.tar.gz

njzjz commented 3 months ago

Duplicate of #3475