NVIDIA / MinkowskiEngine

Minkowski Engine is an auto-diff neural network library for high-dimensional sparse tensors
https://nvidia.github.io/MinkowskiEngine

Neither train nor eval work in completion.py inside v0.5.4 container #593

Open Divelix opened 3 months ago

Divelix commented 3 months ago

It seems that the code in completion.py is outdated: neither train nor eval runs inside the container.

To Reproduce:

Docker image: pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel

ME installation:

RUN git clone --recursive "https://github.com/NVIDIA/MinkowskiEngine"
RUN cd MinkowskiEngine; python setup.py install --force_cuda --blas=openblas

ME version: 0.5.4

Run command (inside the MinkowskiEngine dir):

python -m examples.completion --eval

Expected behavior: the script runs without errors.

Eval error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/user/MinkowskiEngine/examples/completion.py", line 668, in <module>
    net.load_state_dict(checkpoint["state_dict"])
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CompletionNet:
        Missing key(s) in state_dict: "enc_block_s1.0.kernel", "enc_block_s1.1.bn.weight", "enc_block_s1.1.bn.bias", "enc_block_s1.1.bn.running_mean", "enc_block_s1.1.bn.running_var", "enc_block_s1s2.0.kernel", "enc_block_s1s2.1.bn.weight", "enc_block_s1s2.1.bn.bias", "enc_block_s1s2.1.bn.running_mean", "enc_block_s1s2.1.bn.running_var", "enc_block_s1s2.3.kernel", "enc_block_s1s2.4.bn.weight", "enc_block_s1s2.4.bn.bias", "enc_block_s1s2.4.bn.running_mean", "enc_block_s1s2.4.bn.running_var", "enc_block_s2s4.0.kernel", "enc_block_s2s4.1.bn.weight", "enc_block_s2s4.1.bn.bias", "enc_block_s2s4.1.bn.running_mean", "enc_block_s2s4.1.bn.running_var", "enc_block_s2s4.3.kernel", "enc_block_s2s4.4.bn.weight", "enc_block_s2s4.4.bn.bias", "enc_block_s2s4.4.bn.running_mean", "enc_block_s2s4.4.bn.running_var", "enc_block_s4s8.0.kernel", "enc_block_s4s8.1.bn.weight", "enc_block_s4s8.1.bn.bias", "enc_block_s4s8.1.bn.running_mean", "enc_block_s4s8.1.bn.running_var", "enc_block_s4s8.3.kernel", "enc_block_s4s8.4.bn.weight", "enc_block_s4s8.4.bn.bias", "enc_block_s4s8.4.bn.running_mean", "enc_block_s4s8.4.bn.running_var", "enc_block_s8s16.0.kernel", "enc_block_s8s16.1.bn.weight", "enc_block_s8s16.1.bn.bias", "enc_block_s8s16.1.bn.running_mean", "enc_block_s8s16.1.bn.running_var", "enc_block_s8s16.3.kernel", "enc_block_s8s16.4.bn.weight", "enc_block_s8s16.4.bn.bias", "enc_block_s8s16.4.bn.running_mean", "enc_block_s8s16.4.bn.running_var", "enc_block_s16s32.0.kernel", "enc_block_s16s32.1.bn.weight", "enc_block_s16s32.1.bn.bias", "enc_block_s16s32.1.bn.running_mean", "enc_block_s16s32.1.bn.running_var", "enc_block_s16s32.3.kernel", "enc_block_s16s32.4.bn.weight", "enc_block_s16s32.4.bn.bias", "enc_block_s16s32.4.bn.running_mean", "enc_block_s16s32.4.bn.running_var", "enc_block_s32s64.0.kernel", "enc_block_s32s64.1.bn.weight", "enc_block_s32s64.1.bn.bias", "enc_block_s32s64.1.bn.running_mean", "enc_block_s32s64.1.bn.running_var", "enc_block_s32s64.3.kernel", 
"enc_block_s32s64.4.bn.weight", "enc_block_s32s64.4.bn.bias", "enc_block_s32s64.4.bn.running_mean", "enc_block_s32s64.4.bn.running_var", "dec_block_s64s32.0.kernel", "dec_block_s64s32.1.bn.weight", "dec_block_s64s32.1.bn.bias", "dec_block_s64s32.1.bn.running_mean", "dec_block_s64s32.1.bn.running_var", "dec_block_s64s32.3.kernel", "dec_block_s64s32.4.bn.weight", "dec_block_s64s32.4.bn.bias", "dec_block_s64s32.4.bn.running_mean", "dec_block_s64s32.4.bn.running_var", "dec_s32_cls.kernel", "dec_s32_cls.bias", "dec_block_s32s16.0.kernel", "dec_block_s32s16.1.bn.weight", "dec_block_s32s16.1.bn.bias", "dec_block_s32s16.1.bn.running_mean", "dec_block_s32s16.1.bn.running_var", "dec_block_s32s16.3.kernel", "dec_block_s32s16.4.bn.weight", "dec_block_s32s16.4.bn.bias", "dec_block_s32s16.4.bn.running_mean", "dec_block_s32s16.4.bn.running_var", "dec_s16_cls.kernel", "dec_s16_cls.bias", "dec_block_s16s8.0.kernel", "dec_block_s16s8.1.bn.weight", "dec_block_s16s8.1.bn.bias", "dec_block_s16s8.1.bn.running_mean", "dec_block_s16s8.1.bn.running_var", "dec_block_s16s8.3.kernel", "dec_block_s16s8.4.bn.weight", "dec_block_s16s8.4.bn.bias", "dec_block_s16s8.4.bn.running_mean", "dec_block_s16s8.4.bn.running_var", "dec_s8_cls.kernel", "dec_s8_cls.bias", "dec_block_s8s4.0.kernel", "dec_block_s8s4.1.bn.weight", "dec_block_s8s4.1.bn.bias", "dec_block_s8s4.1.bn.running_mean", "dec_block_s8s4.1.bn.running_var", "dec_block_s8s4.3.kernel", "dec_block_s8s4.4.bn.weight", "dec_block_s8s4.4.bn.bias", "dec_block_s8s4.4.bn.running_mean", "dec_block_s8s4.4.bn.running_var", "dec_s4_cls.kernel", "dec_s4_cls.bias", "dec_block_s4s2.0.kernel", "dec_block_s4s2.1.bn.weight", "dec_block_s4s2.1.bn.bias", "dec_block_s4s2.1.bn.running_mean", "dec_block_s4s2.1.bn.running_var", "dec_block_s4s2.3.kernel", "dec_block_s4s2.4.bn.weight", "dec_block_s4s2.4.bn.bias", "dec_block_s4s2.4.bn.running_mean", "dec_block_s4s2.4.bn.running_var", "dec_s2_cls.kernel", "dec_s2_cls.bias", "dec_block_s2s1.0.kernel", 
"dec_block_s2s1.1.bn.weight", "dec_block_s2s1.1.bn.bias", "dec_block_s2s1.1.bn.running_mean", "dec_block_s2s1.1.bn.running_var", "dec_block_s2s1.3.kernel", "dec_block_s2s1.4.bn.weight", "dec_block_s2s1.4.bn.bias", "dec_block_s2s1.4.bn.running_mean", "dec_block_s2s1.4.bn.running_var", "dec_s1_cls.kernel", "dec_s1_cls.bias". 
        Unexpected key(s) in state_dict: "block1.0.kernel", "block1.1.bn.weight", "block1.1.bn.bias", "block1.1.bn.running_mean", "block1.1.bn.running_var", "block1.1.bn.num_batches_tracked", "block1.3.kernel", "block1.4.bn.weight", "block1.4.bn.bias", "block1.4.bn.running_mean", "block1.4.bn.running_var", "block1.4.bn.num_batches_tracked", "block1.6.kernel", "block1.7.bn.weight", "block1.7.bn.bias", "block1.7.bn.running_mean", "block1.7.bn.running_var", "block1.7.bn.num_batches_tracked", "block1.9.kernel", "block1.10.bn.weight", "block1.10.bn.bias", "block1.10.bn.running_mean", "block1.10.bn.running_var", "block1.10.bn.num_batches_tracked", "block1_cls.kernel", "block1_cls.bias", "block2.0.kernel", "block2.1.bn.weight", "block2.1.bn.bias", "block2.1.bn.running_mean", "block2.1.bn.running_var", "block2.1.bn.num_batches_tracked", "block2.3.kernel", "block2.4.bn.weight", "block2.4.bn.bias", "block2.4.bn.running_mean", "block2.4.bn.running_var", "block2.4.bn.num_batches_tracked", "block2_cls.kernel", "block2_cls.bias", "block3.0.kernel", "block3.1.bn.weight", "block3.1.bn.bias", "block3.1.bn.running_mean", "block3.1.bn.running_var", "block3.1.bn.num_batches_tracked", "block3.3.kernel", "block3.4.bn.weight", "block3.4.bn.bias", "block3.4.bn.running_mean", "block3.4.bn.running_var", "block3.4.bn.num_batches_tracked", "block3_cls.kernel", "block3_cls.bias", "block4.0.kernel", "block4.1.bn.weight", "block4.1.bn.bias", "block4.1.bn.running_mean", "block4.1.bn.running_var", "block4.1.bn.num_batches_tracked", "block4.3.kernel", "block4.4.bn.weight", "block4.4.bn.bias", "block4.4.bn.running_mean", "block4.4.bn.running_var", "block4.4.bn.num_batches_tracked", "block4_cls.kernel", "block4_cls.bias", "block5.0.kernel", "block5.1.bn.weight", "block5.1.bn.bias", "block5.1.bn.running_mean", "block5.1.bn.running_var", "block5.1.bn.num_batches_tracked", "block5.3.kernel", "block5.4.bn.weight", "block5.4.bn.bias", "block5.4.bn.running_mean", "block5.4.bn.running_var", 
"block5.4.bn.num_batches_tracked", "block5_cls.kernel", "block5_cls.bias", "block6.0.kernel", "block6.1.bn.weight", "block6.1.bn.bias", "block6.1.bn.running_mean", "block6.1.bn.running_var", "block6.1.bn.num_batches_tracked", "block6.3.kernel", "block6.4.bn.weight", "block6.4.bn.bias", "block6.4.bn.running_mean", "block6.4.bn.running_var", "block6.4.bn.num_batches_tracked", "block6_cls.kernel", "block6_cls.bias". 
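
The two key lists in the traceback above share no names at all: the model expects `enc_block_*` / `dec_block_*` parameters, while the checkpoint holds `block1` through `block6`. That suggests the downloaded weights were saved from a different network definition than the current `CompletionNet`. A minimal sketch of how the mismatch could be summarized, assuming the key names are available as plain strings (in the real script they would presumably come from `net.state_dict().keys()` and `checkpoint["state_dict"].keys()`); the helper names are hypothetical and not part of MinkowskiEngine:

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, mirroring the load_state_dict report."""
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)      # in the model, not in the checkpoint
    unexpected = sorted(checkpoint_keys - model_keys)   # in the checkpoint, not in the model
    return missing, unexpected


def top_level_prefixes(keys):
    """Collapse parameter names to their first dotted component."""
    return sorted({key.split(".", 1)[0] for key in keys})


# Short excerpts of the key names listed in the traceback, not the full lists.
model_keys = ["enc_block_s1.0.kernel", "enc_block_s1.1.bn.weight", "dec_s1_cls.kernel"]
checkpoint_keys = ["block1.0.kernel", "block1.1.bn.weight", "block1_cls.kernel"]

missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
print(top_level_prefixes(missing))     # ['dec_s1_cls', 'enc_block_s1']
print(top_level_prefixes(unexpected))  # ['block1', 'block1_cls']
```

Grouping by prefix makes it easier to see at a glance that the module names differ wholesale, rather than a few parameters being absent.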

Train error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/user/MinkowskiEngine/examples/completion.py", line 658, in <module>
    train(net, dataloader, device, config)
  File "/home/user/MinkowskiEngine/examples/completion.py", line 534, in train
    data_dict = train_iter.next()
AttributeError: '_MultiProcessingDataLoaderIter' object has no attribute 'next'
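
The train error looks like a Python 2 style call: the iterator object here has no public `.next()` method, but Python 3 iterators can be advanced with the built-in `next()`. A minimal sketch of the likely fix for line 534, using a plain list as a stand-in for the DataLoader (an assumption made only so the snippet runs without PyTorch):

```python
# Stand-in for the DataLoader in completion.py; any iterable behaves the same way.
loader = [{"batch": 0}, {"batch": 1}]

train_iter = iter(loader)

# Python 3 iterators implement __next__, not a public .next() method.
has_next_method = hasattr(train_iter, "next")

# The built-in next() calls __next__ and works on any iterator,
# so this replaces `data_dict = train_iter.next()`.
data_dict = next(train_iter)
print(has_next_method, data_dict)  # False {'batch': 0}
```

I have not checked whether the rest of the training loop needs further changes once this call is replaced.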