# Problems in saving the entire model and executing onnx

chuanzeruge commented 3 months ago

Thank you for sharing the project. I am a beginner in this field, and I have run into problems both when saving the entire PyTorch model and when exporting it to ONNX.

When saving the ViT model loaded from the provided checkpoint file, I get the following error:

Traceback (most recent call last):
  File "D:\py\Metric3D-main\export_pt.py", line 53, in <module>
    torch.save(model, EXPORT_PATH)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\serialization.py", line 441, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\serialization.py", line 653, in _save
    pickler.dump(obj)
AttributeError: Can't pickle local object 'LoRALayer.__init__.<locals>.<lambda>'

And when exporting to ONNX (running metric3d_onnx_export.py):

Traceback (most recent call last):
  File "D:\桌面\模型\py\Metric3D-main\onnx\metric3d_onnx_export.py", line 121, in <module>
    Fire(main)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\fire\core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\fire\core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "D:\anaconda3\envs\mt3d\lib\site-packages\fire\core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "D:\py\Metric3D-main\onnx\metric3d_onnx_export.py", line 108, in main
    torch.onnx.export(
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\onnx\utils.py", line 506, in export
    _export(
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\onnx\utils.py", line 1548, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\onnx\utils.py", line 1113, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\onnx\utils.py", line 989, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\onnx\utils.py", line 893, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\jit\_trace.py", line 1268, in _get_trace_graph
    outs = ONNXTracedModule(f, strict, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\jit\_trace.py", line 127, in forward
    graph, out = torch._C._create_graph_by_tracing(
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\jit\_trace.py", line 118, in wrapper
    outs.append(self.inner(*trace_inputs))
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1488, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "D:\py\Metric3D-main\onnx\metric3d_onnx_export.py", line 33, in forward
    pred_depth, confidence, output_dict = self.meta_arch.inference(
  File "C:\Users\12431/.cache\torch\hub\yvanyin_metric3d_main\mono\model\monodepth_model.py", line 12, in inference
    pred_depth, confidence, output_dict = self.forward(data)       
  File "C:\Users\12431/.cache\torch\hub\yvanyin_metric3d_main\mono\model\model_pipelines\__base_model__.py", line 13, in forward
    output = self.depth_model(**data)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1488, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "C:\Users\12431/.cache\torch\hub\yvanyin_metric3d_main\mono\model\model_pipelines\dense_pipeline.py", line 14, in forward
    features = self.encoder(input)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1488, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "C:\Users\12431/.cache\torch\hub\yvanyin_metric3d_main\mono\model\backbones\ViT_DINO_reg.py", line 1091, in forward
    ret = self.forward_features(*args, **kwargs)
  File "C:\Users\12431/.cache\torch\hub\yvanyin_metric3d_main\mono\model\backbones\ViT_DINO_reg.py", line 1017, in forward_features
    x = blk(x)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1488, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "C:\Users\12431/.cache\torch\hub\yvanyin_metric3d_main\mono\model\backbones\ViT_DINO_reg.py", line 749, in forward
    x = b(x)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1488, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "C:\Users\12431/.cache\torch\hub\yvanyin_metric3d_main\mono\model\backbones\ViT_DINO_reg.py", line 584, in forward
    x = x + attn_residual_func(x, attn_bias)
  File "C:\Users\12431/.cache\torch\hub\yvanyin_metric3d_main\mono\model\backbones\ViT_DINO_reg.py", line 559, in attn_residual_func
    return self.ls1(self.attn(self.norm1(x), attn_bias))
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\nn\modules\module.py", line 1488, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "C:\Users\12431/.cache\torch\hub\yvanyin_metric3d_main\mono\model\backbones\ViT_DINO_reg.py", line 475, in forward
    x = memory_efficient_attention(q, k, v)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\xformers\ops\fmha\__init__.py", line 193, in memory_efficient_attention
    return _memory_efficient_attention(
  File "D:\anaconda3\envs\mt3d\lib\site-packages\xformers\ops\fmha\__init__.py", line 291, in _memory_efficient_attention
    return _memory_efficient_attention_forward(
  File "D:\anaconda3\envs\mt3d\lib\site-packages\xformers\ops\fmha\__init__.py", line 311, in _memory_efficient_attention_forward
    out, *_ = op.apply(inp, needs_gradient=False)
  File "D:\anaconda3\envs\mt3d\lib\site-packages\xformers\ops\fmha\cutlass.py", line 186, in apply
    out, lse, rng_seed, rng_offset = cls.OPERATOR(
  File "D:\anaconda3\envs\mt3d\lib\site-packages\torch\_ops.py", line 502, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: unsupported output type: int, from operator: xformers::efficient_attention_forward_cutlass

I don't know if there is a certain aspect of the model architecture that doesn't support conversion.
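
On the first traceback: torch.save(model, ...) pickles the whole module object, and pickle cannot serialize a lambda created inside a method, which is what the message about LoRALayer.__init__.<locals>.<lambda> reports. The sketch below reproduces the same failure with the standard library only. LambdaLayer, its state() method, and try_pickle are hypothetical stand-ins, not code from this repository. Saving only the weights (in PyTorch, model.state_dict()) avoids pickling the lambda; whether that is enough for this use case is an assumption.

```python
import pickle


class LambdaLayer:
    """Hypothetical stand-in for a layer that stores a lambda created in __init__."""

    def __init__(self, scale):
        self.weight = [1.0, 2.0, 3.0]
        # A lambda defined inside a method is a "local object"; pickle cannot
        # look it up by name, so any object holding it fails to pickle.
        self.transform = lambda x: x * scale

    def state(self):
        # Plain data only, analogous in spirit to a state dict of weights.
        return {"weight": self.weight}


def try_pickle(obj):
    """Return True if obj can be pickled, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError) as exc:
        print(f"pickling failed: {exc}")
        return False


layer = LambdaLayer(scale=2.0)
whole_ok = try_pickle(layer)          # the whole object, lambda included
state_ok = try_pickle(layer.state())  # only the plain-data state
print("whole object picklable:", whole_ok)
print("state picklable:", state_ok)
```

The other common way out is to replace the lambda with a module-level function or a small class, which pickle can reference by name.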

JUGGHM commented 3 months ago

@ZachL1 Hi Zach, do you have any experience with this? I have never tested the code in a Windows environment; all experiments were conducted on Ubuntu systems.

ZachL1 commented 3 months ago

Sorry, I don't have a proper Windows environment either. It looks like an ONNX export issue, though, and @Owen-Liuyuxuan may be able to help.

Owen-Liuyuxuan commented 2 months ago

@chuanzeruge cc: @JUGGHM

This is because the xFormers operators are not exportable to ONNX.

The ViT model uses xFormers' memory-efficient implementation of attention if xformers is installed.
If xformers is not installed, the model falls back to MultiheadAttention, which is exportable.

My suggestions:

- Use a temporary/virtual environment without xformers to export the model.
- On many platforms xformers is easy to install and remove, so running pip3 uninstall xformers before exporting the model and pip3 install xformers afterward can help. (This is what I did.)
- Modify the code in backbones/ViT_*.py and adjust the use of xformers to suit your needs.
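
Before running the export script, it can help to confirm which of the two attention paths the current environment will take. The following is a minimal sketch using only the standard library; is_installed and check_export_environment are hypothetical helper names, and the messages simply restate the suggestions above.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if the top-level package can be found in this environment."""
    return importlib.util.find_spec(package) is not None


def check_export_environment() -> str:
    """Report whether xformers is present in the environment used for export."""
    if is_installed("xformers"):
        return ("xformers is installed: the ViT backbone will use its attention "
                "operator, which is not exportable. Uninstall it or use a "
                "separate environment for the export.")
    return "xformers not found: the backbone should use its fallback attention."


print(check_export_environment())
```

Running this in the same interpreter as metric3d_onnx_export.py shows whether the export would hit the xformers code path.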

chuanzeruge commented 2 months ago

@Owen-Liuyuxuan Thank you for your explanation. This is helpful for a novice like me. QvQ

TLescoatTFX commented 1 month ago

I also encounter this error on Ubuntu, and if I uninstall xformers, I get the error from #126 (about tensors on different devices).

How can I solve this to get an ONNX model? @Owen-Liuyuxuan @JUGGHM

Thank you

Owen-Liuyuxuan commented 1 month ago

@TLescoatTFX Can you try running the model with dummy input first, without exporting to ONNX?

That error comes from the PyTorch runtime rather than from the ONNX export itself. I suggest running inference without the ONNX export first, to get more detailed error logs.

TLescoatTFX commented 1 month ago

Thank you for the answer. I ran the model before exporting it to ONNX (via dummy_output = export_model(dummy_input) and checking the output shape), and it seems to work correctly.

Errors only show when calling torch.onnx.export:

/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/backbones/ViT_DINO_reg.py:984: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if pad_h == self.patch_size:
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/backbones/ViT_DINO_reg.py:986: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if pad_w == self.patch_size:
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/backbones/ViT_DINO_reg.py:235: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert H % patch_H == 0, f"Input image height {H} is not a multiple of patch height {patch_H}"
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/backbones/ViT_DINO_reg.py:236: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert W % patch_W == 0, f"Input image width {W} is not a multiple of patch width: {patch_W}"
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/backbones/ViT_DINO_reg.py:910: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if npatch == N and w == h:
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/backbones/ViT_DINO_reg.py:922: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  sqrt_N = math.sqrt(N)
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/backbones/ViT_DINO_reg.py:923: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  sx, sy = float(w0) / sqrt_N, float(h0) / sqrt_N
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/backbones/ViT_DINO_reg.py:931: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert int(w0) == patch_pos_embed.shape[-2]
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/backbones/ViT_DINO_reg.py:931: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert int(w0) == patch_pos_embed.shape[-2]
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/backbones/ViT_DINO_reg.py:932: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert int(h0) == patch_pos_embed.shape[-1]
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/backbones/ViT_DINO_reg.py:932: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert int(h0) == patch_pos_embed.shape[-1]
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py:894: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.isnan(vit_features[0]).any():
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py:896: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.isinf(vit_features[0]).any():
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py:908: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.isnan(en_ft).any():
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py:911: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.isinf(en_ft).any():
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py:919: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.isnan(ref_feat).any():
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py:921: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.isinf(ref_feat).any():
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py:815: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.isnan(prob).any():
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py:817: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.isinf(prob).any():
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py:831: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.isnan(d ).any():
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py:833: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.isinf(d ).any():
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py:842: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.isnan(normal_out).any():
/home/thibault/.cache/torch/hub/yvanyin_metric3d_main/mono/model/decode_heads/RAFTDepthNormalDPTDecoder5.py:844: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if torch.isinf(normal_out).any():
/home/thibault/miniconda3/envs/md/lib/python3.10/site-packages/torch/onnx/utils.py:689: UserWarning: Constant folding in symbolic shape inference fails: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select) (Triggered internally at ../torch/csrc/jit/passes/onnx/shape_type_inference.cpp:439.)
  _C._jit_pass_onnx_graph_shape_type_inference(
============= Diagnostic Run torch.onnx.export version 2.0.1+cu117 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

Traceback (most recent call last):
  File "/workspace/Metric3D/onnx/metric3d_onnx_export.py", line 134, in <module>
    Fire(main)
  File "/home/thibault/miniconda3/envs/md/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/thibault/miniconda3/envs/md/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/thibault/miniconda3/envs/md/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/Metric3D/onnx/metric3d_onnx_export.py", line 121, in main
    torch.onnx.export(
  File "/home/thibault/miniconda3/envs/md/lib/python3.10/site-packages/torch/onnx/utils.py", line 506, in export
    _export(
  File "/home/thibault/miniconda3/envs/md/lib/python3.10/site-packages/torch/onnx/utils.py", line 1548, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "/home/thibault/miniconda3/envs/md/lib/python3.10/site-packages/torch/onnx/utils.py", line 1180, in _model_to_graph
    params_dict = _C._jit_pass_onnx_constant_fold(
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

(By the way, I don't need the normals; if removing them lets me bypass the errors, I'm fine with that.)

Owen-Liuyuxuan commented 1 month ago

The warnings you get in the normal calculation are "normal" and are probably not the root cause of the error.

Could you provide your CUDA version, PyTorch version (including the CUDA version it was compiled against), and ONNX version?

TLescoatTFX commented 1 month ago

For the Python packages:

onnx                        1.16.2
onnxruntime                 1.19.0
onnxruntime-gpu             1.19.0
torch                       2.0.1
torchvision                 0.15.2

For CUDA:

GPU:    Nvidia RTX A5000
CUDA:   12.4
Driver: 550.90.07

Checking the CUDA version that PyTorch was built with, there seems to be a mismatch; I'm not sure whether it matters:

>>> torch.version.cuda
'11.7'
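
The package versions requested above can be gathered in one step. This is a minimal sketch that reads installed distribution metadata with the standard library, so it does not import torch itself; the package list mirrors the one posted above plus xformers, and installed_versions is a hypothetical helper name. The CUDA version PyTorch was built with still has to be read from torch.version.cuda as shown above.

```python
from importlib import metadata

# Distribution names taken from the list in this thread, plus xformers.
PACKAGES = ["torch", "torchvision", "onnx", "onnxruntime", "onnxruntime-gpu", "xformers"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name:<18}{version or 'not installed'}")
```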

TLescoatTFX commented 1 month ago

I installed torch 2.4, it is working with CUDA 12.1, and the export worked! Thank you very much!

Now, I just need to convert it to CoreML... :/

YvanYin / Metric3D, issue #127