NVIDIA-AI-IOT / torch2trt

An easy to use PyTorch to TensorRT converter
MIT License
4.55k stars · 671 forks

ValueError: __len__() should return >= 0 #521

Open whcjb opened 3 years ago

whcjb commented 3 years ago

When using torch2trt to convert `torch.eq`, an error occurs: `mm = torch.eq(mm, 0.)`, where `mm` is a tensor with `mm.shape = [3136, 1, 3, 3]`.

```
File "/media/cfs/torch2trt-master/examples/inpainting/model.py", line 329, in forward
    mm = torch.eq(mm, nn)
File "./torch2trt/torch2trt.py", line 285, in wrapper
    converter["converter"](ctx)
File "./torch2trt/converters/compare.py", line 26, in convert_gt
    return convert_elementwise(ctx, trt.ElementWiseOperation.EQUAL)
File "./torch2trt/converters/compare.py", line 9, in convert_elementwise
    input_a_trt, input_b_trt = broadcast_trt_tensors(ctx.network, [input_a_trt, input_b_trt], len(output.shape) - 1)
File "./torch2trt/torch2trt.py", line 170, in broadcast_trt_tensors
    if len(t.shape) < broadcast_ndim:
ValueError: __len__() should return >= 0
```
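For context, this `ValueError` is raised by CPython itself whenever an object's `__len__` returns a negative number. A minimal stand-in reproduces the exact message (the `BadDims` class here is hypothetical, mimicking the `trt.Dims` in the traceback):

```python
class BadDims:
    # Hypothetical stand-in for a trt.Dims whose reported length is
    # negative; CPython rejects any negative __len__ result.
    def __len__(self):
        return -1

try:
    len(BadDims())
    message = None
except ValueError as err:
    message = str(err)

print(message)  # __len__() should return >= 0
```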

whcjb commented 3 years ago

someone can help?

jaybdub commented 3 years ago

Hi @whcjb,

Thanks for reaching out!

I would guess this is because it is comparing against a scalar. We may need to update the converter to handle this case.
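The left-padding that `broadcast_trt_tensors` is meant to perform can be sketched in plain Python (illustrative only; the real function operates on TensorRT tensor objects, and `left_pad_shape` is a hypothetical helper). A Python scalar effectively has an empty shape, which is the suspected failure mode:

```python
def left_pad_shape(shape, ndim):
    # Sketch of the left-padding that broadcast_trt_tensors performs:
    # pad a tensor's shape with leading 1s up to the broadcast rank.
    # A scalar is treated here as having the empty shape ().
    shape = tuple(shape)
    diff = ndim - len(shape)
    return (1,) * diff + shape

print(left_pad_shape((3136, 1, 3, 3), 4))  # (3136, 1, 3, 3)
print(left_pad_shape((), 4))               # (1, 1, 1, 1) for a scalar
```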

I will try to look into this soon.

Best, John

pepinu commented 3 years ago


Hey @jaybdub,

I stumbled across pretty much the same error as the OP, and I can verify that it comes from t.shape being a scalar. In my case, the shape was (32744).

I've tried adding a simple case to the if statement here (torch2trt.py#L174) with the condition `not hasattr(t, '__len__')`; however, I cannot get `shape = tuple([1] * diff + list(t.shape))` to work, and I get the same error as the OP.

How should I go about it? I can make a PR after I get it to work.

jaybdub commented 3 years ago

Hi @pepinu,

Hmm, do you mind sharing the error you see at `shape = tuple([1] * diff + list(t.shape))`?

Also, thanks for your interest in addressing this!

It's difficult to tell exactly where the change should be applied without reproducing myself, but one other area of interest for this issue may be here

https://github.com/NVIDIA-AI-IOT/torch2trt/blob/44977a94cb087fe521421802e9df12a5ac3ceb3f/torch2trt/torch2trt.py#L140

This is where constant tensors are added to the TensorRT network for primitive types. Let me know if you discover anything here, or if there's anything you'd like me to investigate.

As general contributing guidelines, before integrating any solution we'll have to see if there are adverse side effects that might affect other models. One way to do this is to add module test cases that address this failure, and ensure that the existing test cases run.

Many of the converter files have examples of module test cases.

https://github.com/NVIDIA-AI-IOT/torch2trt/blob/44977a94cb087fe521421802e9df12a5ac3ceb3f/torch2trt/converters/compare.py#L51

The test cases may be run by calling

python3 -m torch2trt.test --name=converters --tolerance=1e-2

This test script was created for torch2trt and performs cross-validation of the outputs against PyTorch. It will simply highlight high errors in yellow, but not hard-fail, and it might not cover all use cases. If the change requires a special type of test, let me know.

Please let me know if this helps / you have any questions or if there is any way I can help.

Best, John

pepinu commented 3 years ago

Hey @jaybdub,

Thanks for the pointers, I'll take a look at this over the weekend.

Here is the earlier-mentioned error in more depth:

1. I split `shape = tuple([1] * diff + list(t.shape))` into three lines, as seen below:

   [screenshot: 2021-03-20 at 23:34:24]

   The error here is the same as the OP's, and it happens when `t.shape` is put into the list:

   [screenshot: 2021-03-20 at 23:38:17]

2. I tried to get it to work with this (L177):

   [screenshot: 2021-03-20 at 23:39:42]

but the error is thrown a few lines later:

[screenshot: 2021-03-21 at 23:49:33]

I suspect it would just have to be unpacked within the shape reported in the error? Hope this clarifies it a bit.

pepinu commented 3 years ago

@jaybdub

Alright so I did some testing, I think I identified what the problem might be but I'm not sure how to proceed.

Basically, the problem in my case is not that `t` is a scalar; it is that `t.shape` is a scalar. I've edited the last image in my earlier post because I had the wrong condition (`if not hasattr(t, '__len__')` will not catch this).

The problem: https://github.com/NVIDIA-AI-IOT/torch2trt/blob/44977a94cb087fe521421802e9df12a5ac3ceb3f/torch2trt/torch2trt.py#L174 In the issue scenario, `t.shape`, which is of type `trt.Dims`, has dimension 1 and looks like `(32548)`. It has a `__len__` method, but invoking it throws the error. I tried to write a workaround with a lambda, but `__len__` is read-only in this case, so no luck there.

However, even if all the `len()` calls are rewritten and the length is set arbitrarily to 1, the problem still persists here: https://github.com/NVIDIA-AI-IOT/torch2trt/blob/44977a94cb087fe521421802e9df12a5ac3ceb3f/torch2trt/torch2trt.py#L177
`list()` calls `len()` internally, which crashes the conversion.
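A guard of the kind being discussed can be sketched like this (`safe_ndim` and `BadDims` are hypothetical names; and since `list(t.shape)` re-triggers the same `ValueError` internally, a guard like this would only patch one call site):

```python
def safe_ndim(shape, fallback=1):
    # Hypothetical guard: len() on a misbehaving trt.Dims raises
    # ValueError, so catch it and fall back to an assumed rank.
    try:
        return len(shape)
    except ValueError:
        return fallback

class BadDims:
    # Stand-in for a trt.Dims whose reported length is negative.
    def __len__(self):
        return -1

print(safe_ndim((3, 3)))     # 2
print(safe_ndim(BadDims()))  # 1 (fallback)
```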

I've tried using brackets to put the `t.shape` object into a list, but the results are not the same:

[screenshot: 2021-03-22 at 00:09:12]

I couldn't find a way to reproduce the same representation of the `trt.Dims` as in the traceback: `list()` makes it `[32548]` while `tuple()` makes it `(32548,)`. I will look into extracting the `t.shape` value as represented when printed; maybe then I can somehow convert it inside.

I wonder if you have any pointers where I could look, maybe 1-dim tensor conversion is buggy?

Also, I'll try to put together a minimal reproducible example for this.

Best regards

pepinu commented 3 years ago

@jaybdub @whcjb

I found out the problem is that TensorRT is not able to process the Python slice operation in the same fashion torch does.

The network I was trying to port crashed on a `torch.add()` operation between two tensors, while converting a minimal `torch.add` op worked like a charm.

My model was cutting spatial dimensions using Python slicing instead of `torch.narrow`, which is recommended for tensors.
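For reference, `torch.narrow(x, dim, start, length)` selects the same elements as the slice `x[start:start+length]` along `dim`; a torch-free sketch over nested lists (the function name is illustrative, not torch2trt API):

```python
def narrow_first_dim(seq, start, length):
    # Pure-Python analogue of torch.narrow(x, 0, start, length):
    # keep `length` elements beginning at `start` along the first dim.
    return seq[start:start + length]

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(narrow_first_dim(rows, 0, 2))               # [[1, 2, 3], [4, 5, 6]]
print(rows[0:2] == narrow_first_dim(rows, 0, 2))  # True: slice-equivalent
```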

To check that this was the culprit, I wrote and tested two versions of a network that narrows dims and adds them together:

[screenshots: 2021-03-22 at 13:23:23, 13:23:27, 13:23:32, 13:23:50]

I think the screen is self-explanatory; here's a gist to reproduce this.

I'm not sure where to go from here; there should be some type check for slice within the lib. Hope it helps.

Best Regards

EDIT:

I looked at the last screen and see that the tensors do not match between the TRT and normal model, which is weird? I was sure that they matched while writing this...

Leerw commented 3 years ago


For my case, in https://github.com/NVIDIA-AI-IOT/torch2trt/blob/44977a94cb087fe521421802e9df12a5ac3ceb3f/torch2trt/torch2trt.py#L157

`shape=(576, 960)` and `weight.shape=(1, 1, 576, 960)`.

After running this line, I print `t._trt` and get:

```
[TensorRT] ERROR: [SHUFFLE #2] torch.Tensor.view(tensor(shape=[576], dtype=torch.float32), -1, 1): volume mismatch. Input dimensions [576] have volume 576 and output dimensions [1] have volume 1.
ValueError: __len__() should return >= 0
```
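The "volume mismatch" in the TensorRT error is a straightforward element-count check, sketched here (`view_compatible` is a hypothetical helper, not part of torch2trt):

```python
from math import prod

def view_compatible(in_shape, out_shape):
    # A reshape/view is only legal when the total element counts
    # (volumes) match, which is the check behind TensorRT's
    # "volume mismatch" error.
    return prod(in_shape) == prod(out_shape)

print(view_compatible((576,), (1,)))                  # False: 576 vs 1
print(view_compatible((1, 1, 576, 960), (576, 960)))  # True: both 552960
```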
DuyguSerbes commented 2 years ago

Guys, do you have a final solution regarding that issue?

pwais commented 2 years ago

+1. I am seeing a case where a tensor (perhaps a scalar) has a len of -1 according to TensorRT.

I also seem to run into similar errors if a tensor (or an argument to `forward`) is `None` (this should probably just be pruned from the TRT conversion?).

InfiniteLife commented 2 years ago

Same problem

RaiAmanRai commented 2 years ago

Hey @jaybdub, can you give some input on how long it will take before this is fixed?

Tegala commented 2 years ago

I'm hitting the same problem too; hoping for a solution @jaybdub

kct22aws commented 1 year ago

Any ETA on this problem? Without the fix, torch2trt won't work for many models I've tried: Hugging Face Vision Transformer, Swin Transformer, ViViT, etc.

iariav commented 1 year ago

@kct22aws +1 on that question

dcming commented 1 year ago

+1 on that question

Emilon1928 commented 1 year ago

+1 on that question

shuyangsun commented 10 months ago

+1 on that question

StanleyPain commented 2 months ago

2024 and no fix? Anyone get any traction on this?