dusty-nv / jetson-inference

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
https://developer.nvidia.com/embedded/twodaystoademo
MIT License

Installing python packages #925

Closed. Noam-M closed this issue 1 year ago.

Noam-M commented 3 years ago

Hello

I want to run my own python code using the container. I get an error message:

from scipy import ndimage
ModuleNotFoundError: No module named 'scipy'

Is there a way to automatically install the required packages? Or should I install each one manually (and if so, how do I do it)? I have a brand-new Xavier NX and little experience with the Python CLI.

Interestingly, running my script outside the container does not output an error about scipy, but it does complain about the next missing package (skimage).

Thanks!

dusty-nv commented 3 years ago

Hi @Noam-M, you would need to install scipy inside the container (with pip3 install scipy --verbose), by running that command from the container's terminal after you start it. After you install it, you will want to save your container to a new tag with docker commit: https://docs.docker.com/engine/reference/commandline/commit/
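For reference, a minimal sketch of the docker commit workflow on the host (the target image name and tag below are hypothetical):

sudo docker ps                                            # find the ID of the running container
sudo docker commit <container-id> jetson-inference:with-scipy
sudo docker run -it --rm --runtime nvidia --network host jetson-inference:with-scipy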

The alternative to docker commit is creating your own Dockerfile that uses the jetson-inference container as a base and installs the packages you need. Either way, I recommend that you get docker commit or a Dockerfile working first, before spending the time to install scipy, because I recall that it takes a while to install.
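A minimal sketch of that Dockerfile approach, assuming the dustynv/jetson-inference image from Docker Hub (pick the tag that matches your L4T version):

# Dockerfile
FROM dustynv/jetson-inference:r32.4.4
RUN pip3 install scipy

Then, from the directory containing that Dockerfile:

docker build -t jetson-inference:custom .
docker run -it --rm --runtime nvidia --network host jetson-inference:custom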

Note that if you aren't actually using the jetson-inference library, you can use the l4t-ml container instead, which already has scipy installed.

Noam-M commented 3 years ago

Thanks @dusty-nv for the advice! I tried installing scipy inside the container but got the following error:

root@nvidia:/# pip3 install scipy --verbose
Collecting scipy
  1 location(s) to search for versions of scipy:
  * https://pypi.python.org/simple/scipy/
  Getting page https://pypi.python.org/simple/scipy/
  Looking up "https://pypi.python.org/simple/scipy/" in the cache
  Returning cached "301 Moved Permanently" response (ignoring date and etag information)
  Looking up "https://pypi.org/simple/scipy/" in the cache
  No cache entry available
  Starting new HTTPS connection (1): pypi.org
  Incremented Retry for (url='/simple/scipy/'): Retry(total=4, connect=None, read=None, redirect=None, status=None)
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/scipy/
  Starting new HTTPS connection (2): pypi.org
  Incremented Retry for (url='/simple/scipy/'): Retry(total=3, connect=None, read=None, redirect=None, status=None)
  Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/scipy/
  Starting new HTTPS connection (3): pypi.org
  Incremented Retry for (url='/simple/scipy/'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/scipy/
  Starting new HTTPS connection (4): pypi.org
  Incremented Retry for (url='/simple/scipy/'): Retry(total=1, connect=None, read=None, redirect=None, status=None)
  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/scipy/
  Starting new HTTPS connection (5): pypi.org
  Incremented Retry for (url='/simple/scipy/'): Retry(total=0, connect=None, read=None, redirect=None, status=None)
  Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/scipy/
  Starting new HTTPS connection (6): pypi.org
  Could not fetch URL https://pypi.python.org/simple/scipy/: connection error: HTTPSConnectionPool(host='pypi.org', port=443): Max retries exceeded with url: /simple/scipy/ (Caused by ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)) - skipping
  Could not find a version that satisfies the requirement scipy (from versions: )
Cleaning up...
No matching distribution found for scipy
Exception information:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pip/basecommand.py", line 215, in main
    status = self.run(options, args)
  File "/usr/lib/python3/dist-packages/pip/commands/install.py", line 353, in run
    wb.build(autobuilding=True)
  File "/usr/lib/python3/dist-packages/pip/wheel.py", line 749, in build
    self.requirement_set.prepare_files(self.finder)
  File "/usr/lib/python3/dist-packages/pip/req/req_set.py", line 380, in prepare_files
    ignore_dependencies=self.ignore_dependencies))
  File "/usr/lib/python3/dist-packages/pip/req/req_set.py", line 554, in _prepare_file
    require_hashes
  File "/usr/lib/python3/dist-packages/pip/req/req_install.py", line 278, in populate_link
    self.link = finder.find_requirement(self, upgrade)
  File "/usr/lib/python3/dist-packages/pip/index.py", line 514, in find_requirement
    'No matching distribution found for %s' % req
pip.exceptions.DistributionNotFound: No matching distribution found for scipy

I could try to download the wheels and install them manually, but I'm afraid I will get version collisions, and that could create a mess.

Also, in order to use the l4t-ml container, do I have to build it locally? I tried cloning the repo and running docker_run.sh, and got this error:

nvidia@nvidia:~/jetson-containers$ ./scripts/docker_run.sh 
localuser:root being added to access control list
Unable to find image 'nvcr.io/nvidian/nvidia-l4t-base:r32.4' locally
docker: Error response from daemon: unauthorized: authentication required.
See 'docker run --help'.

Instead of cloning and building, I tried pulling and running, using the following commands:

sudo docker pull nvcr.io/nvidia/l4t-ml:r32.4.4-py3
sudo docker run -it --rm --runtime nvidia --network host -v /home/user/project:/location/in/container nvcr.io/nvidia/l4t-ml:r32.4.4-py3

Running the container itself actually did work, but now it couldn't find the cv2 package.

What is the correct way to use the l4t-ml container? Also, I would like to use the TensorRT optimizations, and I don't know whether the l4t-ml container has that ready.

So I'm not sure what's the best way to proceed: installing packages manually into the jetson-inference container, into the l4t-ml container, or maybe not using a container at all and trying another method?

Thanks again

dusty-nv commented 3 years ago

Running the container itself actually did work, but now it couldn't find the cv2 package.

cv2 was added to the l4t-ml container in the r32.5.0 release for JetPack 4.5. This is how I installed the version of OpenCV from JetPack into the container: https://github.com/dusty-nv/jetson-containers/blob/78148a41dba2718159dfa7179faa535eda03d0cc/Dockerfile.ml#L84 You could adapt this into your own Dockerfile, or rebuild my latest Dockerfiles on your JetPack.

If you don't care about the specific OpenCV version that comes with JetPack, you could just do a normal apt-get update && apt-get install ... to install Ubuntu's OpenCV into the container.
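For example, from inside the container (python3-opencv is the stock Ubuntu 18.04 package, not the JetPack build):

apt-get update && apt-get install -y python3-opencv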

Also, I would like to use the TensorRT optimizations, and I don't know whether the l4t-ml container has that ready.

Yes, l4t-ml/l4t-pytorch/l4t-tensorflow containers all include TensorRT
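A quick way to confirm from inside any of those containers is simply importing the Python bindings:

python3 -c "import tensorrt; print(tensorrt.__version__)"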

dusty-nv commented 3 years ago

Also, I am not sure why you couldn't install scipy through pip3; it seems like a network issue, because it works here.

Noam-M commented 3 years ago

Thank you @dusty-nv !

I'm still a bit worried that I'm getting myself into dependency hell, but I'm following your suggestions. Running everything from within the container is a safe sandbox for now, although I'll have to figure out later how to avoid reinstalling everything each time I run the container.

So now I'm working from within the l4t-ml container.

It seems like upgrading to JetPack 4.5 requires resetting the whole system, and I'm not sure if it's necessary.

I have successfully downloaded and installed OpenCV using apt-get. Then I needed the scikit-image package, but apt couldn't locate it, producing the following error:

E: Unable to locate package scikit-image

I also tried pip3, but no luck, for the same reason as before with scipy (which was on the jetson-inference container):

Collecting scikit-image
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/scikit-image/
  Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/scikit-image/
  Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/scikit-image/
  Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/scikit-image/
  Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/scikit-image/
  Could not find a version that satisfies the requirement scikit-image (from versions: )
No matching distribution found for scikit-image

It seems like not all packages are showing the same behavior. I tried restarting the container and repeating the commands and got the same results, so it's not related to the internet connection. Maybe different packages use different ports for downloading? Could it be that the container by default has some issues with that?

Thanks again!

dusty-nv commented 3 years ago

It seems like not all packages are showing the same behavior. I tried restarting the container and repeating the commands and got the same results, so it's not related to the internet connection. Maybe different packages use different ports for downloading? Could it be that the container by default has some issues with that?

If you are running the container with --network host, then all ports are passed through to the container. The jetson-inference container is run with this flag by default. It seems that you can download packages with apt but not with pip?

Noam-M commented 3 years ago

Yes. Using the l4t-ml container (with --network host), I am able to install opencv and scikit using apt-get but not pip. Similarly, using the jetson-inference container, I am able to install scipy and scikit using apt-get but not pip.

pip3 doesn't work for either of them, for some odd reason. Maybe it's related to the old pip version installed (9.0.1); I tried to upgrade it, but unfortunately I was unsuccessful at that too.

The problem is that I already get a dependency conflict between skimage and numpy. I hit the same problem when installing scipy and then scikit on the jetson-inference container:

from numpy.lib.arraypad import _validate_lengths
ImportError: cannot import name '_validate_lengths'

The installed packages I have (from apt list --installed) are:

python3-skimage/bionic,now 0.13.1-2 all [installed]
python3-numpy/bionic,now 1:1.13.3-2ubuntu1 arm64 [installed]

This is strange, since listing the installed packages using pip shows numpy (1.19.2). I tried to upgrade numpy using apt-get install python3-numpy --upgrade, but got this message: python3-numpy is already the newest version (1:1.13.3-2ubuntu1).
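A quick way to check which numpy actually gets imported, and from where:

python3 -c "import numpy; print(numpy.__version__, numpy.__file__)"

The apt package typically installs under /usr/lib/python3/dist-packages while pip installs under /usr/local, so the apt and pip listings can legitimately disagree about which version Python is really using.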

Noam-M commented 3 years ago

Hello @dusty-nv

I think trying to install each package individually is getting difficult, and I fear dependency hell. What do you think about using the TVM compiler stack to compile trained models for deployment?

Thanks again!

dusty-nv commented 3 years ago

@Noam-M I use TensorRT for the model runtime without depending on PyTorch/sklearn/etc.; it can import ONNX models.

Noam-M commented 3 years ago

@dusty-nv Is there a guide for using TensorRT? I have a PyTorch model; how do I convert it to an ONNX model so that it can be imported?

I must say I was super excited about jetson-inference, but now I feel a bit lost... I also still don't understand why pip doesn't work for me...

Many thanks

dusty-nv commented 3 years ago

You would need to use the torch.onnx functions - https://pytorch.org/docs/stable/onnx.html

Here is one such script that exports my PyTorch classification model to an ONNX model: https://github.com/dusty-nv/pytorch-classification/blob/15c7d51cd6a1271e5ee6959c3f91fb95b6b54ff0/onnx_export.py
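The general pattern from the torch.onnx docs looks roughly like this sketch (a torchvision model stands in for your own, and the file name is a placeholder):

import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # example input with the shape the model expects

torch.onnx.export(model, dummy_input, "resnet18.onnx",
                  input_names=["input_0"], output_names=["output_0"],
                  opset_version=11)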

dusty-nv commented 3 years ago

Here is the TensorRT guide: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#import_onnx_python
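For context, the ONNX import path in the TensorRT Python API looks roughly like this sketch (assuming the TensorRT 7.x that ships with JetPack 4.x; model.onnx is a placeholder):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
# ONNX models require an explicit-batch network definition
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))

config = builder.create_builder_config()
config.max_workspace_size = 1 << 28   # scratch memory TensorRT may use while building
engine = builder.build_engine(network, config)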

Noam-M commented 3 years ago

Thanks @dusty-nv

I've tried to look further into using TensorRT. It seems that I have to install TRTorch in order to import and use it. But it doesn't have actual support for Windows (which is the PC I'm using); it does support aarch64. So the question is whether it's a smart idea to install it on my Jetson and do the compilation directly there. Will the Jetson handle it well? (NVIDIA TRTorch installation instructions, TRTorch GitHub)

Also, there is a TensorRT container, but I'm not sure whether it's only for inference or also for compilation. Furthermore, if I do have to install TRTorch from scratch, then it would probably make more sense to do it in either the jetson-inference container or the l4t-ml container. What do you think?

Thanks again

dusty-nv commented 3 years ago

TRTorch doesn't export to ONNX, it uses PyTorch and TensorRT directly. You don't need TRTorch to export the PyTorch model to ONNX, and you don't need TRTorch to use TensorRT to load the ONNX model. The TensorRT API itself supports loading ONNX models.

Also, on Jetson, if you wanted to use one of the direct PyTorch->TensorRT approaches (without ONNX), I recommend the torch2trt tool instead: https://github.com/NVIDIA-AI-IOT/torch2trt
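torch2trt usage looks roughly like this sketch from its README (resnet18 is just a stand-in for your own model):

import torch
from torch2trt import torch2trt
from torchvision.models import resnet18

model = resnet18(pretrained=True).eval().cuda()
x = torch.ones((1, 3, 224, 224)).cuda()   # example input used to build the TensorRT engine

model_trt = torch2trt(model, [x])          # returns a module that runs the TensorRT engine
y_trt = model_trt(x)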

Also, there is a TensorRT container, but I'm not sure whether it's only for inference or also for compilation.

That container is for x86. On Jetson, TensorRT/CUDA/cuDNN are automatically in the l4t-base container image (along with its derivatives like l4t-pytorch and l4t-ml).

Noam-M commented 3 years ago

Hi @dusty-nv

I've been trying to work with ONNX following your recommendations.

I have successfully exported an ONNX model, but I still have a bug that doesn't let me proceed. Additionally, I have worked around some earlier issues that I would like your input on, to make sure I didn't add anything that will affect the inference runtime.

### The bug I am currently facing:

My network's input is two torch tensors, and I make sure to export it that way. But when loading the model using onnxruntime, I get an error that the model has only one input (not a tuple of two).

The code is:

import torch
import onnx
import onnxruntime
from torch.autograd import Variable

# create example image data
dataloader = BtsDataLoader(args, 'test')
with torch.no_grad():
  for _, sample in enumerate(dataloader.data):
    image = Variable(sample['image'].cuda())
    focal = Variable(sample['focal'].cuda())
    lpg8x8, lpg4x4, lpg2x2, reduc1x1, depth_est = model(image, focal)
    break

# export the model
input_names = [ "image" , "focal" ]
output_names = [ "lpg8x8", "lpg4x4", "lpg2x2", "reduc1x1", "depth_est"]

print('exporting model to ONNX...')
torch.onnx.export(model.module, (image, focal), rootDir + '/models/bts_nyu_v2_pytorch_densenet121/ONNX_model1', verbose=True, input_names=input_names, output_names=output_names, opset_version=13)

onnx_model = onnx.load(rootDir + '/models/bts_nyu_v2_pytorch_densenet121/ONNX_model1')
onnx.checker.check_model(onnx_model)
ort_session = onnxruntime.InferenceSession(rootDir + '/models/bts_nyu_v2_pytorch_densenet121/ONNX_model1')

# compute ONNX Runtime output prediction
ort_inputs = {"image": image.detach().cpu().numpy(), 
              "focal": focal.detach().cpu().numpy()}
ort_outs = ort_session.run(None, ort_inputs)

Then I get the following error:

[ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Invalid Feed Input Name:focal

Can you please help me solve this? It seems as if the export had an issue, since ort_session.get_inputs() returned a list of length 1.

### The bugs I have solved but I'm still worried about:

  1. My torch model uses the torch.nn.DataParallel(model) scheme, but running it through the ONNX exporter gives the following error: torch.nn.DataParallel is not supported by ONNX exporter. My solution was to simply export model.module instead, but I hope it doesn't affect the runtime. I'm pretty sure DataParallel is only relevant during training, but again, I would like to make sure.

  2. At first I got this error message while trying to export: Exporting the operator repeat_interleave to ONNX opset version 9 is not supported. Please feel free to request support or submit a pull request on PyTorch GitHub. Since the repeat_interleave function is not super complicated, I implemented it myself, and then it exported nicely (see the sketch after this list). My worry is that the PyTorch implementation is probably faster than mine, so I hope it didn't affect the timing. This one is a bit less worrisome than the previous bug, but I still want to make sure it's OK.
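A minimal sketch of one such reimplementation for a scalar repeats value and a non-negative dim, using only ops the exporter handles (illustrative only, not necessarily the exact code used here):

import torch

def repeat_interleave_scalar(x, repeats, dim):
    # equivalent to torch.repeat_interleave(x, repeats, dim) for an int repeats:
    # add a new axis after dim, expand it to repeats, then fold it back into dim
    out_shape = list(x.shape)
    out_shape[dim] = out_shape[dim] * repeats
    expand_shape = [-1] * (x.dim() + 1)
    expand_shape[dim + 1] = repeats
    return x.unsqueeze(dim + 1).expand(*expand_shape).reshape(out_shape)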

Thank you so much!

Noam-M commented 3 years ago

Hello @dusty-nv

Would you be able to please help me out with using 2 tensors as my network's input?

Thank you so much!

dusty-nv commented 3 years ago

Hi @Noam-M, sorry I haven't used onnxruntime before, so I'm not much assistance using it.

I wonder if ONNX Runtime thinks there is one input that is actually a tuple - i.e. it didn't break up the tuple into two separate inputs.
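One way to check what actually got recorded in the exported graph versus what ONNX Runtime expects (a small diagnostic sketch; the model path is a placeholder):

import onnx
import onnxruntime

path = "ONNX_model1"
model = onnx.load(path)
print("graph inputs:", [i.name for i in model.graph.input])     # names the exporter wrote

sess = onnxruntime.InferenceSession(path)
print("runtime inputs:", [i.name for i in sess.get_inputs()])   # names the feed dict must use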

Noam-M commented 3 years ago

Hello @dusty-nv

I previously thought that onnxruntime is the default go-to for deploying onnx models, but it appears I was wrong. So now I am trying to use TensorRT as you recommended.

First, I tried to use trtexec as explained in the TensorRT documentation, but I got an error message: bash: trtexec: command not found. This is strange, so I checked /usr/src/tensorrt/bin, and the trtexec binary is there.

Do you think it could be related to the fact that I can't seem to use pip install commands?

I would greatly appreciate your help. Thanks

Noam-M commented 3 years ago

Hey @dusty-nv

Do you have any idea how to solve this problem?

Thanks again!

dusty-nv commented 3 years ago

@Noam-M you should either add /usr/src/tensorrt/bin to your PATH, or execute trtexec from that directory
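For example, either of these works (trtexec arguments omitted here):

export PATH=$PATH:/usr/src/tensorrt/bin
trtexec --help

/usr/src/tensorrt/bin/trtexec --help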

Parham-khj commented 3 years ago

(quoting Noam-M's earlier comment above)
Hey @dusty-nv, I ran docker/run.sh and got a Docker container which is working correctly. I want to install new Python packages. Can I add new "RUN ..." lines to the Dockerfile? If yes, how can I rebuild a new container from the existing one? Should I run docker/build.sh? Thanks

dusty-nv commented 3 years ago

You can make a new Dockerfile and have FROM nvcr.io/nvidia/l4t-ml:r32.5.0 (or whatever your L4T version is for the tag)

Then add your RUN commands in your new Dockerfile

Then build it with docker build
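Something like this sketch (the package names are just examples; match the image tag to your L4T version):

# Dockerfile
FROM nvcr.io/nvidia/l4t-ml:r32.5.0
RUN apt-get update && apt-get install -y python3-skimage && rm -rf /var/lib/apt/lists/*
RUN pip3 install tqdm

Then, from the directory containing that Dockerfile:

docker build -t my-l4t-ml:custom .
docker run -it --rm --runtime nvidia --network host my-l4t-ml:custom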

Noam-M commented 3 years ago

Hello @dusty-nv

Thanks for your help. I added trtexec to my PATH and was able to run it. Unfortunately, I got errors when using it. I ran:

trtexec --onnx=<my_path>/ONNX_model --explicitBatch

and the relevant error in the output was:

ModelImporter.cpp:125: Clip_524 [Clip] inputs: [1425 -> (1, 1, 60, 80)], [2162 -> ()], [optional input, not set],
ERROR: builtin_op_importers.cpp:362 In function importClip:
[8] Assertion failed: inputs.at(2).is_weights() && "Clip max value must be an initializer!"

I found a GitHub issue relevant to my problem. The answer recommends to "set the max value in the Clip layer to convert it from ONNX to TensorRT engine. You can edit the ONNX graph to achieve this."

I don't know how to edit the ONNX graph and I didn't find proper documentation for it, so I thought I would just export the model from PyTorch to ONNX again and fix the issue at its source. Another problem is that I don't have an explicit Clip layer in my PyTorch model. Checking the output produced while exporting the PyTorch model to ONNX, I found 3 places (out of roughly 1640 lines of output) with the following form:

%1667 : Float(1, 1, 120, 160, strides=[19200, 19200, 160, 1], requires_grad=1, device=cuda:0) = onnx::ReduceL2[axes=[1], keepdims=1](%1666) # /usr/local/lib/python3.7/dist-packages/torch/functional.py:1420:0
%1669 : Tensor? = prim::Constant()
%1671 : Float(1, 1, 120, 160, strides=[19200, 19200, 160, 1], requires_grad=1, device=cuda:0) = onnx::Clip(%1667, %2232, %1669) # /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:4273:0

But I don't know how to find the part of the network that creates this.

Do you have any ideas or suggestions? Are there any good tutorials or courses you'd recommend that address these issues methodically?

Thanks!

Noam-M commented 3 years ago

Hello @dusty-nv

I tried to overcome the last error by modifying the ONNX graph explicitly. Since the Clip only had a minimum and not a maximum, that didn't work well, but using THIS I figured that if I delete the "min" input it should work fine, and it did! Also, I got similar results exporting the model again, this time with opset=13 (rather than opset=11).
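Roughly, the kind of graph edit that can be done with the onnx Python API (an illustrative sketch only; which input to drop depends on your graph, and the file names are placeholders):

import onnx

model = onnx.load("ONNX_model1")
for node in model.graph.node:
    if node.op_type == "Clip":
        print(node.name, list(node.input))   # inspect the [data, min, max] inputs
        # drop an empty/unset optional input if present
        if len(node.input) == 3 and node.input[2] == "":
            del node.input[2]

onnx.checker.check_model(model)
onnx.save(model, "ONNX_model1_edited")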

But now I got a different error:

ModelImporter.cpp:125: Unsqueeze_392 [Unsqueeze] inputs: [1277 -> ()], [1286 -> (1)], 
terminate called after throwing an instance of 'std::out_of_range'
  what():  Attribute not found: axes
Aborted (core dumped)

Looking closely at the graph using Netron shows that the axes input (labeled [1286 -> (1)]) is alive and well (screenshot: "Screenshot from 2021-04-14 19-35-52").

So I wonder why the ModelImporter (run via trtexec) could not find it. I tried running this command in JupyterLab:

print([node for node in graph.node if node.output[0] == '1286'])

and got the following output:

[output: "1286"
name: "Constant_391"
op_type: "Constant"
attribute {
  name: "value"
  t {
    dims: 1
    data_type: 7
    raw_data: "\000\000\000\000\000\000\000\000"
  }
  type: TENSOR
}
]

But when I tried to find the same constant using Netron, it wasn't shown in the graph. I don't have much experience with Netron, but I assume it just doesn't display the constants.

Any ideas? Thanks!