dusty-nv / jetson-inference

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
https://developer.nvidia.com/embedded/twodaystoademo
MIT License

Retraining SSD-Mobilenet #649

Closed arjuntx2 closed 1 year ago

arjuntx2 commented 4 years ago

Hi, I just want to clarify something:

detectnet.py --model=models/fruit/ssd-mobilenet.onnx --labels=models/fruit/labels.txt \
          --input-blob=input_0 --output-cvg=scores --output-bbox=boxes \
            csi://0

In this command, is detectnet.py the same as detectnet-camera.py when I want to run from the live camera? If so, I am getting the following error.

jetson.inference.__init__.py
jetson.inference -- initializing Python 2.7 bindings...
jetson.inference -- registering module types...
jetson.inference -- done registering module types
jetson.inference -- done Python 2.7 binding initialization
jetson.utils.__init__.py
jetson.utils -- initializing Python 2.7 bindings...
jetson.utils -- registering module functions...
jetson.utils -- done registering module functions
jetson.utils -- registering module types...
jetson.utils -- done registering module types
jetson.utils -- done Python 2.7 binding initialization
jetson.inference -- PyTensorNet_New()
jetson.inference -- PyDetectNet_Init()
jetson.inference -- detectNet loading network using argv command line params
jetson.inference -- detectNet.__init__() argv[0] = '/usr/local/bin/detectnet-camera.py'
jetson.inference -- detectNet.__init__() argv[1] = '--network=models/fruit/ssd-mobilenet.onnx'
jetson.inference -- detectNet.__init__() argv[2] = '--labels=models/fruit/labels.txt'
jetson.inference -- detectNet.__init__() argv[3] = '--input-blob=input_0'
jetson.inference -- detectNet.__init__() argv[4] = '--output-cvg=scores'
jetson.inference -- detectNet.__init__() argv[5] = '--output-bbox=boxes'

detectNet -- loading detection network model from:
          -- prototxt     NULL
          -- model        models/fruit/ssd-mobilenet.onnx
          -- input_blob   'data'
          -- output_cvg   'coverage'
          -- output_bbox  'bboxes'
          -- mean_pixel   0.000000
          -- mean_binary  NULL
          -- class_labels NULL
          -- threshold    0.500000
          -- batch_size   1

[TRT]   TensorRT version 7.1.0
[TRT]   loading NVIDIA plugins...
[TRT]   Plugin creator registration succeeded - ::GridAnchor_TRT
[TRT]   Plugin creator registration succeeded - ::NMS_TRT
[TRT]   Plugin creator registration succeeded - ::Reorg_TRT
[TRT]   Plugin creator registration succeeded - ::Region_TRT
[TRT]   Plugin creator registration succeeded - ::Clip_TRT
[TRT]   Plugin creator registration succeeded - ::LReLU_TRT
[TRT]   Plugin creator registration succeeded - ::PriorBox_TRT
[TRT]   Plugin creator registration succeeded - ::Normalize_TRT
[TRT]   Plugin creator registration succeeded - ::RPROI_TRT
[TRT]   Plugin creator registration succeeded - ::BatchedNMS_TRT
[TRT]   Could not register plugin creator:  ::FlattenConcat_TRT
[TRT]   Plugin creator registration succeeded - ::CropAndResize
[TRT]   Plugin creator registration succeeded - ::DetectionLayer_TRT
[TRT]   Plugin creator registration succeeded - ::Proposal
[TRT]   Plugin creator registration succeeded - ::ProposalLayer_TRT
[TRT]   Plugin creator registration succeeded - ::PyramidROIAlign_TRT
[TRT]   Plugin creator registration succeeded - ::ResizeNearest_TRT
[TRT]   Plugin creator registration succeeded - ::Split
[TRT]   Plugin creator registration succeeded - ::SpecialSlice_TRT
[TRT]   Plugin creator registration succeeded - ::InstanceNormalization_TRT
[TRT]   completed loading NVIDIA plugins.
[TRT]   detected model format - ONNX  (extension '.onnx')
[TRT]   desired precision specified for GPU: FASTEST
[TRT]   requested fasted precision for device GPU without providing valid calibrator, disabling INT8
[TRT]   native precisions detected for GPU:  FP32, FP16
[TRT]   selecting fastest native precision for GPU:  FP16
[TRT]   attempting to open engine cache file models/fruit/ssd-mobilenet.onnx.1.1.7100.GPU.FP16.engine
[TRT]   loading network profile from engine cache... models/fruit/ssd-mobilenet.onnx.1.1.7100.GPU.FP16.engine
[TRT]   device GPU, models/fruit/ssd-mobilenet.onnx loaded
[TRT]   Deserialize required 2046997 microseconds.
[TRT]   device GPU, CUDA engine context initialized with 3 bindings
[TRT]   binding -- index   0
               -- name    'input_0'
               -- type    FP32
               -- in/out  INPUT
               -- # dims  4
               -- dim #0  1 (SPATIAL)
               -- dim #1  3 (SPATIAL)
               -- dim #2  300 (SPATIAL)
               -- dim #3  300 (SPATIAL)
[TRT]   binding -- index   1
               -- name    'scores'
               -- type    FP32
               -- in/out  OUTPUT
               -- # dims  3
               -- dim #0  1 (SPATIAL)
               -- dim #1  3000 (SPATIAL)
               -- dim #2  3 (SPATIAL)
[TRT]   binding -- index   2
               -- name    'boxes'
               -- type    FP32
               -- in/out  OUTPUT
               -- # dims  3
               -- dim #0  1 (SPATIAL)
               -- dim #1  3000 (SPATIAL)
               -- dim #2  4 (SPATIAL)
[TRT]   INVALID_ARGUMENT: Cannot find binding of given name: data
[TRT]   binding to input 0 data  binding index:  -1
[TRT]   Parameter check failed at: engine.cpp::getBindingDimensions::1977, condition: bindIndex >= 0 && bindIndex < getNbBindings()
[TRT]   binding to input 0 data  dims (b=1 c=1 h=1 w=1) size=4
[TRT]   INVALID_ARGUMENT: Cannot find binding of given name: coverage
[TRT]   binding to output 0 coverage  binding index:  -1
[TRT]   Parameter check failed at: engine.cpp::getBindingDimensions::1977, condition: bindIndex >= 0 && bindIndex < getNbBindings()
[TRT]   binding to output 0 coverage  dims (b=1 c=1 h=1 w=1) size=4
[TRT]   INVALID_ARGUMENT: Cannot find binding of given name: bboxes
[TRT]   binding to output 1 bboxes  binding index:  -1
[TRT]   Parameter check failed at: engine.cpp::getBindingDimensions::1977, condition: bindIndex >= 0 && bindIndex < getNbBindings()
[TRT]   binding to output 1 bboxes  dims (b=1 c=1 h=1 w=1) size=4
device GPU, models/fruit/ssd-mobilenet.onnx initialized.
detectNet -- using ONNX model
detectNet -- maximum bounding boxes:  1
jetson.utils -- PyCamera_New()
jetson.utils -- PyCamera_Init()
[gstreamer] initialized gstreamer, version 1.14.5.0
[gstreamer] gstCamera attempting to initialize with GST_SOURCE_NVARGUS, camera 0
[gstreamer] gstCamera pipeline string:
nvarguscamerasrc sensor-id=0 ! video/x-raw(memory:NVMM), width=(int)1280, height=(int)720, framerate=30/1, format=(string)NV12 ! nvvidconv flip-method=0 ! video/x-raw ! appsink name=mysink
[gstreamer] gstCamera successfully initialized with GST_SOURCE_NVARGUS, camera 0
jetson.utils -- PyDisplay_New()
jetson.utils -- PyDisplay_Init()
[OpenGL] glDisplay -- X screen 0 resolution:  1920x1080
[OpenGL] glDisplay -- display device initialized
[gstreamer] opening gstCamera for streaming, transitioning pipeline to GST_STATE_PLAYING
[gstreamer] gstreamer changed state from NULL to READY ==> mysink
[gstreamer] gstreamer changed state from NULL to READY ==> capsfilter1
[gstreamer] gstreamer changed state from NULL to READY ==> nvvconv0
[gstreamer] gstreamer changed state from NULL to READY ==> capsfilter0
[gstreamer] gstreamer changed state from NULL to READY ==> nvarguscamerasrc0
[gstreamer] gstreamer changed state from NULL to READY ==> pipeline0
[gstreamer] gstreamer changed state from READY to PAUSED ==> capsfilter1
[gstreamer] gstreamer changed state from READY to PAUSED ==> nvvconv0
[gstreamer] gstreamer changed state from READY to PAUSED ==> capsfilter0
[gstreamer] gstreamer stream status CREATE ==> src
[gstreamer] gstreamer changed state from READY to PAUSED ==> nvarguscamerasrc0
[gstreamer] gstreamer changed state from READY to PAUSED ==> pipeline0
[gstreamer] gstreamer msg new-clock ==> pipeline0
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> capsfilter1
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> nvvconv0
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> capsfilter0
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> nvarguscamerasrc0
[gstreamer] gstreamer stream status ENTER ==> src
[gstreamer] gstreamer msg stream-start ==> pipeline0
GST_ARGUS: Creating output stream
CONSUMER: Waiting until producer is connected...
GST_ARGUS: Available Sensor modes :
GST_ARGUS: 2592 x 1944 FR = 29.999999 fps Duration = 33333334 ; Analog Gain range min 1.000000, max 16.000000; Exposure Range min 34000, max 550385000;

GST_ARGUS: 2592 x 1458 FR = 29.999999 fps Duration = 33333334 ; Analog Gain range min 1.000000, max 16.000000; Exposure Range min 34000, max 550385000;

GST_ARGUS: 1280 x 720 FR = 120.000005 fps Duration = 8333333 ; Analog Gain range min 1.000000, max 16.000000; Exposure Range min 22000, max 358733000;

GST_ARGUS: Running with following settings:
   Camera index = 0 
   Camera mode  = 2 
   Output Stream W = 1280 H = 720 
   seconds to Run    = 0 
   Frame Rate = 120.000005 
GST_ARGUS: Setup Complete, Starting captures for 0 seconds
GST_ARGUS: Starting repeat capture requests.
CONSUMER: Producer has connected; continuing.
[gstreamer] gstCamera onPreroll
[gstreamer] gstCamera -- allocated 16 ringbuffers, 1382400 bytes each
[gstreamer] gstreamer changed state from READY to PAUSED ==> mysink
[gstreamer] gstreamer msg async-done ==> pipeline0
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> mysink
[gstreamer] gstreamer changed state from PAUSED to PLAYING ==> pipeline0
[gstreamer] gstCamera -- allocated 16 RGBA ringbuffers
[TRT]   ../rtSafe/cuda/genericReformat.cu (1294) - Cuda Error in executeMemcpy: 1 (invalid argument)
[TRT]   FAILED_EXECUTION: std::exception
[TRT]   detectNet::Detect() -- failed to execute TensorRT context
Traceback (most recent call last):
  File "/usr/local/bin/detectnet-camera.py", line 61, in <module>
    detections = net.Detect(img, width, height, opt.overlay)
Exception: jetson.inference -- detectNet.Detect() encountered an error classifying the image
jetson.utils -- PyDisplay_Dealloc()
jetson.utils -- PyCamera_Dealloc()
[gstreamer] closing gstCamera for streaming, transitioning pipeline to GST_STATE_NULL
GST_ARGUS: Cleaning up
CONSUMER: Done Success
GST_ARGUS: Done Success
PyTensorNet_Dealloc()

Thank you for the help :)

dusty-nv commented 4 years ago

When you type that command into the console, are you running it with the \ backslashes at the end of the lines?

If so, remove those and just run it on one line.
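To illustrate why stray backslashes can matter: in a POSIX shell, a backslash immediately followed by a newline continues the command on the next line, whereas a backslash followed by a space (which is what you get when a multi-line command is pasted onto one line) escapes that space and makes it part of the next argument. The minimal Python sketch below uses the standard-library shlex module, which follows the same rule for an escaped space, to show the effect. The model path is shortened for illustration, and this is only a sketch of the mechanism, not a confirmed diagnosis of the log above.

```python
import shlex

# A multi-line command pasted onto ONE line, with the backslashes left in.
# (Paths shortened for illustration.)
pasted = (r"detectnet.py --model=ssd-mobilenet.onnx \ --input-blob=input_0 "
          r"\ --output-cvg=scores --output-bbox=boxes csi://0")

# The same command with each "backslash + space" pair removed.
cleaned = pasted.replace("\\ ", "")

bad_args = shlex.split(pasted)
good_args = shlex.split(cleaned)

# An escaped space becomes part of the following argument, so that argument
# no longer starts with "--" and an option parser will not recognize it.
for arg in bad_args:
    print(repr(arg))
```

With the backslashes left in, the third argument comes out as ' --input-blob=input_0' (note the leading space); with them removed, it is '--input-blob=input_0'.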

dusty-nv commented 4 years ago

Also, does it work when you run it as detectnet.py, but not detectnet-camera.py?

Can you try running it with the --debug flag added to get more info?

arjuntx2 commented 4 years ago

Yes, I am running it with the \ backslashes. I will remove them and give it a try :)

However, there was no file named detectnet in the pytorch-ssd directory when I cloned it. Do I need to copy detectnet-camera.py from build/aarch64/bin? I ask because I see a bash error (no command found named detectnet.py) when I run the command in the pytorch-ssd directory.

Thanks

dusty-nv commented 4 years ago

You need to do sudo make install. Maybe it is running the old detectnet-camera that hadn't been updated?

arjuntx2 commented 4 years ago

I will get back to you soon. Thanks a lot.

arjuntx2 commented 4 years ago

As you mentioned in the documentation:

To classify some static test images, we'll use the extended command-line parameters to detectnet (or detectnet.py) to load our custom SSD-Mobilenet ONNX model. To run these commands, the working directory of your terminal should still be located in: jetson-inference/python/training/detection/ssd/

In my case, when I go to this directory, there is no file named detectnet or detectnet.py in it.

So I get this error:

bash: detectnet: command not found

So in that directory, am I still supposed to run the following command? Am I making a mistake here?

detectnet --model=models/fruit/ssd-mobilenet.onnx --labels=models/fruit/labels.txt \
          --input-blob=input_0 --output-cvg=scores --output-bbox=boxes \
            "images/fruit*.jpg" test_fruit

In my case, I copied the files detectnet-console and detectnet-camera from build/aarch64/bin/ and pasted them into jetson-inference/python/training/detection/ssd/

After that, I removed the backslashes as you suggested, and I was able to build the engine by running detectnet-console.

I trained it on Boy/Girl data, and it is not detecting anything.

Error:

detectnet-console: writing 1067x1600 image to 'test_fruit'
[image] invalid extension format '.test_fruit' saving image 'test_fruit'
[image] valid extensions are: JPG/JPEG, PNG, TGA, BMP, and HDR.
detectnet-console: failed saving 1067x1600 image to 'test_fruit'

However, the example image is in .jpg format.
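Note that the error message refers to the output name 'test_fruit', not to the input image. As a minimal sketch, the check below tests a filename against the extensions listed in that message. The helper name is hypothetical, and the extension set is copied only from the "valid extensions are" line of the log; it is not taken from the library's source.

```python
from pathlib import Path

# Extensions copied from the "valid extensions are" line of the error message.
VALID_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tga", ".bmp", ".hdr"}

def has_valid_image_extension(filename: str) -> bool:
    """Return True if the filename ends in one of the listed extensions."""
    return Path(filename).suffix.lower() in VALID_EXTENSIONS

print(has_valid_image_extension("test_fruit"))      # name without an extension
print(has_valid_image_extension("test_fruit.jpg"))  # name with a listed extension
```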

Thank you for answering again :)

dusty-nv commented 4 years ago

In my case, when I go to this directory, there is no file named detectnet or detectnet.py in it.

When you do sudo make install, these get installed to /usr/local/bin, which means they should run from any directory.

You should run sudo make install from your jetson-inference/build directory.
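A quick way to confirm that the install step put the programs on your PATH is sketched below, using Python's standard shutil.which. The list of program names is taken from this thread; the helper name is hypothetical.

```python
import shutil

# Program names mentioned in this thread.
PROGRAMS = ["detectnet", "detectnet.py", "detectnet-camera.py"]

def locate_programs(names):
    """Map each program name to its resolved path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}

for name, path in locate_programs(PROGRAMS).items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If a name resolves to a path under /usr/local/bin, it should be runnable from any directory, as described above.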

In my case, I copied the files detectnet-console and detectnet-camera from build/aarch64/bin/ and pasted them into jetson-inference/python/training/detection/ssd/

You could also do the reverse: run the programs from jetson-inference/build/aarch64/bin and adjust the paths to your custom model. It doesn't actually need to be run from jetson-inference/python/training/detection/ssd/; doing so just makes the paths to your custom model shorter.
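As a sketch of the "adjust the paths" approach, the helper below builds the argument list with absolute model paths so that the command does not depend on the working directory. The function name is hypothetical, and the file names ssd-mobilenet.onnx and labels.txt are assumed from the commands earlier in this thread.

```python
from pathlib import Path

def detectnet_args(model_dir, source):
    """Build a detectnet argument list using absolute model paths."""
    # resolve() makes the path absolute even if it does not exist yet.
    model_dir = Path(model_dir).expanduser().resolve()
    return [
        "detectnet",
        f"--model={model_dir / 'ssd-mobilenet.onnx'}",
        f"--labels={model_dir / 'labels.txt'}",
        "--input-blob=input_0",
        "--output-cvg=scores",
        "--output-bbox=boxes",
        source,
    ]

for arg in detectnet_args("models/fruit", "csi://0"):
    print(arg)
```

The resulting list can be passed to subprocess.run, which also sidesteps shell quoting and line-continuation issues because no shell is involved.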

torabshaikh commented 4 years ago

Hi @dusty-nv, is it possible to retrain the model for only two new classes on another computer and then use it on the Jetson? I tried using DIGITS locally, but I kept running into one problem after another without completing a single epoch.

dusty-nv commented 4 years ago

Hi Torab, DIGITS is only supported on x86 (not Jetson), and the DIGITS portion of the Hello AI World tutorial is deprecated.

You can run the same PyTorch training code on a PC running Ubuntu, though (as long as the PC has an NVIDIA GPU). Just install PyTorch and torchvision on your PC first.


drmnasr commented 4 years ago

Hello @dusty-nv, I actually tried this. I retrained on an EC2 instance, then took the .pth file and converted it to ONNX on a Jetson Nano, but it didn't run: on the first run I got "killed" after "Graph construction and optimization completed in 0.103095 seconds." and the engine couldn't be created. I tried exactly the same thing on my Xavier NX and that worked fine. The model isn't that large: 10 epochs on 2 classes of Open Images, though with batch size 128.

Could that be the reason?

dusty-nv commented 4 years ago

batch size 128 though.

Yes, a larger batch size also increases memory usage, so you may want to try a smaller batch size for the Nano. Alternatively, mount swap on the Nano for when it is building the TensorRT engine.

drmnasr commented 4 years ago

@dusty-nv Thank you for your reply. I tried again with just 1 epoch and batch size 32 on the EC2 instance, then transferred the model to the Nano and mounted 4 GB of swap... same issue. I will try with batch size 12 and update you.

dusty-nv commented 4 years ago

Actually, the training batch size is independent of the ONNX batch size (which should always be 1). So it must be something else...

drmnasr commented 4 years ago

Then what could it be? It was a fresh install on the Nano, which works perfectly fine with the existing detectnet models, and my custom model works fine on the Xavier.

dusty-nv commented 4 years ago

Does it work with this model? https://nvidia.box.com/shared/static/gq0zlf0g2r258g3ldabl9o7vch18cxmi.gz

Also, when you did a fresh install on the Nano, it was of JetPack 4.4, right?

drmnasr commented 4 years ago

Nope, I just tested it and got the same thing: it gets killed. It only works with the pre-trained models included in this repo; I have tried 3 custom-trained models so far. And yes, it is a fresh install of 4.4.

[Screenshot attached]

dusty-nv commented 4 years ago

Can you run sudo tegrastats in another terminal to keep an eye on the memory usage? 'Killed' normally means out of memory... interesting, because it does load on my Nano here.
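If you prefer to log memory from a script rather than watching tegrastats, a minimal alternative sketch is to read the standard Linux /proc/meminfo file. This swaps in a different data source from the tegrastats tool suggested above; the function names and the sample values are illustrative and not taken from this thread.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values

def memory_summary(text):
    """Report available RAM and free swap in MB (0 if a field is missing)."""
    info = parse_meminfo(text)
    return {
        "available_mb": info.get("MemAvailable", 0) // 1024,
        "swap_free_mb": info.get("SwapFree", 0) // 1024,
    }

# Illustrative sample, not taken from the thread.
SAMPLE = """MemTotal:        4059712 kB
MemAvailable:     512000 kB
SwapTotal:       4194304 kB
SwapFree:        4194304 kB
"""
print(memory_summary(SAMPLE))
```

On the device itself, you could call memory_summary(open("/proc/meminfo").read()) periodically while the engine is being built.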

drmnasr commented 4 years ago

It does seem to be out of memory, yes, but why? The model is small and the pre-trained ones work.

[Screenshot attached]

dusty-nv commented 4 years ago

I'm not sure why it happens, since it doesn't run out of memory on my Nano. Can you try disabling the Ubuntu UI and rebooting?

To disable GUI - https://askubuntu.com/a/1056371

Then with GUI disabled, try running on a single test image. If it works, you can then re-enable the GUI because the TensorRT engine will already be built for next time.
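To check whether an engine has already been built, you can look for the cache file next to the model. The sketch below assumes the naming seen in the log at the top of this thread (models/fruit/ssd-mobilenet.onnx.1.1.7100.GPU.FP16.engine), i.e. the model file name followed by a suffix ending in .engine. The helper name is hypothetical, and the exact suffix will likely differ between TensorRT versions and precisions.

```python
import glob
from pathlib import Path

def cached_engines(model_path):
    """List files named '<model file>.<suffix>.engine' beside the model."""
    model = Path(model_path)
    # Escape the file name so characters like '[' are matched literally.
    pattern = glob.escape(model.name) + ".*.engine"
    return sorted(model.parent.glob(pattern))

engines = cached_engines("models/fruit/ssd-mobilenet.onnx")
print(engines if engines else "no cached engine found; it will be built on first load")
```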

drmnasr commented 4 years ago

That actually worked, but only for about 5 minutes! Then it froze at the tactic step. I will leave it for a while; tegrastats shows 3300+ utilization. I will update you if it changes; if not, I will restart it.

drmnasr commented 4 years ago

I left it for hours, and it is frozen in the middle of preparing the engine. Is there anything else I can do to free more memory?

dusty-nv commented 4 years ago

Hmm not sure why it isn't working for you.

Can you try running sudo systemctl disable nvzramconfig? Then reboot (and remount your 4GB swap if you need to)

dusty-nv commented 4 years ago

@drmnasr I think I found a way to work around this - comment out these lines:

https://github.com/dusty-nv/jetson-inference/blob/627b5890b49449573b7c1af8de22ce985fc395e4/c/detectNet.cpp#L112

//if( modelTypeFromPath(model) == MODEL_ONNX )
//      mWorkspaceSize = 2048 << 20;

Then re-run make and sudo make install
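For reference, the shift expression in those two lines works out to 2 GiB. The arithmetic below is exact; reading the value as the workspace size requested for ONNX models is an assumption based only on the variable name mWorkspaceSize.

```python
# Value assigned to mWorkspaceSize in the two lines above.
workspace_bytes = 2048 << 20          # 2048 * 2**20 bytes

workspace_mib = workspace_bytes // (1 << 20)
workspace_gib = workspace_bytes / (1 << 30)

print(workspace_bytes, workspace_mib, workspace_gib)  # 2147483648 2048 2.0
```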

torabshaikh commented 4 years ago

@drmnasr We were facing a similar problem with one of our Jetson Nanos. I was using detectnet.py to load the model, and it was taking hours to build the engine file, and only after failing multiple times. I switched from Python to the detectnet command when loading the model for the first time. It built the engine file, and then I switched back to Python to use the model.

drmnasr commented 4 years ago

@dusty-nv I can never thank you enough. It worked perfectly!! Thank you!

Rafipaulino commented 3 years ago

Hi @dusty-nv, has the original question in the post finally been resolved? I am having similar problems....INVALID_ARGUMENT: Cannot find binding of given name: coverage.

Any idea about what could be happening?

dusty-nv commented 3 years ago

I am having similar problems....INVALID_ARGUMENT: Cannot find binding of given name: coverage.

This seems like a different issue. It isn't getting the custom layer names. Are you sure you have the following command line arguments included? (check for typos)

--input-blob=input_0 --output-cvg=scores --output-bbox=boxes
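One way to spot this kind of mismatch is to compare the binding names the engine reports with the names it failed to find. The sketch below does that with two regular expressions over an excerpt of the log posted at the top of this thread; the function name is hypothetical.

```python
import re

def check_bindings(log_text):
    """Return (names the engine exposes, names that could not be found)."""
    available = re.findall(r"-- name\s+'([^']+)'", log_text)
    missing = re.findall(r"Cannot find binding of given name: (\S+)", log_text)
    return available, missing

# Excerpt of the log posted at the top of this thread.
LOG = """\
[TRT]   binding -- index   0
               -- name    'input_0'
[TRT]   binding -- index   1
               -- name    'scores'
[TRT]   binding -- index   2
               -- name    'boxes'
[TRT]   INVALID_ARGUMENT: Cannot find binding of given name: data
[TRT]   INVALID_ARGUMENT: Cannot find binding of given name: coverage
[TRT]   INVALID_ARGUMENT: Cannot find binding of given name: bboxes
"""

available, missing = check_bindings(LOG)
print("engine bindings:", available)
print("not found:", missing)
```

In that log, the names that were not found (data, coverage, bboxes) are the same values printed for input_blob, output_cvg, and output_bbox when the network was loaded, which suggests the --input-blob, --output-cvg, and --output-bbox arguments were not applied in that run.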

Rafipaulino commented 3 years ago

Oh sorry, my mistake!! You were right. It was a typo problem. Thanks!!!