dusty-nv / jetson-inference

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
https://developer.nvidia.com/embedded/twodaystoademo
MIT License
7.89k stars 2.99k forks

How to add Voice capability during inference in SSD Model #1089

Closed saifurrehman4114 closed 3 years ago

saifurrehman4114 commented 3 years ago

Hello @dusty-nv ,

Hope you are fine.

I have trained the SSD model using my own dataset for object detection.

I want to ask how to add voice capability: when the model detects an object during inference, the speaker should say that object's label.

This is the command I currently run for inference from the bash shell. Is there an argument I can add to it for this?

detectnet --model=models/dir/ssd-mobilenet.onnx --labels=models/dir/labels.txt --input-blob=input_0 --output-cvg=scores --output-bbox=boxes /dev/video0

saifurrehman4114 commented 3 years ago

@dusty-nv should I add a C++ text-to-speech library in the Docker container, or can I use a Python library such as pyttsx3 and add it to the right detectNet file? I am unable to find the file I need to change.

https://github.com/dusty-nv/jetson-inference/blob/627b5890b49449573b7c1af8de22ce985fc395e4/c/detectNet.cpp#L112

Could you kindly tell me the solution? I have to submit my final-year project before 15 June.

I have tried editing detectnet.py from the jetson-inference Python examples by adding pyttsx3, but the model is not found.

The code is as follows:

#!/usr/bin/python3
#
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
#

import jetson.inference
import jetson.utils

import argparse
import sys

# custom
import pyttsx3

# parse the command line
parser = argparse.ArgumentParser(description="Locate objects in a live camera stream using an object detection DNN.",
                                 formatter_class=argparse.RawTextHelpFormatter,
                                 epilog=jetson.inference.detectNet.Usage() + jetson.utils.videoSource.Usage() +
                                 jetson.utils.videoOutput.Usage() + jetson.utils.logUsage())

parser.add_argument("input_URI", type=str, default="", nargs='?', help="URI of the input stream")
parser.add_argument("output_URI", type=str, default="", nargs='?', help="URI of the output stream")
parser.add_argument("--network", type=str, default="ssd-mobilenet-v2", help="pre-trained model to load (see below for options)")
parser.add_argument("--overlay", type=str, default="box,labels,conf", help="detection overlay flags (e.g. --overlay=box,labels,conf)\nvalid combinations are: 'box', 'labels', 'conf', 'none'")
parser.add_argument("--threshold", type=float, default=0.5, help="minimum detection threshold to use")

is_headless = ["--headless"] if sys.argv[0].find('console.py') != -1 else [""]

try:
    opt = parser.parse_known_args()[0]
except:
    print("")
    parser.print_help()
    sys.exit(0)

# load the object detection network
net = jetson.inference.detectNet(opt.network, sys.argv, opt.threshold)

# create video sources & outputs
input = jetson.utils.videoSource(opt.input_URI, argv=sys.argv)
output = jetson.utils.videoOutput(opt.output_URI, argv=sys.argv+is_headless)

# adding text to speech custom
def text_speech(detections):
    engine = pyttsx3.init()
    engine.say(f'{detections}')
    engine.runAndWait()

# process frames until the user exits
while True:
    # capture the next image
    img = input.Capture()

    # detect objects in the image (with overlay)
    detections = net.Detect(img, overlay=opt.overlay)

    # txt to speech call
    text_speech(detections)

    # print the detections
    print("detected {:d} objects in image".format(len(detections)))

    for detection in detections:
        print(detection)

    # render the image
    output.Render(img)

    # update the title bar
    output.SetStatus("{:s} | Network {:.0f} FPS".format(opt.network, net.GetNetworkFPS()))

    # print out performance info
    net.PrintProfilerTimes()

    # exit on input/output EOS
    if not input.IsStreaming() or not output.IsStreaming():
        break

dusty-nv commented 3 years ago

Hi @saifurrehman4114, since pyttsx3 is a python library, I would customize detectnet.py.

You can see my follow-up to your forum post here: https://forums.developer.nvidia.com/t/how-to-add-voice-capability-during-inference-in-ssd-model/179999/5?u=dusty_nv
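
For readers customizing detectnet.py as suggested, here is a minimal sketch of one way to speak labels rather than raw detection objects. The posted script passes the whole detections list to engine.say(), so pyttsx3 would read out the list's string form on every frame. The LabelAnnouncer class, its cooldown_s parameter, and the injected speak callable are hypothetical names introduced for illustration; they are not part of jetson-inference or pyttsx3. The helper speaks each label at most once per cooldown period, so the same object is not announced on every frame.

```python
import time


class LabelAnnouncer:
    """Hypothetical helper: speak each label at most once per cooldown period."""

    def __init__(self, speak, cooldown_s=3.0, clock=time.monotonic):
        self.speak = speak            # callable that takes a string (e.g. a TTS wrapper)
        self.cooldown_s = cooldown_s  # minimum seconds between repeats of one label
        self.clock = clock            # injectable clock, which makes testing easy
        self._last_spoken = {}        # label -> time it was last announced

    def announce(self, labels):
        """Speak labels not announced recently and return the ones spoken."""
        now = self.clock()
        spoken = []
        # dict.fromkeys removes duplicate labels while keeping their order
        for label in dict.fromkeys(labels):
            last = self._last_spoken.get(label)
            if last is None or now - last >= self.cooldown_s:
                self.speak(label)
                self._last_spoken[label] = now
                spoken.append(label)
        return spoken


if __name__ == "__main__":
    # Stand-in for a real speech engine: print instead of speaking.
    announcer = LabelAnnouncer(speak=print, cooldown_s=3.0)
    announcer.announce(["person", "person", "dog"])
    announcer.announce(["person"])  # within the cooldown, so nothing is spoken

# Assumed wiring inside the detectnet.py loop (not executed here):
#   engine = pyttsx3.init()
#   def speak(text):
#       engine.say(text)
#       engine.runAndWait()
#   announcer = LabelAnnouncer(speak)
#   ...
#   labels = [net.GetClassDesc(d.ClassID) for d in detections]
#   announcer.announce(labels)
```

Note that runAndWait() blocks until speech finishes, so the video loop would pause while a label is spoken. If that is a problem, running the speech call in a separate thread is one option to consider.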