Closed by Armandpl 11 months ago
maybe I could export my model in the ONNX format and load it as-is on the Jetson? In that case I might be able to avoid installing torch, and if I can avoid torch, maybe I can have Poetry set everything up?
maybe I should have a setup.sh script to do things like install Poetry, disable the GUI, etc.
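A rough sketch of what that setup.sh could look like. The filename and the exact steps are assumptions (the Poetry installer URL and the systemd target are real, but whether they fit this project is untested):

```shell
#!/usr/bin/env bash
# setup.sh — hypothetical provisioning sketch for the Jetson (assumptions, untested)
set -euo pipefail

# Install Poetry via the official installer
curl -sSL https://install.python-poetry.org | python3 -

# Boot to console instead of the desktop, to free RAM/GPU on the Nano
sudo systemctl set-default multi-user.target

# Install project dependencies from pyproject.toml
poetry install
```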
decision: regarding dependencies, I think the easiest option is to switch to a Jetson Orin and use Poetry. Containers might be worth looking into, but I haven't found a good example of a robotic system using containers yet.
decision: for now, let's just implement a ZeroMQ bus
requirements: Act as an excellent embedded engineer. I am building an autonomous 1/10th-scale RC car to compete in a race. The control software runs on a Jetson Nano on the car. I would like to build a bus (similar to a CAN bus) to split my code into modules and make it readable, easy to debug, and robust (handling bugs, crashes, and edge cases). Here are my requirements.
Please ask any clarifying questions, and let me know if anything among the requirements doesn't make sense (I'm not an expert here). Then answer the questions in the requirements, then write the code. When writing the code, explain which ZeroMQ patterns you are using (e.g. the Suicidal Snail pattern, the Clone pattern, etc.)
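A minimal sketch of what the ZeroMQ bus could look like, using the plain Pub-Sub pattern from the ZeroMQ guide. The endpoint, topic names, and JSON framing are my assumptions, not a settled design; `pyzmq` is imported lazily so the framing helpers work on their own:

```python
import json

# Topic-prefixed framing, ZeroMQ-style: SUB sockets filter on the topic bytes.
# encode_msg/decode_msg are plain helpers so they can be tested without a broker.

def encode_msg(topic: str, payload: dict) -> list:
    """Build a two-part message: [topic frame, JSON payload frame]."""
    return [topic.encode(), json.dumps(payload).encode()]

def decode_msg(frames: list) -> tuple:
    """Inverse of encode_msg."""
    topic, payload = frames
    return topic.decode(), json.loads(payload.decode())

def publisher(endpoint: str = "tcp://127.0.0.1:5555"):
    """PUB side of the bus (requires pyzmq; the endpoint is an assumption)."""
    import zmq  # lazy import so the helpers above work without pyzmq installed
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PUB)
    sock.bind(endpoint)
    return sock

def subscriber(topic: str, endpoint: str = "tcp://127.0.0.1:5555"):
    """SUB side of the bus, filtered on a topic prefix."""
    import zmq
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.SUB)
    sock.connect(endpoint)
    sock.setsockopt(zmq.SUBSCRIBE, topic.encode())
    return sock
```

Usage would be something like the camera module calling `publisher().send_multipart(encode_msg("camera.meta", {...}))` while the control module reads from `subscriber("camera.")`. One known trade-off of plain PUB/SUB: slow joiners silently miss early messages, which is probably fine for a streaming control loop.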
doing some profiling: the NN takes ~21 ms for inference and the pre-processing takes ~15.8 ms, so we can't run inference at 30 fps (33 ms max per frame)
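The quick arithmetic behind that claim, using the measured numbers:

```python
# Frame-budget check with the measured timings.
INFERENCE_MS = 21.0        # measured NN inference time
PREPROCESS_MS = 15.8       # measured pre-processing time
BUDGET_MS = 1000.0 / 30    # ~33.3 ms per frame at 30 fps

total_ms = INFERENCE_MS + PREPROCESS_MS   # 36.8 ms, over budget
max_fps = 1000.0 / total_ms               # ~27 fps
print(f"{total_ms:.1f} ms/frame -> {max_fps:.1f} fps")
```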
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    20                                           @profile
    21                                           def main():
    22         1          0.0      0.0      0.0      camera = CSICamera(
    23         1          0.0      0.0      0.0          width=640,
    24         1          0.0      0.0      0.0          height=360,
    25         1          0.0      0.0      0.0          capture_width=1280,
    26         1          0.0      0.0      0.0          capture_height=720,
    27         1        925.9    925.9      2.7          capture_fps=30,
    28                                               )
    29
    30         1          0.1      0.1      0.0      print("go")
    31      1000          2.0      0.0      0.0      for i in range(1000):
    32      1000      17475.2     17.5     51.0          image = camera.read()
    33      1000      15845.7     15.8     46.3          image = preprocess(image)
    34
    35         1          0.1      0.1      0.0      print("done")

Total time: 15.773 s
File: bench_preprocessing.py
Function: preprocess at line 43

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    43                                           @profile
    44                                           def preprocess(image):
    45      1000         17.2      0.0      0.1      image = image[CROP_TOP:CROP_TOP+CROP_H, CROP_LEFT:CROP_LEFT+CROP_W]
    46      1000       8198.5      8.2     52.0      image = PIL.Image.fromarray(image)
    47      1000       2873.7      2.9     18.2      image = transforms.functional.resize(image, (224, 224))
    48      1000       4352.4      4.4     27.6      image = transforms.functional.to_tensor(image).cuda().half()
    49      1000        306.9      0.3      1.9      image.sub_(mean[:, None, None]).div_(std[:, None, None])
    50      1000         24.4      0.0      0.2      return image[None, ...]
```
Converting from np.array to PIL and back to a tensor seems wasteful? Is it doing the BGR-to-RGB conversion? Let's save the PIL image and see.
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    20                                           @profile
    21                                           def main():
    22         1          0.0      0.0      0.0      camera = CSICamera(
    23         1          0.0      0.0      0.0          width=640,
    24         1          0.0      0.0      0.0          height=360,
    25         1          0.0      0.0      0.0          capture_width=1280,
    26         1          0.0      0.0      0.0          capture_height=720,
    27         1        877.6    877.6      2.6          capture_fps=30,
    28                                               )
    29
    30         1          0.1      0.1      0.0      print("go")
    31      1000          2.7      0.0      0.0      for i in range(1000):
    32      1000      22356.4     22.4     65.4          image = camera.read()
    33      1000      10963.5     11.0     32.1          image = preprocess(image)
    34
    35         1          0.1      0.1      0.0      print("done")

Total time: 10.8717 s
File: bench_preprocessing.py
Function: preprocess at line 54

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    54                                           @profile
    55                                           def preprocess(image):
    56                                               # Crop
    57      1000         18.2      0.0      0.2      image = image[CROP_TOP:CROP_TOP+CROP_H, CROP_LEFT:CROP_LEFT+CROP_W]
    58
    59                                               # Convert BGR to RGB
    60      1000        252.1      0.3      2.3      image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    61
    62                                               # Convert to tensor
    63      1000         35.0      0.0      0.3      image = torch.from_numpy(image)
    64      1000       3019.9      3.0     27.8      image = transforms.functional.convert_image_dtype(image, torch.float32)
    65
    66                                               # Move channel dimension to the beginning
    67      1000         55.0      0.1      0.5      image = image.permute(2, 0, 1)
    68
    69      1000       5071.4      5.1     46.6      image = transforms.functional.resize(image, (224, 224))
    70      1000       2037.1      2.0     18.7      image = image.cuda().half()
    71
    72      1000        353.8      0.4      3.3      image.sub_(mean[:, None, None]).div_(std[:, None, None])
    73
    74      1000         29.2      0.0      0.3      return image[None, ...]
```
OK, so it seems like I can crop and resize using nvvidconv. It was already resizing the camera stream, so we were in fact resizing twice.
```python
return (f'nvarguscamerasrc sensor-id={self.capture_device} ! video/x-raw(memory:NVMM), width={self.capture_width}, height={self.capture_height}, '
        f'format=(string)NV12, framerate=(fraction){self.capture_fps}/1'
        f' ! nvvidconv top={self.CROP_TOP} bottom={self.CROP_BOTTOM} left={self.CROP_LEFT} right={self.CROP_RIGHT} ! video/x-raw, width=(int){self.width}, height=(int){self.height}, '
        'format=(string)BGRx ! videoconvert ! appsink')
```
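As a standalone mirror of that method (parameter names and defaults are my assumptions), the pipeline string can be assembled and inspected without a camera attached:

```python
def gst_pipeline(capture_device=0, capture_width=1280, capture_height=720,
                 capture_fps=30, crop_top=0, crop_bottom=720, crop_left=0,
                 crop_right=1280, width=224, height=224) -> str:
    """Assemble the nvarguscamerasrc -> nvvidconv -> appsink pipeline string.

    nvvidconv does the crop (top/bottom/left/right) and the scale to
    width x height in one hardware-accelerated step.
    """
    return (f'nvarguscamerasrc sensor-id={capture_device} ! '
            f'video/x-raw(memory:NVMM), width={capture_width}, height={capture_height}, '
            f'format=(string)NV12, framerate=(fraction){capture_fps}/1'
            f' ! nvvidconv top={crop_top} bottom={crop_bottom} left={crop_left} right={crop_right}'
            f' ! video/x-raw, width=(int){width}, height=(int){height}, '
            'format=(string)BGRx ! videoconvert ! appsink')
```

The resulting string would then be handed to `cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)` on the Jetson.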
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    20                                           @profile
    21                                           def main():
    22         1          0.0      0.0      0.0      camera = CSICamera(
    23         1          0.0      0.0      0.0          width=224,
    24         1          0.0      0.0      0.0          height=224,
    25         1          0.0      0.0      0.0          capture_width=1280,
    26         1          0.0      0.0      0.0          capture_height=720,
    27         1        869.5    869.5      2.5          capture_fps=30,
    28                                               )
    29
    30         1          0.2      0.2      0.0      print("go")
    31      1000          2.2      0.0      0.0      for i in range(1000):
    32      1000      27819.8     27.8     81.4          image = camera.read()
    33      1000       5492.3      5.5     16.1          image = preprocess(image)
    34
    35         1          0.1      0.1      0.0      print("done")

Total time: 5.40918 s
File: bench_preprocessing.py
Function: preprocess at line 52

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    52                                           @profile
    53                                           def preprocess(image):
    54                                               # Crop
    55                                               # image = image[CROP_TOP:CROP_TOP+CROP_H, CROP_LEFT:CROP_LEFT+CROP_W]
    56
    57                                               # Convert BGR to RGB
    58      1000        120.4      0.1      2.2      image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    59
    60                                               # Convert to tensor
    61      1000         37.6      0.0      0.7      image = torch.from_numpy(image)
    62      1000       2770.1      2.8     51.2      image = transforms.functional.convert_image_dtype(image, torch.float32)
    63
    64                                               # Move channel dimension to the beginning
    65      1000         52.2      0.1      1.0      image = image.permute(2, 0, 1)
    66
    67                                               # image = transforms.functional.resize(image, (224, 224))
    68      1000       2071.5      2.1     38.3      image = image.cuda().half()
    69
    70      1000        327.8      0.3      6.1      image.sub_(mean[:, None, None]).div_(std[:, None, None])
    71
    72      1000         29.6      0.0      0.5      return image[None, ...]
```
So now 5.5 ms (pre-processing) + 21 ms (inference) = 26.5 ms per frame, i.e. ~37 Hz, which should be enough for what we do. We could maybe shave a few more ms out of the pre-processing by using dusty-nv/jetson-utils, or use a smaller network, e.g. SqueezeNet, but in sim 30 Hz is enough to control the car, so this should be good. I still need to profile adding a GRU on top of the ResNet, but I expect it will stay doable at 30 Hz; if not, SqueezeNet it is.
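Before profiling the GRU, the budget already tells us how slow it is allowed to be. Using the measured numbers from above:

```python
# Headroom left in the 30 Hz budget for adding a GRU on top of the ResNet.
BUDGET_MS = 1000.0 / 30   # 33.3 ms per frame at 30 Hz
PREPROCESS_MS = 5.5       # measured above, after moving crop/resize to nvvidconv
RESNET_MS = 21.0          # inference time measured earlier

headroom_ms = BUDGET_MS - (PREPROCESS_MS + RESNET_MS)   # ~6.8 ms
print(f"GRU must run in under {headroom_ms:.1f} ms to stay at 30 Hz")
```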
pre-commit doesn't run on the Jetson Nano; I don't know why, maybe because of Python 3.6. I need to find another way to lint, sort imports, etc.
The current embedded code is a single spaghetti Python script. It served us well, and it's time to let it go.
The ideal system has a few components:
Now, I have no idea how to write this. A couple of ideas:
Exploratory tasks: