Closed by Armandpl 11 months ago
maybe I could export my model in the ONNX format and load it as-is on the Jetson? In that case I might be able to avoid installing torch, and if I can avoid torch, maybe I can have Poetry set everything up?
maybe I should have a setup.sh script to do things like install Poetry, disable the GUI, etc.
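A rough sketch of what that setup.sh could look like. The filename and the exact steps are assumptions (the Poetry installer URL and the systemd target are real, but whether they fit this project is untested):

```shell
#!/usr/bin/env bash
# setup.sh — hypothetical provisioning sketch for the Jetson (assumptions, untested)
set -euo pipefail

# Install Poetry via the official installer
curl -sSL https://install.python-poetry.org | python3 -

# Boot to console instead of the desktop, to free RAM/GPU on the Nano
sudo systemctl set-default multi-user.target

# Install project dependencies from pyproject.toml
poetry install
```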
decision: regarding dependencies, I think the easiest option is to switch to a Jetson Orin and use Poetry. Containers might be worth looking into, but I haven't found a good example of a robotic system using containers yet.
decision: for now, let's just implement a ZeroMQ bus
requirements: Act as an excellent embedded engineer. I am building an autonomous 1/10th-scale RC car to compete in a race. The control software runs on a Jetson Nano on the car. I would like to build a bus (similar to a CAN bus) to split my code into modules and make it readable, easy to debug, and robust (handling bugs, crashes, and edge cases). Here are my requirements.
Please ask any clarifying questions, and let me know if anything among the requirements doesn't make sense (I'm not an expert here). Then answer the questions in the requirements, then write the code. When writing the code, explain which ZeroMQ patterns you are using (e.g. the Suicidal Snail pattern, the Clone pattern, etc.)
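A minimal sketch of what the ZeroMQ bus could look like, using the plain Pub-Sub pattern from the ZeroMQ guide. The endpoint, topic names, and JSON framing are my assumptions, not a settled design; `pyzmq` is imported lazily so the framing helpers work on their own:

```python
import json

# Topic-prefixed framing, ZeroMQ-style: SUB sockets filter on the topic bytes.
# encode_msg/decode_msg are plain helpers so they can be tested without a broker.

def encode_msg(topic: str, payload: dict) -> list:
    """Build a two-part message: [topic frame, JSON payload frame]."""
    return [topic.encode(), json.dumps(payload).encode()]

def decode_msg(frames: list) -> tuple:
    """Inverse of encode_msg."""
    topic, payload = frames
    return topic.decode(), json.loads(payload.decode())

def publisher(endpoint: str = "tcp://127.0.0.1:5555"):
    """PUB side of the bus (requires pyzmq; the endpoint is an assumption)."""
    import zmq  # lazy import so the helpers above work without pyzmq installed
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PUB)
    sock.bind(endpoint)
    return sock

def subscriber(topic: str, endpoint: str = "tcp://127.0.0.1:5555"):
    """SUB side of the bus, filtered on a topic prefix."""
    import zmq
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.SUB)
    sock.connect(endpoint)
    sock.setsockopt(zmq.SUBSCRIBE, topic.encode())
    return sock
```

Usage would be something like the camera module calling `publisher().send_multipart(encode_msg("camera.meta", {...}))` while the control module reads from `subscriber("camera.")`. One known trade-off of plain PUB/SUB: slow joiners silently miss early messages, which is probably fine for a streaming control loop.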
doing some profiling: the NN takes ~21 ms for inference and the pre-processing takes ~15.8 ms, so we can't run inference at 30 fps (33 ms max per frame)
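The quick arithmetic behind that claim, using the measured numbers:

```python
# Frame-budget check with the measured timings.
INFERENCE_MS = 21.0        # measured NN inference time
PREPROCESS_MS = 15.8       # measured pre-processing time
BUDGET_MS = 1000.0 / 30    # ~33.3 ms per frame at 30 fps

total_ms = INFERENCE_MS + PREPROCESS_MS   # 36.8 ms, over budget
max_fps = 1000.0 / total_ms               # ~27 fps
print(f"{total_ms:.1f} ms/frame -> {max_fps:.1f} fps")
```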
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    20                                           @profile
    21                                           def main():
    22         1          0.0      0.0      0.0      camera = CSICamera(
    23         1          0.0      0.0      0.0          width=640,
    24         1          0.0      0.0      0.0          height=360,
    25         1          0.0      0.0      0.0          capture_width=1280,
    26         1          0.0      0.0      0.0          capture_height=720,
    27         1        925.9    925.9      2.7          capture_fps=30,
    28                                               )
    29
    30         1          0.1      0.1      0.0      print("go")
    31      1000          2.0      0.0      0.0      for i in range(1000):
    32      1000      17475.2     17.5     51.0          image = camera.read()
    33      1000      15845.7     15.8     46.3          image = preprocess(image)
    34
    35         1          0.1      0.1      0.0      print("done")

Total time: 15.773 s
File: bench_preprocessing.py
Function: preprocess at line 43

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    43                                           @profile
    44                                           def preprocess(image):
    45      1000         17.2      0.0      0.1      image = image[CROP_TOP:CROP_TOP+CROP_H, CROP_LEFT:CROP_LEFT+CROP_W]
    46      1000       8198.5      8.2     52.0      image = PIL.Image.fromarray(image)
    47      1000       2873.7      2.9     18.2      image = transforms.functional.resize(image, (224, 224))
    48      1000       4352.4      4.4     27.6      image = transforms.functional.to_tensor(image).cuda().half()
    49      1000        306.9      0.3      1.9      image.sub_(mean[:, None, None]).div_(std[:, None, None])
    50      1000         24.4      0.0      0.2      return image[None, ...]
```
Converting from np.array to PIL and back to a tensor seems wasteful? Is it doing the BGR-to-RGB conversion? Let's save the PIL image and see.
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    20                                           @profile
    21                                           def main():
    22         1          0.0      0.0      0.0      camera = CSICamera(
    23         1          0.0      0.0      0.0          width=640,
    24         1          0.0      0.0      0.0          height=360,
    25         1          0.0      0.0      0.0          capture_width=1280,
    26         1          0.0      0.0      0.0          capture_height=720,
    27         1        877.6    877.6      2.6          capture_fps=30,
    28                                               )
    29
    30         1          0.1      0.1      0.0      print("go")
    31      1000          2.7      0.0      0.0      for i in range(1000):
    32      1000      22356.4     22.4     65.4          image = camera.read()
    33      1000      10963.5     11.0     32.1          image = preprocess(image)
    34
    35         1          0.1      0.1      0.0      print("done")

Total time: 10.8717 s
File: bench_preprocessing.py
Function: preprocess at line 54

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    54                                           @profile
    55                                           def preprocess(image):
    56                                               # Crop
    57      1000         18.2      0.0      0.2      image = image[CROP_TOP:CROP_TOP+CROP_H, CROP_LEFT:CROP_LEFT+CROP_W]
    58
    59                                               # Convert BGR to RGB
    60      1000        252.1      0.3      2.3      image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    61
    62                                               # Convert to tensor
    63      1000         35.0      0.0      0.3      image = torch.from_numpy(image)
    64      1000       3019.9      3.0     27.8      image = transforms.functional.convert_image_dtype(image, torch.float32)
    65
    66                                               # Move channel dimension to the beginning
    67      1000         55.0      0.1      0.5      image = image.permute(2, 0, 1)
    68
    69      1000       5071.4      5.1     46.6      image = transforms.functional.resize(image, (224, 224))
    70      1000       2037.1      2.0     18.7      image = image.cuda().half()
    71
    72      1000        353.8      0.4      3.3      image.sub_(mean[:, None, None]).div_(std[:, None, None])
    73
    74      1000         29.2      0.0      0.3      return image[None, ...]
```
OK, so it seems like I can crop and resize using nvvidconv. It was already resizing the camera stream, so we were in fact resizing twice.
```python
return (f'nvarguscamerasrc sensor-id={self.capture_device} ! video/x-raw(memory:NVMM), width={self.capture_width}, height={self.capture_height}, '
        f'format=(string)NV12, framerate=(fraction){self.capture_fps}/1'
        f' ! nvvidconv top={self.CROP_TOP} bottom={self.CROP_BOTTOM} left={self.CROP_LEFT} right={self.CROP_RIGHT} ! video/x-raw, width=(int){self.width}, height=(int){self.height}, '
        'format=(string)BGRx ! videoconvert ! appsink')
```
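As a standalone mirror of that method (parameter names and defaults are my assumptions), the pipeline string can be assembled and inspected without a camera attached:

```python
def gst_pipeline(capture_device=0, capture_width=1280, capture_height=720,
                 capture_fps=30, crop_top=0, crop_bottom=720, crop_left=0,
                 crop_right=1280, width=224, height=224) -> str:
    """Assemble the nvarguscamerasrc -> nvvidconv -> appsink pipeline string.

    nvvidconv does the crop (top/bottom/left/right) and the scale to
    width x height in one hardware-accelerated step.
    """
    return (f'nvarguscamerasrc sensor-id={capture_device} ! '
            f'video/x-raw(memory:NVMM), width={capture_width}, height={capture_height}, '
            f'format=(string)NV12, framerate=(fraction){capture_fps}/1'
            f' ! nvvidconv top={crop_top} bottom={crop_bottom} left={crop_left} right={crop_right}'
            f' ! video/x-raw, width=(int){width}, height=(int){height}, '
            'format=(string)BGRx ! videoconvert ! appsink')
```

The resulting string would then be handed to `cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)` on the Jetson.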
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    20                                           @profile
    21                                           def main():
    22         1          0.0      0.0      0.0      camera = CSICamera(
    23         1          0.0      0.0      0.0          width=224,
    24         1          0.0      0.0      0.0          height=224,
    25         1          0.0      0.0      0.0          capture_width=1280,
    26         1          0.0      0.0      0.0          capture_height=720,
    27         1        869.5    869.5      2.5          capture_fps=30,
    28                                               )
    29
    30         1          0.2      0.2      0.0      print("go")
    31      1000          2.2      0.0      0.0      for i in range(1000):
    32      1000      27819.8     27.8     81.4          image = camera.read()
    33      1000       5492.3      5.5     16.1          image = preprocess(image)
    34
    35         1          0.1      0.1      0.0      print("done")

Total time: 5.40918 s
File: bench_preprocessing.py
Function: preprocess at line 52

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    52                                           @profile
    53                                           def preprocess(image):
    54                                               # Crop
    55                                               # image = image[CROP_TOP:CROP_TOP+CROP_H, CROP_LEFT:CROP_LEFT+CROP_W]
    56
    57                                               # Convert BGR to RGB
    58      1000        120.4      0.1      2.2      image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    59
    60                                               # Convert to tensor
    61      1000         37.6      0.0      0.7      image = torch.from_numpy(image)
    62      1000       2770.1      2.8     51.2      image = transforms.functional.convert_image_dtype(image, torch.float32)
    63
    64                                               # Move channel dimension to the beginning
    65      1000         52.2      0.1      1.0      image = image.permute(2, 0, 1)
    66
    67                                               # image = transforms.functional.resize(image, (224, 224))
    68      1000       2071.5      2.1     38.3      image = image.cuda().half()
    69
    70      1000        327.8      0.3      6.1      image.sub_(mean[:, None, None]).div_(std[:, None, None])
    71
    72      1000         29.6      0.0      0.5      return image[None, ...]
```
So now 5.5 ms (pre-processing) + 21 ms (inference) = 26.5 ms per frame, i.e. ~37 Hz, which should be enough for what we do. We could maybe shave a few more ms out of the pre-processing by using dusty-nv/jetson-utils, or use a smaller network, e.g. SqueezeNet, but in sim 30 Hz is enough to control the car, so this should be good. I still need to profile adding a GRU on top of the ResNet, but I expect it will stay doable at 30 Hz; if not, SqueezeNet it is.
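Before profiling the GRU, the budget already tells us how slow it is allowed to be. Using the measured numbers from above:

```python
# Headroom left in the 30 Hz budget for adding a GRU on top of the ResNet.
BUDGET_MS = 1000.0 / 30   # 33.3 ms per frame at 30 Hz
PREPROCESS_MS = 5.5       # measured above, after moving crop/resize to nvvidconv
RESNET_MS = 21.0          # inference time measured earlier

headroom_ms = BUDGET_MS - (PREPROCESS_MS + RESNET_MS)   # ~6.8 ms
print(f"GRU must run in under {headroom_ms:.1f} ms to stay at 30 Hz")
```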
pre-commit doesn't run on the Jetson Nano; I don't know why, maybe because of Python 3.6. I need to find another way to lint, sort imports, etc.
The current embedded code is a single spaghetti Python script. It served us well, and it's time to let it go.
The ideal system has a few components:
Now, I have no idea how to write this. A couple of ideas:
Exploratory tasks: