Improve Deployed Model's Autonomous Driving Behavior

@zfhall You've probably noticed that I haven't updated this repo in a little while. That's mainly because I've hit a wall on how to improve the car's disengagement event frequency (when I have to intervene because the model has made a poor driving decision). In particular, the car does poorly on sharp turns and doesn't know how to correct itself once it's gotten in a bad situation (it only sees the current frame, so it has very limited context). You've gotten far enough in the project that I figured you might have some good ideas.

Some of my thoughts:

Camera Positioning: Did I poorly position my camera? How did you position yours? Does your car handle sharp turns well? Other toy self-driving car projects position the camera higher up so that the camera has more of a birds-eye-view, looking immediately down on the front hood of the car and the road that the car is already on top of. For aesthetic purposes I kept my camera angle low, but that means the car only sees parts of the road that are more than about a foot away. This means that on sharp turns where the car is on the outside of the track, the camera is mostly looking at background and not road even when all wheels are still on the road.
Localization / Mapping: This is how real self-driving cars work and therefore has obvious appeal. Google's cars have detailed maps of their surroundings and know precisely where the car is within that map. If the car makes a mistake and winds up way off on the side of the road, the localization system essentially tells the car the corrections to get back on track. It basically involves making a map of the track in real-time. The downside is that I basically have no idea how to do localization (at least not yet). It's not really machine learning. It's more of a robotics concept. I would also have to install an accelerometer on the Pi. On the other hand, it's used in autonomous drones, which is pretty cool.
Multiple Cameras: Nvida's paper talks about using three cameras to cope with the "recovering from mistakes" issue. Quoting from their paper:

Training with data from only the human driver is not sufficient. The network must learn how to recover from mistakes. Otherwise the car will slowly drift off the road. The training data is therefore augmented with additional images that show the car in different shifts from the center of the lane and rotations from the direction of the road. Images for two specific off-center shifts can be obtained from the left and the right camera. Additional shifts between the cameras and all rotations are simulated by viewpoint transformation of the image from the nearest camera.

Better Models: I've thought of training an CNN-RNN. This is kind of a long shot, but the idea is that since my car can only see far ahead of it, to know where it is in the track now would require looking back a few frames. RNNs are good at looking back at previous points in time. I already use CNNs. CNN-RNNs have been used for Google image captioning and at other places for OCR.
Steering Angle Instead of Arrow Keys: My car is small and simple and so I resorted to arrow keys. I might have gotten better performance if I had used steering angle. This would involve a major code re-write. On the other hand, I've seen a lot of other toy cars use arrow-key steering just fine.

@RyanZotti great post! I'll reply properly when I have the time to give a decent reply as there is lots to talk about regarding this. So please know I haven't missed your post.

Cheers,

Zak

@RyanZotti In regards to “disengagement event frequency" I seem to have a similar problem. I just track tested a slightly modified version of your deep conv net with batch norm (98.4% accuracy from 40 Epochs) and had surprisingly mediocre results. My car performs relatively well on the turns (full U-turns) sometimes taking the turns early sometimes, late. It does better on left and turns than right, which I think is down to a poor/bias dataset. Surprisingly, I would say the car performs worse on the straights which again I put down to a poor dataset. For example the amount of time I would actually be only pressing “up” on the straights is minimal due, to me correcting the cars position left or right. This does however mean it is sometimes quite good at correcting itself when it has gone over the edges during straights. I hope I’m making sense here, it’s hard to describe the track performance in words. Perhaps if I get the chance I will upload a video to YouTube.

Something else I would like to mention is that I deliberately started the car off of the track in some training situations. This coupled with me continuing to drive the car round a corner even if I over/understeered means that it does have a limited ability to “correct” itself when veering off track whilst cornering… sometimes.

The majority of these issues will likely be somewhat improved by collecting more training data. My dataset is slightly smaller than I previously stated at only 70k frames rather than 100k. I will be collecting more varied data soon with the aim of doubling my dataset size.

In summary I believe that if one is aiming to purely use machine vision to control the car round the track, it’s important to carefully consider driving style and scenario variety while collecting data (I know I’m stating the obvious here).

Please tell me if you disagree or can expand on any of the above.

In regards to some of your ideas…

Camera Positioning: perhaps your camera angle could do with some improvement if you aren’t capturing the immediate road in front. I tried to mount mine high up, with a downwards angled view that is aligned to have the front of the car just out of the shot (see image below). the car probably sees the track that's 10-20cm in front of the car. In hindsight I would have perhaps mounted it even higher and used some kind of wide angle lens to get more data in the frame.

This brings me to camera type: I have heard of people using fish eye lenses for their scaled down self-driving cars which sound like a good idea on paper. I know that some self-driving cars use a combination of visible light, stereo vision, monocular, and infrared cameras. The stereovision and monocular cameras are used for visual odometry to calculate the orientation for localisation.

Localization / Mapping: This is what I would like to eventually implement too as it seems the most promising! Combining an accelerometer, stereo vison camera, LIDAR (not yet as its too expensive an bulky), and GPS (for outdoor use) could produce a robust localisation system. However I know next to nothing about this too so I can’t add much here. Saying that, I knew nothing about machine learning or python 6 months ago so I’m going to give it a go at some point. The GPS idea really interests me because if one were to couple this with ultrasonic sensors for collision avoidance (and all of the above if possible), the car could be given a specific destination to travel to. Like one side of a park to another for example. Again, this isn’t a machine learning problem right?

Multiple Cameras: I’m trying to rap my head around how this works. What is assigned to the additional “shifted” images? The same steering angle as the centre image? I need to spend more time looking into this!

Steering Angle Instead of Arrow Keys: I originally wanted to do this with my car however after looking into it I thought it would be way too time consuming for the first go at this project. I guess one would also need to use some kind of joystick/games console controller to collect data with. I will try this at some point though, because it would probably give much higher levels of driving precision… latency allowing. Also would having more outputs slow the speed of the Neural Net?

Better Models (CNN-RNN): I think you should give the CNN-RNN a go as it sounds like an interesting possible solution to your problem. And because machine learning is one of your expertise, why not give this a go before anything else? Especially seeing as some of the other options require gathering new data!

I have a few of ideas for improvement in regards to your current Deep-CNN with batch norm’, and would like to hear your opinion. They are as follows:

Decaying learning rate – I know that some people regard this as pointless when using ReLu however I have seen others gain minor improvements when using it. Specifically @martin-gorner on his mnist CNN.
LeakyRelu or Maxout to prevent dead neurons
Applying dropout to the fully connected layers. Although it doesn’t appear that the model is massively overfitting based on accuracy, so maybe this is overkill.
Go even deeper… again probably overkill and would take even longer to train and process frames.

Using sound: I had a thought that was inspired by real self-driving cars that was to use sound to detect road surface conditions. For example, if you have a track surface that was either rougher or smoother than the surrounding surface you could easily detect when the car is on or off the track. With two microphones on each side of the car, possibly facing down on the chassis/floor, it would be possible to tell when only one wheel has left the track boundary. This could be an input for the NN to learn from. I’m not sure what kind of microphone would be best suited for this job though, it’s likely not something you could just buy off the shelf.

I know this was a long post and I rambled a lot so thanks for reading this far. I hope it’s all clear and apologies if there are some big gaps in my knowledge, I haven’t been in this game for that long!

@zfhall Have you tried flipping the images about a vertical axis? That could solve your bias towards a particular side. I flip all my images (even the straight ones, which still wind up with a command of straight), so it effectively doubles my dataset and adds some diversity.

For better or worse it's good to hear that you're facing similar issues. I've never really had someone I could compare notes with, so this is great for me.

My roads are so narrow, and my deep model has such bad latency (because of its complexity) that all it really takes is one bad turn for it to get in a really tough spot. It's smart of you to use the paper to mark the border of the road rather than use the paper to act as the road itself. I live in a tiny apartment, so I'm highly space constrained and don't have that option unfortunately. I have to use the paper as the road. I do think it's a good idea to start the car in an off-track position so that it learns to correct. I've tried that quite a bit. My latest runs were on tracks of nearly only sharp turns so that essentially the entire circular dataset is the car trying to continuously correct.

I'm surprised that you're still seeing some erratic behavior despite your superb accuracy of 98.4%. I've only hit that level for training accuracy, not validation. It seems your model's validation accuracy no longer reflective of deployment accuracy. That sounds like a classic overfitting problem.

Speaking of overfitting, there are many images that, when I freeze them out of context after the fact, I can't tell whether to go a particular direction or go straight even as a human. There are lots of "in-between" cases of, for example, a soft left that the model has to arbitrarily assign to hard left or dead-on-straight because of my use of arrow keys. There are enough nuanced images in my data sets that I'm skeptical any model could realistically get above, say, 70-80% (deployment) accuracy on discrete labels like that. Not that there is an error/bug in the model code, just that we're seeing a massive overfitting problem that we're not able to properly diagnose because of data homogeneity.

It's probably also worth keeping in mind that real world cars need many, many hours of driving to reach exceptional accuracy. I've driven my car for about 2 hours, but in the grand schema of things that's pretty much nothing.

I do think I'm going to at least increase my camera angle. The wide lens sounds like a good idea too. The infrared cameras might be an interesting idea, though my car is so small and space-constrained that I probably have to choose between more sensors or cameras but not both. I'm a bit skeptical on the sound part. Paper on the floor is going to have such a low profile that it's probably indistinguishable from noise. You have some good model tuning ideas, and I'll try them out. If overfitting is the real issue though, going deeper / more layers (for example) might just make the problem worse. Dropout could help with overfitting a bit, though I think I've tried it in the past and didn't see noticeable results. I'm really concerned the CNN-RNN will just overfit.

The more I think about it the more I like the Nvidia idea of multiple cameras (I could use several tiny USB pinhole / spy cameras) and turn radius as a target variable. Anyways, Nvidia explicitly called out drifting as an issue in their paper, which seems to closely fit the problem I have, and nearly all real-world car implementations (even the ones that do localization) use turn radius and multiple forward-facing cameras. It's just unfortunate that I would have to do a major code refactoring. I do at least have the car's toy joystick that I could hardwire.

Now that I think about it, if the model is strongly overfitting like I suspect, it should in theory be able to at least overfit to the same exact track it was trained on. My dataset is pretty diverse (I've used various obstacles, furniture, etc). Have you tried driving it around, immediately training on (only) the data you just created, and then deploying right after to the same track? I might try that this weekend just as a means of diagnosis.

@RyanZotti Yes my images are also flipped which makes the bias even more strange. However come to think of it I remember the car struggling most with one specific right hand corner, rather than having an overall left hand bias.

To give some background, my entire dataset was collected on the same oval track. This same track is the one that I deploy the model on. So I think that answers the question in your second comment. The only changes that would occur between the test data and the track used for deployment would be lighting (although I collected data in a range of lighting conditions), background objects, and perhaps the most significant, the exact position of the track on the floor. I taped together the pieces of paper so that the shape of my track would not change, but the position may slightly differ every time I lay the track on the floor. This is significant because I have wooden flooring that has patterns that the model likely learned. Hence maybe over fitting is still an issue here. I never immediately deployed on the exact same track position as the test data, but have on the exact track shape. I'd like to here your results from doing this so please do let me know.

During autonomous driving do you use GPU or CPU to run the model? I have noticed latency reduction when using CPU but this is only through observation. How do you measure the models latency?

I too am surprised at the poor deployment accuracy that i'm getting in comparison the the validation figure. But I remain sceptical about the extremely high validation accuracy that i'm getting. I really do think my dataset is too small and thus the model is heavily over fitted even though the deployment track is so similar. I think i'm going to collect more data with the track placed in a different room where the wood floor grain pattern runs in a perpendicular direction to previously.

Speaking of overfitting, there are many images that, when I freeze them out of context after the fact, I can't tell whether to go a particular direction or go straight even as a human. There are lots of "in-between" cases of, for example, a soft left that the model has to arbitrarily assign to hard left or dead-on-straight because of my use of arrow keys. There are enough nuanced images in my data sets that I'm skeptical any model could realistically get above, say, 70-80% (deployment) accuracy on discrete labels like that. Not that there is an error/bug in the model code, just that we're seeing a massive overfitting problem that we're not able to properly diagnose because of data homogeneity.

Its very interesting what you say about the human aspect and data homogeneity. Perhaps steering angle labels would solve this problem. Maybe we will find out some point later down the line when one of us implements it. Do you think that theoretically with enough data, the right model and near negligible model and network latency, that this problem could also be solved? Because if we as humans are able to control the car round the track with a simple hard right, hard left, and straight then the right model can too right? In other words, if the model can process say 100 frames a second and act on them, maybe the limited outputs wouldn't matter as much? I know that in our case this isnt possible due to hardware limitations but its just a thought.

Tweaking the camera angle and lens sounds like a good move for your issue. Esepcially seeing as you use the paper as the track rather than the edges.

I'm a bit skeptical on the sound part. Paper on the floor is going to have such a low profile that it's probably indistinguishable from noise.

Yes in our case this would not work without a modified track.

You have some good model tuning ideas, and I'll try them out. If overfitting is the real issue though, going deeper / more layers (for example) might just make the problem worse. Dropout could help with overfitting a bit, though I think I've tried it in the past and didn't see noticeable results. I'm really concerned the CNN-RNN will just overfit.

I'll give the LeakyRelu and dropout a go just out of interest as I've never used them in a model before. But your right, I doubt any of these changes will make a significant difference to deployment performance.

The more I think about it the more I like the Nvidia idea of multiple cameras (I could use several tiny USB pinhole / spy cameras) and turn radius as a target variable. Anyways, Nvidia explicitly called out drifting as an issue in their paper, which seems to closely fit the problem I have, and nearly all real-world car implementations (even the ones that do localization) use turn radius and multiple forward-facing cameras. It's just unfortunate that I would have to do a major code refactoring. I do at least have the car's toy joystick that I could hardwire.

This does seem very promising though. I think it would be worth it in the end, even if it doesn't improve the performance much at least you could then rule it out.

Its great to hear your thoughts on all this an have someone to share ideas with. This project you have created is really something! The amount of time it takes to do something like this from the ground up is no joke. So cheers!

@zfhall I use a CPU during deployment since I have a Mac, which doesn't have an Nvidia (and thus Tensorflow-supported) GPU. I do use AWS GPUs during training, and while there is no doubt I'd increase network latency with the extra hop to AWS if I also used GPUs in deployment, the reduction in model/Tensorflow execution time (the obvious bottleneck at the moment) would probably result in a noticeable overall speed-up. I'll definitely try that. I use AWS spot instances, so I get like 80-90% cost reductions, so it's not price-prohibitive.

It's a temporary part of my code, so probably not in the repo, but I'll often print out timestamps after each prediction. The print usually comes right after the feed_dict part of the model code. I just track the difference between subsequent end-to-end time between prediction results. Nothing fancy.

Interesting comment about possibly overfitting to the wood on the floor. The original reason I had looked into Tensorboard 1-2 years ago was to do deconvolution and essentially visualize what the model had learned. You can use deconvolution to find the features that are most activated by certain images. In theory, you could use this to detect if those features (in your case) are the grains of wood on your floor. The deconvolution didn't quite pan out in Tensorboard as I'd hoped, but you might get better results. Deconvolution has gotten really big in deep learning, so if you aren't familiar with it you should look into it (even if only in terms of knowing what it's capable of).

I do think the model could perform better with more data. By the way, I did some quick research on similar projects (don't know why I didn't do this earlier), and I came across some ideas that might work. See below.

MariFlow . This guy essentially trained a recurrent neural network how to play Mario Kart with TensorFlow. I guess he had already tried the CNN+RNN idea and it seemed to work pretty well. He also switched between autonomous/human driving modes post-deployment so that he could explicitly teach his car how to get out of its most common mistakes (because the mistakes would lead to situations far from what the model already knew). I'll probably try to implement the interactive switching. Also, it sounds like this guy is using discrete target variables (console keys which look basically like arrow keys), so maybe I don't have to switch to steering angle after all. Also, he had 15 hours of data, so we need a lot more than what we already have.
Online-learning . This is a really fascinating project. I'm absolutely stunned that he was able to teach the model on the fly using online learning and on the Pi, no less. It takes me hundreds of thousands of records (many epochs and many frames per epoch) on a GPU to get decent performance. It's impressive that he is able to see progress after just a few hours of driving / training. Also interesting is that he focuses on the interactive part of data collection / autonomous deployment. Like MariFlow, the interactive data gathering sessions seem to be pretty key.
DIYRobocars . This is basically a group of people that make self-driving toy cars that meets monthly to race their cars. I can't believe I hadn't heard of this group -- it would have made my life so much easier if I had talked to them months / years ago. They're based in San Francisco / Oakland (where I live) and their projects are way ahead of mine. I plan on going to their event next week and just asking people for advice. This seems like a solved problem. There is definitely a solution(s) to the issue that we're facing. I'll keep you updated.

@RyanZotti apologies for the (very) late reply, my workload has been intense recently! Thanks for the tips on measuring latency and using deconvolution! I think I will definitely try this out when I get the time. Also thanks for the info on these other projects. I actually came across both the "Online-learning" and "DIYRobocars" projects before, but it was when I knew close to nothing about machine learning so it was all very abstract to me. Did you manage to go to the DIYRobocars event? If so i would love to hear your experience/take-aways.

I discovered that part of my poor deployment performance was due to a bad dataset. It turns out i didn't conduct enough due diligence in regard to checking dataset quality. The problem I think I had is that when I was collecting data, I would sometimes transition between arrow keys very quickly. This meant that at points there was more than one arrow key pressed at a given time. The problem that arises from this was that the code would latch on to whatever the previous key was, and didn't recognise the key change (or so i think)! So there were lots of frames that should have been assigned a left or right, but instead were assigned an up. Workaround = drive with one finger :D! I now have a new and much cleaner dataset! I'll let you know how deployment is effected.

Have you experienced this at all?

@zfhall No problem, my workload has also been pretty heavy lately. I went to the DIYRobocars event, and it was awesome! I plan to go to all of their future events. There were about 10-15 independently implemented self-driving toy cars and we did time trials and racing in a large industrial warehouse. Unfortunately my power supply died on me as soon as I got to the event and so I couldn't even get my car to boot up, but I learned a lot of useful tips for making my car better after talking to other racers.

In particular, I found an implementation in DonkeyCar that makes it easy to do steering via a virtual joystick on your phone so that I can frame the target variable as continuous (magnitude of steering sharpness) rather than discrete (left, right, up arrow keys). If you scroll about half way down the page in the link above you can see a screenshot. I believe it uses HTLM5 (that's what the creator told me), so it doesn't require any understanding of iOS or Android code -- you just open up the web page served from a Tornado server running on your car, hit the green button on the web app to start recording, and then move your finger around in the light blue box to tell the car how fast to go or how tight to make a turn. I've already pulled the code that does that out of the DonkeyCar project and I plan to put it in my repo. I tested the front-end and it works like a charm. I just need to do a bit more integration with the rest of my repo before I publish the code. I'm also going to have to do a bunch of refactoring of some of my code that assumes discrete target variables. The other racers mentioned that using a continuous (rather than discrete) target variable made a big difference in terms of overall accuracy.

To answer your second question -- I haven't had that issue in particular, but it's something I've anticipated and I've always checked for it by running:

python play_numpy_dataset.py --datapath /Users/ryanzotti/Documents/repos/Self_Driving_RC_Car/data/144/predictors_and_targets.npz --show_arrow_keys y

after gathering a new set of data to make sure the recorded labels match up with what I intended. I also take my finger off the keys in between commands (though I do this very quickly).

@RyanZotti Glad to hear you enjoyed the event! Sorry to hear about your power supply issue though. That must have been frustrating.

Thanks for the "virtual joystick" link! I'll definitely try to implement this at some point. Using continuous target variables sounds very promising. I'm guessing this would drastically reduce the homogeneity in our datasets, in turn reducing overfitting? It would probably make the validation accuracy more representative of real-world accuracy. I'm also very interested in how this would effect convergence times.

I have trained my network on the new datasets and wow, what a difference that made to the accuracy! I am getting 100% on both training and validation accuracy. Bear in mind, however, that my dataset is not at all varied (collected on only one oval track/few corrective maneuvers), so this is probably a false positive. I'm hoping to deploy tomorrow so we'll see how it does.

I thought you may also be interested to hear that I implemented a version of your train_deep_convnet_batch_norm.py using five conv layers instead of seven, and one hidden FC layer instead of four, with good results. I noticed that the five conv layer network converged quicker (after 23 epochs compared to 33).

I have a question regarding your network mentioned above. What is the theory behind using a batch normalized conv layer with pooling, followed by a batch normalized cov layer without pooling? If you have any links to papers on this matter, please do share.

Edit: Also, why is the layer without pooling linear rather than nonlinear?

It works!! I Just deployed a convnet with batch normalization, using only two convolutional + pooling layers, trained on my "clean" dataset. At a slow enough speed, the car manages to complete several circuits of the track without disengaging. I never thought I'd get there! Thanks, @RyanZotti !

@zfhall Congrats!! That’s great to hear! It’s extremely satisfying when it finally works.

I’m still refactoring my code to use continuous targets and I’m also refactoring some of my data-related classes. I’ll try your smaller but better network architecture after my refactoring is done.

@RyanZotti Thanks mate! Ah yes, it would be interesting to see if you get similar results to me. I need to update my fork of your repo when I get the chance, then you can see what I have done.

Keep up the good work man, I can see you have been bust updating things 👍

RyanZotti / Self-Driving-Car

Improve Deployed Model's Autonomous Driving Behavior #116