commaai / comma10k

10k crowdsourced images for training segnets
MIT License

An attempt to achieve similar performance in lesser parameters for segmentation #2988

Closed neel04 closed 1 year ago

neel04 commented 2 years ago

Hey, thanks to the wonderful work done by @YassineYousfi, I was able to get a simple incremental improvement to achieve similar validation metrics with nearly half as many parameters. Base repo: https://github.com/neel04/PrEDAtor-200

For re-creating Yassine's baselines, I had to modify the code for API changes (and general Colab shenanigans 😛). This is the commit that was used to re-train those experiments at 256x256 image size.

Because my experiments are a mess of Colab and Kaggle notebooks and various scripts, I forked Yassine's repo and added some simple changes to obtain the improvements and keep things neat & tidy. Here's the fork.

While it may be infeasible for Comma to trade off its limited compute for memory, I believe this still stands as an interesting experiment in training more parameter- and hardware-friendly models for deployment and reducing computational costs.

I'm a high schooler and would love to intern at Comma. There's another self-driving related project I wanted to pursue, however, I need some advice for that. Would you be willing to have a quick chat sometime @geohot so I can flesh out my proposal properly?

Many thanks ❤️ Neel

YassineYousfi commented 1 year ago

This looks good. A few comments:

Closing.

neel04 commented 1 year ago

Hey, thanks for the pointers. I would definitely agree that measuring inference throughput would have been a much more accurate measure than raw parameter count alone - but I suppose the rationale was to save memory too, and simple filters require very little of it :)
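
For what it's worth, here's roughly how one could compare the two - a minimal sketch, not code from either repo, with the 256x256 input shape simply mirroring the resolution used in the retrained baselines:

```python
# Minimal sketch: report measured inference throughput alongside raw parameter count.
import time
import torch
import torch.nn as nn


def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


@torch.no_grad()
def images_per_second(model: nn.Module, shape=(1, 3, 256, 256), iters=50, device="cpu"):
    model = model.to(device).eval()
    x = torch.randn(*shape, device=device)
    for _ in range(5):                      # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters * shape[0] / (time.perf_counter() - start)


# Stand-in model; swap in the baseline and the smaller variant to compare.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1))
print(f"{count_params(model)} params, {images_per_second(model):.1f} img/s")
```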

I did train the VQ-VAE and linked a cherry-picked sample in my fork; the reconstructions there are remarkably good, but the small dataset size may create problems for OOD frames.
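
For context, the core of a VQ-VAE is just a vector-quantization bottleneck between encoder and decoder. A bare-bones sketch is below - illustrative codebook size and embedding dim, not the values from my fork:

```python
# Minimal VQ bottleneck with a straight-through estimator (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1 / num_codes, 1 / num_codes)
        self.beta = beta

    def forward(self, z):                                   # z: (B, C, H, W) encoder output
        b, c, h, w = z.shape
        z_flat = z.permute(0, 2, 3, 1).reshape(-1, c)       # (B*H*W, C)
        dists = torch.cdist(z_flat, self.codebook.weight)   # L2 distance to every code
        idx = dists.argmin(dim=1)
        z_q = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        # Codebook loss + commitment loss, straight-through gradient to the encoder.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, loss
```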

Currently, I'm comparing some other models with the SOTA on Comma2k19, and I'm already up by 7% (unnormalized) mean % error on the raw AP metrics (using the same held-out set), which is quite a large margin above OP-Supercombo (I believe around ~15%+7%?). https://www.veed.io/view/65d4e419-9507-4e4f-97d9-0544fdec171e is a cool visualization of the bare-bones ResNet baseline (I haven't got around to rendering visualizations of the better models yet) with a 'few' of my changes. Interestingly, e2e lat seems much easier for the models to grasp than long - at certain intersections, I've noticed the models predicting paths dangerously close to stationary cars.

The whole point of that project is to study how scaling models and data affects their ability to handle OOD scenarios, especially ones that OP fails heavily on - and as such it isn't exactly relevant to Comma, since the compute limits posed by the C3 prevent large models. Nevertheless, if I do obtain some interesting results I'll be sure to drop them on the Discord.

Best, N

YassineYousfi commented 1 year ago

Sounds good. You can post on #topic-custom-models if you've trained your own models, or continue the discussion there.

Some hints: have you tried using your model somewhere, e.g. GTA or CARLA? Do you think it can drive, given that it predicts the path so accurately?

neel04 commented 1 year ago

I bet it can muddle along GTA5 fine enough because it wouldn't really be OOD. I get where you're coming from - I'm doing semi-supervised learning here, but that would still introduce biases and priors that might cause problems for actual deployment, such as never facing noisy controls, handling weird light conditions (which requires extensive augmentation, something I see you've been working on - rough sketch below), or effectively learning to recover when off-centred from the lanelines.
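
By "extensive augmentation" I mean roughly the photometric kind sketched here - illustrative parameter values, not Yassine's actual augmentation pipeline:

```python
# Rough sketch of photometric augmentation for "weird light conditions".
import torchvision.transforms as T

light_augment = T.Compose([
    T.ColorJitter(brightness=0.5, contrast=0.4, saturation=0.3, hue=0.05),
    T.RandomAdjustSharpness(sharpness_factor=2, p=0.3),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
])
# augmented = light_augment(image)  # PIL image or (C, H, W) tensor
```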

However, this model wouldn't actually be deployed anytime soon due to the compute constraints I mentioned, and because of the above problems, which Comma has solved but is unwilling to share any part of its pipeline for. That said, models at scale, especially language models, are able to extrapolate somewhat and are significantly more robust than their smaller counterparts - an effect of the scale they operate at.

It makes much more sense to talk about this in the context of Bayesian priors guiding the model towards better hypotheses. Think of each parameter in the NN and its individual interactions as directing the flow of information through the network - that hypothesis space is extremely large, and we use our trusty SGD+backprop to guide us through it. However, initializing the model inadvertently introduces a randomly selected prior, which may account for some of the disparity in performance.

Larger models have a much larger hypothesis space to explore, and thus can locate more effective hypotheses given more data and compute. Counterintuitively, overparameterization also forces the model to adopt more and more general priors - you may have heard of the grokking phenomenon. If you can find such priors and embed them - analogously to how this paper does it, where they take parts of a trained model (specific heads they identify after training as associative and manipulative heads) and add them to a randomly initialized model - you recover most of the performance.
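
The transplant idea is roughly this - a loose sketch with hypothetical module names, not the procedure from that paper: copy only the trained parameters of the identified components into a freshly initialized model.

```python
# Loose sketch: transplant selected trained parameters into a fresh model.
import copy


def transplant(trained_model, fresh_model, param_name_prefixes):
    """Copy only the parameters whose names start with one of the given prefixes
    (e.g. specific attention heads/blocks) from a trained model into a randomly
    initialized one, leaving everything else at its random initialization."""
    fresh_model = copy.deepcopy(fresh_model)
    src = trained_model.state_dict()
    dst = fresh_model.state_dict()
    for name in dst:
        if any(name.startswith(p) for p in param_name_prefixes):
            dst[name] = src[name].clone()
    fresh_model.load_state_dict(dst)
    return fresh_model


# Hypothetical usage:
# warm_started = transplant(trained, fresh, ["transformer.h.3.attn", "transformer.h.7.attn"])
```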

To make my point quicker: such larger models would, in theory, learn basic manoeuvres like automatically re-centring on the lanelines even when slightly off. There was a recent paper by DeepMind which made quite a lot of news - you can explicitly train these LLMs to meta-learn, which would be quite exciting if it were applied to FSD at some point.

At the pitiable scale of Comma's current models, they're extremely saturated with the priors they can learn from the vast dataset. The problem is that none of those priors are general, because the model has NO incentive to learn general ones - it's fitting to mere statistical regularities. Hence it can't tell a roadside sign from an object, and that's where the crashes come from. e2e won't work at such a scale.

A good example is NLP - it's pure behavioural cloning. Even the smallest GPT at ~125M parameters can barely hold a conversation that resembles a slightly drunk human. You obtain impressive capabilities at the billion-parameter scale, not a handful of millions.

You could try throwing more data and more epochs at it, only to obtain diminishing returns - mere percentage points not even worth discussing, because you've saturated its limits. You could try exploring other conv-based archs, as I'm trying to do, but that again won't get you far.

Language models are promising - training them has become extremely data efficient, and you can leverage a vast array of techniques to prune and quantize a tiny model that would still outperform ConvNets. They also learn architectural priors, like positional invariance, despite it being a built-in property of convolutions - simply because positional invariance is a highly useful prior for their tasks. There must exist many such priors for FSD; we just haven't found them yet. But that's the point of DL. There's a lot to unlock here - and I know these models are insanely hard to deploy, easier to talk about than to work with. But potentially a pot of gold.
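
For concreteness, the pruning + quantization I'm alluding to can be done with stock PyTorch utilities - a hand-wavy sketch on a stand-in model, not a tuned recipe:

```python
# Sketch: magnitude pruning followed by post-training dynamic quantization.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 128))

# 1) Unstructured L1-magnitude pruning of 50% of each Linear layer's weights.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # bake the pruning mask into the weights

# 2) Dynamic quantization of the Linear layers to int8 for inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```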