[WIP 🚧] Object Detection Model

This is a WIP project. This README is its bible, and it will change over time.

Goal & Motivation

The goal of this repo is not to create the best developer-friendly object detection model.

Instead of focusing on a family of models of different sizes, I aim to train one mid-sized model (between 80M and 100M params) and (hopefully) derive quantized/smaller models from that one model using different techniques.

These are the key points:

The main goal is that it must be easy to use and deploy: no research spaghetti code.

SOTA means nothing

SOTA on common datasets (COCO) means nothing. Most of the datasets have wrong labels. What I will focus on is ensuring the model has competitive performance on the most used research datasets and on real-life ones.

The model has to be fast and easy to finetune

Current Limitations

Most of the current models have one or more of the following issues:

Plan of attack

The plan of attack is the following:

Papers & Resources

Exploring Plain Vision Transformer Backbones for Object Detection
This paper implements a very simple FPN, showing that you don't need hierarchical features with ViTs. It also introduces a couple of tricks to work with big images, such as window partitioning.
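
To make the window partition trick concrete, here is a minimal NumPy sketch. It is not the paper's code: the function names, the `(H, W, C)` layout and the toy sizes are assumptions for illustration, and it assumes `H` and `W` are divisible by the window size (pad beforehand otherwise).

```python
import numpy as np

def window_partition(x, window_size):
    """Split a (H, W, C) feature map into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size, window_size, C).
    Assumes H and W are divisible by window_size.
    """
    H, W, C = x.shape
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    # Move the two window-grid axes to the front, then flatten them together.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

def window_unpartition(windows, H, W):
    """Inverse of window_partition: stitch the windows back into (H, W, C)."""
    ws, C = windows.shape[1], windows.shape[3]
    x = windows.reshape(H // ws, W // ws, ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

# Toy 8x8 feature map with 2 channels, split into four 4x4 windows.
feat = np.arange(8 * 8 * 2, dtype=np.float32).reshape(8, 8, 2)
windows = window_partition(feat, window_size=4)
restored = window_unpartition(windows, 8, 8)
```

Attention is then computed inside each window independently, so the cost grows with the number of windows instead of quadratically with the full image size.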

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
It shows how you can remove positional embeddings by using a depth-wise 3x3 conv with zero padding in the MLP.
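
A small NumPy sketch of why zero padding can stand in for positional embeddings. This is an illustration, not SegFormer's implementation: the helper name, the all-ones kernel and the toy input are assumptions. On a constant input, a zero-padded depth-wise 3x3 conv produces different values at corners, edges and interior, so the output carries information about where each location sits.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Depth-wise 3x3 convolution with zero padding and stride 1.

    x: (C, H, W) feature map; kernels: (C, 3, 3), one filter per channel.
    """
    C, H, W = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))  # zero padding on H and W only
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            # Each channel is only ever multiplied by its own kernel.
            out += kernels[:, dy, dx, None, None] * padded[:, dy:dy + H, dx:dx + W]
    return out

# Constant input: any variation in the output comes purely from the padding.
x = np.ones((1, 4, 4), dtype=np.float32)
kernels = np.ones((1, 3, 3), dtype=np.float32)
y = depthwise_conv3x3(x, kernels)
# y[0, 0, 0] == 4 (corner), y[0, 0, 1] == 6 (edge), y[0, 1, 1] == 9 (interior)
```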

What Makes for End-to-End Object Detection?
This paper shows that adding the classification cost, on top of the usual location cost, to the one-to-one bbox assignment removes the need for NMS.
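
A rough sketch of a one-to-one assignment whose matching cost mixes a classification term with a location term. This is a simplification, not the paper's exact formulation: the cost terms (`1 - prob` and `1 - IoU`), the weights and the function names are all assumptions chosen to keep the example small.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def matching_cost(pred_box, pred_probs, gt_box, gt_label, w_cls=1.0, w_loc=1.0):
    """Simplified cost: low when the class score is high AND the box overlaps."""
    cls_cost = 1.0 - pred_probs[gt_label]
    loc_cost = 1.0 - iou(pred_box, gt_box)
    return w_cls * cls_cost + w_loc * loc_cost

def assign(pred_boxes, pred_probs, gt_box, gt_label, **weights):
    """Return the index of the single prediction assigned to this ground truth."""
    costs = [matching_cost(b, p, gt_box, gt_label, **weights)
             for b, p in zip(pred_boxes, pred_probs)]
    return min(range(len(costs)), key=costs.__getitem__)

gt_box, gt_label = (0, 0, 10, 10), 0
pred_boxes = [(0, 0, 10, 10), (0, 0, 10, 9)]   # two near-duplicate boxes
pred_probs = [[0.2, 0.8], [0.9, 0.1]]          # only the second is confident

location_only = assign(pred_boxes, pred_probs, gt_box, gt_label, w_cls=0.0)
with_cls = assign(pred_boxes, pred_probs, gt_box, gt_label)
```

With the location term alone, the positive sample is the tightest box even though its class score is low. With the classification term included, the positive is the prediction the classifier already favours, which pushes near-duplicates towards low scores during training.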

Learning Transferable Visual Models From Natural Language Supervision
The CLIP paper. Interestingly enough, everybody in research uses f*cking s*itty backbones, e.g. ones pretrained on ImageNet (IN). We will use CLIP's ViT.

LoRA: Low-Rank Adaptation of Large Language Models
We will freeze the backbone and train only the neck. For finetuning, we will test whether adjusting the neck's weights with LoRA is enough, and/or whether the backbone weights need LoRA adjustments too. The catch is that we will never fully train the backbone again.
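
For reference, a minimal NumPy sketch of a LoRA-style linear layer. It is not this repo's implementation: the class name, the rank `r=4`, `alpha=8` and the 768-wide layer are assumptions for illustration. The frozen weight `W` stays untouched, and only the low-rank factors `A` and `B` would be trained.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        out_features, in_features = weight.shape
        self.weight = weight                                  # frozen, never updated
        self.A = rng.normal(0.0, 0.01, size=(r, in_features))  # trainable
        self.B = np.zeros((out_features, r))                  # trainable, zero init
        self.scale = alpha / r

    def __call__(self, x):
        # Equivalent to x @ (W + scale * B @ A).T without forming the full update.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

W = np.random.default_rng(1).normal(size=(768, 768))  # stand-in for a frozen layer
layer = LoRALinear(W, r=4)
x = np.ones((2, 768))
out = layer(x)
```

Because `B` starts at zero, the layer initially behaves exactly like the frozen one. In this toy setup the trainable part is 4 * (768 + 768) = 6144 values against 589824 in `W`.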

