Visual-Behavior / detr-tensorflow

Tensorflow implementation of DETR : Object Detection with Transformers
MIT License
168 stars 53 forks source link

DETR : End-to-End Object Detection with Transformers (Tensorflow)

Tensorflow implementation of DETR : Object Detection with Transformers, including code for inference, training, and finetuning. DETR is a promising model that brings widely adopted transformers to vision models. We believe that models based on convolution and transformers will soon become the default choice for most practitioners because of the simplicity of the training procedure: NMS and anchors free! Therefore this repository is a step toward making this type of architecture widely available.

DETR paper: https://arxiv.org/pdf/2005.12872.pdf
Torch implementation: https://github.com/facebookresearch/detr

About this implementation: This repository includes codes to run an inference with the original model's weights (based on the PyTorch weights), to train the model from scratch (multi-GPU training support coming soon) as well as examples to finetune the model on your dataset. Unlike the PyTorch implementation, the training uses fixed image sizes and a standard Adam optimizer with gradient norm clipping.

Additionally, our logging system is based on https://www.wandb.com/ so that you can get a great visualization of your model performance!

Install

The code has been currently tested with Tensorflow 2.3.0 and python 3.7. The following dependencies are required.

wandb
matplotlib
numpy
pycocotools
scikit-image
imageio
pandas
pip install -r requirements.txt

Datasets

This repository currently supports three dataset formats: COCO, VOC, and Tensorflow Object detection csv. The easiest way to get started is to set up your dataset based on one of these formats. Along with the datasets, we provide a code example to finetune your model. Finally, we provide a jupyter notebook to help you understand how to load a dataset, set up a custom dataset, and finetune your model.

Tutorials

To get started with the repository you can check the following Jupyter notebooks:

As well as the logging board on wandb https://wandb.ai/thibault-neveu/detr-tensorflow-log and this report:

Evaluation

Run the following to evaluate the model using the pre-trained weights.

Checkout ✍ DETR Tensorflow - How to load a dataset.ipynb for more information about the supported dataset ans their usage.

python eval.py --data_dir /path/to/coco/dataset --img_dir val2017 --ann_file annotations/instances_val2017.json

Outputs:

       |  all  |  .50  |  .55  |  .60  |  .65  |  .70  |  .75  |  .80  |  .85  |  .90  |  .95  |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
   box | 36.53 | 55.38 | 53.13 | 50.46 | 47.11 | 43.07 | 38.11 | 32.10 | 25.01 | 16.20 |  4.77 |
  mask |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |  0.00 |
-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

The result is not the same as reported in the paper because the evaluation is run on the original image size and not on the larger images. The actual implementation resizes the image so that the shorter side is at least 800pixels and the longer side at most 1333.

Finetuning

To fine-tune the model on a new dataset we siply need to set the number of class to detect in our new dataset (nb_class). The method will remove the last layers that predict the box class&positions and add new layers to finetune.

# Load the pretrained model
detr = get_detr_model(config, include_top=False, nb_class=3, weights="detr", num_decoder_layers=6, num_encoder_layers=6)
detr.summary()

# Load your dataset
train_dt, class_names = load_tfcsv_dataset(config, config.batch_size, augmentation=True)

# Setup the optimziers and the trainable variables
optimzers = setup_optimizers(detr, config)

# Train the model
training.fit(detr, train_dt, optimzers, config, epoch_nb, class_names)

The following commands gives some examples to finetune the model on new datasets: (Pacal VOC) and (The Hard hat dataset), with a real batch_size of 8 and a virtual target_batch size (gradient aggregate) of 32. --log is used for logging the training into wandb.

python finetune_voc.py --data_dir /home/thibault/data/VOCdevkit/VOC2012 --img_dir JPEGImages --ann_dir Annotations --batch_size 8 --target_batch 32  --log

Checkout ✍ DETR Tensorflow - How to load a dataset.ipynb for more information about the supported dataset ans their usage.

python  finetune_hardhat.py --data_dir /home/thibault/data/hardhat --batch_size 8 --target_batch 32 --log

Training

(Multi GPU training comming soon)

python train_coco.py --data_dir /path/to/COCO --batch_size 8  --target_batch 32 --log

Inference

Here is an example of running an inference with the model on your webcam.

python webcam_inference.py 

Acknowledgement

The pretrained weights of this models are originaly provide from the Facebook repository https://github.com/facebookresearch/detr and made avaiable in tensorflow in this repository: https://github.com/Leonardo-Blanger/detr_tensorflow