facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0

Recommendations for training Detr on custom dataset? #9

Open lessw2020 opened 4 years ago

lessw2020 commented 4 years ago

Very impressed with the all-new, innovative architecture in DETR! Can you clarify your recommendations for training on a custom dataset? Should we build a model similar to the demo and train it from scratch, or is it better to fine-tune a full COCO-pretrained model and adjust the final linear layer to the desired class count? Thanks in advance for any input.

Zumbalamambo commented 4 years ago

+1

PancakeAwesome commented 4 years ago

agree

alcinos commented 4 years ago

Hello, Thanks for your interest in DETR. It depends on the size of your dataset. If you have enough data (say at least 10K images), training from scratch should work just fine. You'll need to prepare the data in the COCO format and then follow the instructions from the README. Note that if your dataset has a substantially different average number of objects per image than COCO, you might need to adjust the number of object queries (--num_queries). It should be strictly higher than the maximum number of objects you may have to detect, and it's good to have some slack (in COCO we use 100; the maximum number of objects in a COCO image is ~70).
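
For reference, a COCO-format annotation file is a single JSON file with three top-level lists. A minimal sketch of the expected structure (all values here are made up), written as a Python dict:

coco_dict = {
    "images": [
        {"id": 1, "file_name": "0001.jpg", "height": 800, "width": 1333},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100.0, 120.0, 50.0, 80.0],  # [x, y, width, height]
         "area": 4000.0, "iscrowd": 0},
    ],
    "categories": [
        {"id": 1, "name": "my_class"},
    ],
}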

Fine-tuning should work in theory, but at the moment it's not tested/supported. If you want to give it a go anyway, you just need to --resume from one of the checkpoints we provide. Feel free to report back any results you obtain :)

Best of luck

raviv commented 4 years ago

Hi,

When fine-tuning from the model zoo using my own dataset, how should I modify the number of classes? Loading the model fails (as expected) on:

RuntimeError: Error(s) in loading state_dict for DETR:
    size mismatch for class_embed.weight: copying a param with shape torch.Size([92, 256]) from checkpoint, the shape in current model is torch.Size([51, 256]).
    size mismatch for class_embed.bias: copying a param with shape torch.Size([92]) from checkpoint, the shape in current model is torch.Size([51]).

This is expected, as I have 50 labels and the checkpointed model has 91.

Thanks!

alcinos commented 4 years ago

If you just want to replace the classification head, you need to erase it before loading the state dict. One approach would be:

import torch

# Build a DETR with a 50-class head (51 outputs: DETR adds one extra
# "no object" class).
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50',
                       pretrained=False, num_classes=50)

# Fetch the COCO-pretrained weights and drop the mismatched head.
checkpoint = torch.hub.load_state_dict_from_url(
    url='https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth',
    map_location='cpu',
    check_hash=True)
del checkpoint["model"]["class_embed.weight"]
del checkpoint["model"]["class_embed.bias"]

# strict=False tolerates the now-missing class_embed keys, which keep
# their fresh random initialization.
model.load_state_dict(checkpoint["model"], strict=False)

Best of luck.

cbasavaraj commented 4 years ago

It would be easier (or at least more standard practice) to first load the pre-trained model, and then replace the classification head.
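
A minimal sketch of that alternative, assuming 50 custom classes (the +1 output is DETR's "no object" class):

import torch
from torch import nn

num_classes = 50
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=True)
# Swap in a freshly initialized classification head of the right size.
model.class_embed = nn.Linear(model.class_embed.in_features, num_classes + 1)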

lessw2020 commented 4 years ago

Related question: how should we reduce the query count for smaller datasets (continuing from the approach above)?
For example, I only have 5 classes to detect and each image will contain exactly 5 objects, so I was planning to run with num_queries = 12 instead of the default 100 (or should it be 5, if we know that's the max our images will ever have...?).

I'm looking at model.query_embed, which has shape (100, 256), and assume that is the right place to adjust, but I'm unclear. Is setting model.query_embed.num_embeddings = my_new_query_count enough? (Update - I'm working on this, and the DETR model stores a self.num_queries as well, though that is only referenced later for segmentation.
To be correct, it seems both model.num_queries and model.query_embed would need to be adjusted together; a sketch of what I mean is below...)
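
Roughly this, in code (untested sketch):

from torch import nn

# Setting num_embeddings alone would not resize the underlying weight
# tensor, so re-create the Embedding and keep model.num_queries in sync.
new_num_queries = 12
model.num_queries = new_num_queries
model.query_embed = nn.Embedding(new_num_queries, model.query_embed.embedding_dim)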

lessw2020 commented 4 years ago

Also, wouldn't we want to re-init the weights in class_embed to a normal or uniform distribution after wiping the checkpoint weights, to kick off the new training?

alcinos commented 4 years ago

If you're fine-tuning, I don't recommend changing the number of queries on the fly; it is extremely unlikely to work out of the box. In this case you're probably better off retraining from scratch (you can change the --num_queries arg in our training script).

As for the initialization of class_embed, the solution I posted above makes sure it is initialized as it should be.

Best of luck

lessw2020 commented 4 years ago

Hi @alcinos - excellent, thanks tremendously for the advice here, esp. on a Sat night.
I will try both fine-tuning for now (with the smaller dataset, not touching num_queries) and training from scratch once we have a larger dataset, and will update here to share results. Thanks again!

raviv commented 4 years ago

My dataset has images of various sizes. Do I need to resize them to a specific size?

lessw2020 commented 4 years ago

My dataset has images of various sizes. Do I need to resize them to a specific size?

I can't answer definitively, but if you look at the code in datasets/coco.py, you can see how they handled image resizing for COCO training. Basically they do random rescaling per the scales list, with the largest dimension capped at 1333:

scales = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]

if image_set == 'train':
    return T.Compose([
        T.RandomHorizontalFlip(),
        T.RandomSelect(
            T.RandomResize(scales, max_size=1333),
            T.Compose([
                T.RandomResize([400, 500, 600]),
                T.RandomSizeCrop(384, 600),
                T.RandomResize(scales, max_size=1333),
            ])
        ),
        normalize,
    ])

The colab example used a max size of 800, with half precision weights.

Thus if your images are all larger than 1333 in one dimension, they'll all be resized so the longest side is at most 1333, and then padded within a batch anyway.

Hopefully others can add more info here but hope this provides some starter info for you.

alcinos commented 4 years ago

My dataset has images of various sizes. Do I need to resize them to a specific size?

As was noted by @lessw2020, the images will be randomly resized in an appropriate range by our data-augmentation. The images will then be padded, so having different sizes is not an issue.

Thanks for the wonderful work. What is your recommendation for using DETR on single-object detection (e.g., scene text detection) datasets?

I'm not sure about the specifics of your dataset, but in general I'd say all the advice provided in this thread applies to the case where there is only one object class.

raviv commented 4 years ago

@alcinos, @lessw2020 It seems these resizes are data augmentation for training. As I'm using my own dataloader and augmentations, my question is: does the architecture (or implementation) expect images to have some maximum size? Thanks.

fmassa commented 4 years ago

@raviv no, the architecture doesn't expect a maximum size, but note that the Transformer encoder is quadratic in the number of pixels of the feature map, so if your image is very large (say larger than 2000 pixels), you might face memory issues.
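
A rough back-of-envelope, assuming the standard ResNet backbone stride of 32:

# Encoder self-attention cost grows with the square of the token count.
h, w = 800, 1333                 # a typical resized training image
tokens = (h // 32) * (w // 32)   # 25 * 41 = 1025 feature-map pixels
attn_entries = tokens ** 2       # ~1.05M attention weights per head, per layer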

raviv commented 4 years ago

This is what my losses look like so far. Would love to get others' input on their attempts to train DETR on custom datasets.

[training loss plots]

m-klasen commented 4 years ago

Hi, I'm currently working with my custom dataset. It's relatively small, with ~2k train and 400 validation images (32 video-sequence clips), and only 4 classes with a maximum of 6 instances per image. For my first training attempt I set num_queries=20 and discarded all transformer weights etc. I trained 400 epochs with apex fp16 at lr 1e-4, with an lr_drop to 1e-5 at epoch 200.

[training curves]

Evaluation at epoch 400 gives me a mAP of 0.45, which I can benchmark against a known-good MaskRCNN from my colleague, who achieves 0.63 mAP. My question now is: what are the primary reasons for the weaker performance?

  1. More training? Better LR adjustments, e.g. a decay schedule (hard to do on a first attempt when you're going in blind)?
  2. Reduce num_queries further?
  3. class/bg loss coef adjustment?
  4. ...?

fmassa commented 4 years ago

@mlk1337 thanks for sharing the results!

I think you are at a good starting point. I would say that, from the logs, you might want to change the eos_coef a bit and try different values. I think num_queries is fine, but the eos_coef probably needs to be adapted.
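
For context, eos_coef is the weight given to the "no object" class in the classification loss, since most queries match no ground-truth object. A runnable sketch of how such a weighting enters cross-entropy (illustrative values, not the exact DETR code):

import torch
import torch.nn.functional as F

num_classes, eos_coef = 4, 0.1

# Down-weight the background ("no object") class, which is the last logit.
empty_weight = torch.ones(num_classes + 1)
empty_weight[-1] = eos_coef

logits = torch.randn(2, 20, num_classes + 1)            # (batch, queries, classes)
targets = torch.randint(0, num_classes + 1, (2, 20))    # per-query class labels
loss_ce = F.cross_entropy(logits.transpose(1, 2), targets, empty_weight)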

I don't know if using apex with fp16 affects something or not as I haven't tried, but maybe @szagoruyko can comment on this?

@raviv your training logs are very weird; it seems the model stopped working at some point early in training. Are you using gradient clipping (it's on by default)?

raviv commented 4 years ago

@fmassa I'm running with the default args. To keep things simple, I'm using 1 class and disabled all augmentations. The behavior was similar when training multiple classes with augmentation enabled. To speed things up I'm using a subset of my dataset, with 8K train and 2K test images.

alcinos commented 4 years ago

@mlk1337 with such a small dataset, I'd recommend fine-tuning just the class head while starting from a pre-trained encoder/decoder. You'll have to keep the 100 queries if you do that, but unless you're after a very marginal speed improvement, it shouldn't hurt.
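
One way to set that up, assuming the pre-trained model was loaded as in the earlier snippet (a sketch, not the only option):

# Train only the classification head; keep the pre-trained backbone,
# encoder and decoder frozen.
for name, param in model.named_parameters():
    param.requires_grad = "class_embed" in name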

tanulsingh commented 4 years ago

Hey, I want to fine-tune DETR on custom datasets, but I am new to all this; I have only been using torchvision models to fine-tune on my datasets so far. I would be glad if someone could share demo code for fine-tuning. @alcinos

lessw2020 commented 4 years ago

@raviv - happy to share my training results, but can you post your plotting code for the graphs so I can use it? Right now I just have text output, as the detr plot_utils wasn't working (I wasn't sure whether to debug that or just move to TensorBoard; looking at that now).
@mlk1337 - same question: can you share your plotting code for the logs?

m-klasen commented 4 years ago

@tanulsingh I wrote quick gist on how you can modify DETR to finetune on your own coco-formatted dataset Link. Hope this helps.

m-klasen commented 4 years ago

@lessw2020 https://github.com/facebookresearch/detr/blob/5617b89475faa21d4010c81ee2533e34a06014b5/util/plot_utils.py#L20

Changing it to pd.DataFrame(pd.np.stack(df.test_coco_eval_bbox.dropna().values)[:, 1]).ewm(com=ewm_col).mean() worked for me (for bounding boxes).

raviv commented 4 years ago

@lessw2020 I'm using https://github.com/allegroai/trains/ to track training

lessw2020 commented 4 years ago

Thanks very much @raviv and @mlk1337 - here are my first two training runs. I used num_queries = 12 (6 classes) and trained from scratch.
I modified eos_coef from .1 to .01 to compare. As you can see, training loss looks great but validation is not doing so well. (One caveat: I can't hflip because this is medical imagery and flipping is a no-no, so I will be adding more augmentations, which made a big difference for EffDet and may alone explain the validation issue here.) I'm trying higher query counts now as a quick check, will go 10x higher on eos_coef, and will then also compare the fine-tuning-only option (with the default 100 queries) before kicking in augmentations. Anyway, at least for train loss, it's learning rapidly and easily:

[training curves: detr-first-runs]

raviv commented 4 years ago

@lessw2020 What does the dotted line represent?

lessw2020 commented 4 years ago

Here's fine-tuning vs. training from scratch - everything looks much better, relatively. (Not sure why test class error never changes though... need to review the loss criterion?)

@raviv - the dotted line represents test (validation) scores; solid is training scores.

[training curves: fine-tuning-detr]

lessw2020 commented 4 years ago

Related question - has anyone written visualization code for viewing sample output images with bboxes during training (i.e. with and/or without GT boxes in the same image)? Edit - actually, this code can be leveraged for part of the visuals: https://github.com/plotly/dash-detr/blob/master/model.py https://github.com/plotly/dash-detr/blob/master/app.py

lessw2020 commented 4 years ago

Lastly, here's fine-tuning with detr101-dc5 - at <30 epochs the curves look great. Still unclear what is happening with test_class_error and test_loss_ce (dotted lines = test):

[training curves: detr-101dc5-fine-tune]

raviv commented 4 years ago

@lessw2020 Re: visualizing sample output - most of the code is in the project's Colab notebook; you would just have to adapt it to show GT bboxes as well.
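
The core of it is something like this minimal sketch (hypothetical helper; assumes boxes are lists of [x1, y1, x2, y2] pixel coordinates and img is a PIL image or array):

import matplotlib.pyplot as plt

# Overlay predicted (red) and ground-truth (green) boxes on one image.
def plot_boxes(img, pred_boxes, gt_boxes=None):
    plt.imshow(img)
    ax = plt.gca()
    for boxes, color in [(pred_boxes, 'r'), (gt_boxes or [], 'g')]:
        for x1, y1, x2, y2 in boxes:
            ax.add_patch(plt.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                       fill=False, color=color, linewidth=2))
    plt.axis('off')
    plt.show()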

ghost commented 4 years ago

Hi,

If I want to train on Openimages v6 dataset with 600 classes in 30 GB sets, is it recommended to train all the layers or just the classification head?

And does the classification head consist of both class_embed and bbox_embed, or just class_embed?

Finally, if I set num_queries = 700, and 500 epochs, would that be alright?

m-klasen commented 4 years ago

@kratosld

Hi,

If I want to train on Openimages v6 dataset with 600 classes in 30 GB sets, is it recommended to train all the layers or just the classification head?

And, does the classification head consist of class_embed and bbox_embed or just the class_embed

Finally, if I set num_queries = 700, and 500 epochs, would that be alright?

Just remove class_embed.weight & class_embed.bias and keep the rest. Unless you literally have 700 items in an image, do not change num_queries; keep it at 100, which gives you 100 proposed boxes per image. Changing num_queries would mean retraining the whole transformer, which is costly.

ghost commented 4 years ago

Alright, thank you!

lessw2020 commented 4 years ago

Just a quick update: I am getting really outstanding results on my object detection inference with DETR (via fine-tuning the res101 model).
I still have oddities (no mAP score, and the class-loss test curve is stuck at 100% error), but running the actual detections on test images today, it simply smoked EfficientDet (D0 and D1) by comparison. I'm sure this is because DETR can understand relationships, which is a big leap for this diagnostic work, where all the items are inter-related, and was the key reason I was so fired up to switch to DETR as soon as I read about the transformer architecture.
Anyway, I just wanted to post a big thanks to @fmassa and @alcinos especially, both for the help in getting training going (and for inventing DETR), and to @raviv and @mlk1337 for additional feedback here.
This is for malaria and covid work, FYI, so it has real-life impact. Thanks again! (Note I'm not signing off here - I still have lots more datasets to train, mAP to fix, etc. - but I did want to provide an update and say thanks!)

MHI4 commented 4 years ago

Hello All, I am quite a beginner in Python; my experience is only with MATLAB-based training. I was wondering whether anyone would be enthusiastic about preparing a Google Colab notebook for training on a custom dataset. It would help us learn the sequential training and validation steps. I appreciate your contributions, @lessw2020 @mlk1337 @raviv @fmassa @alcinos. Thank you all in advance.

lessw2020 commented 4 years ago

Hi @MHI4 - I can make a colab this weekend if no one beats me to it.

1 - Do you have a custom dataset I can use for testing, and a private email so we don't clog up this thread during dev/testing? My work datasets are private so I can't use those; alternatively, we can pick a smaller general-purpose detection dataset.

2 - Did you already see the gist from @mlk1337 (link to gist)? It gives you the key steps needed to fine-tune, though it takes a bit more setup knowledge vs. a colab with pluggable params for a dataset.

MHI4 commented 4 years ago

Hello @lessw2020, Thank you in advance for your Colab notebook.

I have started with a small open dataset from here. I have re-annotated the objects (1 class) in MATLAB and converted the annotations to XML files. Using voc2coco I've generated the JSON annotation files. Everything is uploaded here.

You can use this; it would be wonderful. Maybe you can add the command for the voc2coco-based XML-to-JSON conversion in the same Colab script for all users.

Thank You Again

lgvaz commented 4 years ago

@mlk1337

@lessw2020 https://github.com/facebookresearch/detr/blob/5617b89475faa21d4010c81ee2533e34a06014b5/util/plot_utils.py#L20

changed to pd.DataFrame(pd.np.stack(df.test_coco_eval_bbox.dropna().values)[:, 1]).ewm(com=ewm_col).mean() worked for me (for bounding boxes)

Do we have to do this when training with only bboxes (without masks)?

AlexAndrei98 commented 4 years ago

Hi @MHI4 - I can make a colab this weekend if no one beats me to it. [...]

I think a notebook to run on a custom dataset would be very helpful 🙌🏼🙌🏼

lgvaz commented 4 years ago

For those having problems training on custom datasets:

I'm writing a library that unifies a data API for object detection, and I just finished a tutorial on how to use it with Detr here.

The project provides a very flexible API for custom datasets while still using the original Detr source code for training; be sure to take a look =)

@MHI4, @AlexAndrei98 I'm tagging you because you're interested in a tutorial (the source code of the link I shared is a notebook btw, so you can run it).

lessw2020 commented 4 years ago

I started on a colab notebook today to walk through fine-tuning, though I didn't get as far as I expected because I have some design decisions to make, e.g. the easiest way to wrap custom datasets, and whether to do the training with all the args right in the notebook or run it via the shell command as currently. (I wrote my own class to handle it, but there might be an easier way, and I have trained both by setting all args in a notebook and with the shell command. Personally I like having the args listed and all available, so I think I'll proceed with that...) Here's a link to the start, though it just ramps up to the real issues atm: https://github.com/lessw2020/training-detr/blob/master/training_detr_colab.ipynb

lessw2020 commented 4 years ago

I've made more progress on the colab for custom training - it's at the point of building the model/post-processor/criterion, but I've hit a bit of a sticking point because num_classes in detr.py::build(args) is determined from the dataset name.
I simply modified detr.py for my own training, but that's not a good solution because it will break over time with new updates. I will open an issue, and probably a PR tonight if that's of interest, to close the loop on this so that simply passing in args.num_classes is supported directly (i.e. it defaults to 20 per the current code, but if the dataset is not coco or coco_panoptic, it will adjust num_classes). I think that's the cleanest solution without disrupting anything, and it avoids the need to manually edit detr.py.

[screenshot: detr-colab-training-progress]

lgvaz commented 4 years ago

args.num_classes is supported directly

@lessw2020 In the example I shared I implemented exactly that, in a backwards compatible way:

def build(args):
    if args.num_classes is not None:
        num_classes = args.num_classes
    else:
        num_classes = 20 if args.dataset_file != 'coco' else 91
        if args.dataset_file == "coco_panoptic":
            num_classes = 250

If it's of interest I can do a PR, just let me know =)

AlexAndrei98 commented 4 years ago

How does DETR compare to a FasterRCNN in terms of training time?

I have 2k images with roughly 10 classes each. I currently have a model that took me two days to train and performs rather well (roughly 10000 iters, batch size 4, using Colab Pro). Any tips on how to better tune some parameters? Thank you.

lessw2020 commented 4 years ago

1 - I made a new PR to hopefully cleanly and robustly handle supporting args.num_classes with full backwards compat: https://github.com/facebookresearch/detr/pull/89

2 - @lgvaz - thanks for the code snippet! I initially had something similar (a None check for args.num_classes, defaulting to None), but that will throw an exception if num_classes isn't present in args at all.
I have several modified main.py files (and others likely do as well) where args.num_classes would not be present, so I wrapped the check in a try/except block (sketched after point 3 below) and also defaulted it to 20 in both cases to stay backwards compatible with the previous code. I also went with simple if/then blocks to check for coco and coco_panoptic respectively, to keep it uber-readable.

3 - @lgvaz or others - do you happen to have a lightweight wrapper class for supporting datasets in coco format, e.g. handling class_id mapping, that I could use for the colab? I have my own coco class, but it needs reworking imo, though it does handle the class-id mapping issue that blew up my initial detr training. For the colab I'm trying to keep things light, minimize any new/external requirements, and integrate with detr as cleanly as possible... so I don't want to pull in a larger project like mantisshrimp for it. But from a quick look tonight at the mantisshrimp project, I definitely like some of the abstraction work you've done with the parser and datasets (reminds me of fastai), so if it could be split off as just a custom class wrapper, that would be ideal.
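
For reference, the fallback logic I mean is roughly this (a sketch of the idea, not the exact PR code):

# Fall back gracefully when args has no num_classes attribute at all.
try:
    num_classes = args.num_classes
except AttributeError:
    num_classes = None

if num_classes is None:
    num_classes = 20
    if args.dataset_file == 'coco':
        num_classes = 91
    if args.dataset_file == 'coco_panoptic':
        num_classes = 250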

lessw2020 commented 4 years ago

@AlexAndrei98 - I was able to train my custom dataset via fine-tuning in half a day on a V100.

For a 2K dataset I would try fine-tuning first, and I don't envision you would need 2 days, though obviously which GPU you have will impact that.
I had good results with the default params supplied (bs = 2, etc.), so I would also start with those, but you could do a shorter cycle up front and review before committing to the default 300 epochs. As an initial test, I trained for 60 epochs with the lr drop at 50. That was plenty to review the model with test data and get an idea, so you could try that to estimate how much total training time you'll need.

m-klasen commented 4 years ago

@lessw2020 Hi, did you have any success fine-tuning on a different backbone in your experiments? Afaik the resnet used is quite standard except for a short stem. However, despite training quite extensively, I never came close to the AP achieved with the provided resnet. Any thoughts on how to address this issue?

tazu786 commented 4 years ago

Hi, while formatting my custom dataset into coco format, I came across this (datasets/transforms.py, line 253 onward):

if "boxes" in target:
    boxes = target["boxes"]
    boxes = box_xyxy_to_cxcywh(boxes)
    boxes = boxes / torch.tensor([w, h, w, h], dtype=torch.float32)
    target["boxes"] = boxes

Why should I apply box_xyxy_to_cxcywh in the normalization of the bbox target, if the coco format for the bbox is already xywh (with xy the top-left corner)?

fmassa commented 4 years ago

@tazu786 in our implementation of the COCO dataset, we first convert the boxes to x1y1x2y2 format, see https://github.com/facebookresearch/detr/blob/1fcfc65f5cfef0836d349d618aa4afe30aeb838e/datasets/coco.py#L67
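
So there are two conversions in the pipeline. A small sketch to illustrate (box values made up):

import torch

# COCO annotations store [x, y, width, height] with (x, y) the top-left corner.
boxes_xywh = torch.tensor([[10., 20., 30., 40.]])

# Step 1 (datasets/coco.py): convert to [x1, y1, x2, y2].
boxes_xyxy = boxes_xywh.clone()
boxes_xyxy[:, 2:] += boxes_xyxy[:, :2]          # -> [[10., 20., 40., 60.]]

# Step 2 (datasets/transforms.py): convert to normalized [cx, cy, w, h],
# the format the model is actually trained on.
w, h = 640, 480
cxcywh = torch.stack([(boxes_xyxy[:, 0] + boxes_xyxy[:, 2]) / 2,
                      (boxes_xyxy[:, 1] + boxes_xyxy[:, 3]) / 2,
                      boxes_xyxy[:, 2] - boxes_xyxy[:, 0],
                      boxes_xyxy[:, 3] - boxes_xyxy[:, 1]], dim=1)
normalized = cxcywh / torch.tensor([w, h, w, h], dtype=torch.float32)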