IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0

DINO evaluation on custom dataset crashes #171

Open · alrightkami opened this issue 1 year ago

alrightkami commented 1 year ago

When fine-tuning with the following config:

import datetime
from detrex.config import get_config
from .models.dino_swin_large_384 import model

# number of classes
model.num_classes = 1

# get default config
dataloader = get_config("common/data/coco_detr_graffiti.py").dataloader
optimizer = get_config("common/optim.py").AdamW
lr_multiplier = get_config("common/coco_schedule.py").lr_multiplier_12ep
train = get_config("common/train.py").train

# modify training config
train.init_checkpoint = "./projects/dino/dino_swin_large_4scale_12ep.pth"
train.output_dir = "./output/graffiti_dino/satellit" + datetime.datetime.now().strftime("%d%m_%H%M")

# max training iterations
train.max_iter = 50000

# run evaluation every n iters
train.eval_period = 100000  # FYI evaluation crashes! TODO fix

# log training information every 20 iters
train.log_period = 20

# save checkpoint every n iters
train.checkpointer.period = 9999

# gradient clipping for training
train.clip_grad.enabled = True
train.clip_grad.params.max_norm = 0.1
train.clip_grad.params.norm_type = 2

# set training devices
train.device = "cuda" # or "cuda:1" or "cpu"
model.device = train.device

# modify optimizer config
optimizer.lr = 1e-4
optimizer.betas = (0.9, 0.999)
optimizer.weight_decay = 1e-4
optimizer.params.lr_factor_func = lambda module_name: 0.1 if "backbone" in module_name else 1

# modify dataloader config
dataloader.train.num_workers = 2
dataloader.train.dataset.filter_empty = False 

# please note that this is the total batch size
# e.g. if you're using 4 GPUs with a total batch size of 16,
# the batch size on each GPU is 16/4 = 4
dataloader.train.total_batch_size = 2

# data
dataloader.train.dataset.names = 'graffiti_train'
dataloader.test.dataset.names = 'graffiti_test'
dataloader.evaluator.dataset_name = 'graffiti_val'
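# NB: dataloader.test iterates 'graffiti_test' while the evaluator loads
# 'graffiti_val' as ground truth; COCOEvaluator checks the predicted image
# ids against dataset_name's annotation file, so pointing these at two
# different splits can by itself trigger
# "Results do not correspond to current coco set".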

# dump the testing results into output_dir for visualization
dataloader.evaluator.output_dir = train.output_dir

I'm getting an exception:

[12/11 10:28:13 d2.utils.events]:  eta: 20:01:48  iter: 79  total_loss: 11.8  loss_class: 0.3539  loss_bbox: 0.1169  loss_giou: 0.3934  loss_class_0: 0.5056  loss_bbox_0: 0.1242  loss_giou_0: 0.4234  loss_class_1: 0.5262  loss_bbox_1: 0.09443  loss_giou_1: 0.3764  loss_class_2: 0.409  loss_bbox_2: 0.111  loss_giou_2: 0.3898  loss_class_3: 0.3466  loss_bbox_3: 0.1062  loss_giou_3: 0.3604  loss_class_4: 0.3582  loss_bbox_4: 0.1168  loss_giou_4: 0.3903  loss_class_enc: 0.6693  loss_bbox_enc: 0.147  loss_giou_enc: 0.5353  loss_class_dn: 0.01112  loss_bbox_dn: 0.1165  loss_giou_dn: 0.4171  loss_class_dn_0: 0.1569  loss_bbox_dn_0: 0.219  loss_giou_dn_0: 0.6382  loss_class_dn_1: 0.03851  loss_bbox_dn_1: 0.1345  loss_giou_dn_1: 0.4488  loss_class_dn_2: 0.02079  loss_bbox_dn_2: 0.1128  loss_giou_dn_2: 0.407  loss_class_dn_3: 0.01118  loss_bbox_dn_3: 0.1179  loss_giou_dn_3: 0.4205  loss_class_dn_4: 0.008087  loss_bbox_dn_4: 0.1166  loss_giou_dn_4: 0.4164  time: 1.3978  data_time: 0.0050  lr: 0.0001  max_mem: 35711M
[12/11 10:28:40 d2.data.datasets.coco]: Loaded 37 images in COCO format from /home/jovyan/data/kamila/data/satellit/splitted/test/annotations/instances_default.json
[12/11 10:28:40 d2.data.build]: Distribution of instances among all 1 categories:
|  category  | #instances   |
|:----------:|:-------------|
|  graffiti  | 91           |
|            |              |
[12/11 10:28:40 d2.data.common]: Serializing 37 elements to byte tensors and concatenating them all ...
[12/11 10:28:40 d2.data.common]: Serialized dataset takes 0.09 MiB
[12/11 10:28:40 d2.evaluation.evaluator]: Start inference on 37 batches
[12/11 10:28:43 d2.evaluation.evaluator]: Inference done 11/37. Dataloading: 0.0010 s/iter. Inference: 0.2747 s/iter. Eval: 0.0006 s/iter. Total: 0.2763 s/iter. ETA=0:00:07
[12/11 10:28:49 d2.evaluation.evaluator]: Inference done 30/37. Dataloading: 0.0013 s/iter. Inference: 0.2739 s/iter. Eval: 0.0005 s/iter. Total: 0.2757 s/iter. ETA=0:00:01
[12/11 10:28:51 d2.evaluation.evaluator]: Total inference time: 0:00:08.947697 (0.279616 s / iter per device, on 1 devices)
[12/11 10:28:51 d2.evaluation.evaluator]: Total inference pure compute time: 0:00:08 (0.273675 s / iter per device, on 1 devices)
[12/11 10:28:51 d2.evaluation.coco_evaluation]: Preparing results for COCO format ...
[12/11 10:28:51 d2.evaluation.coco_evaluation]: Saving results to ./output/graffiti_dino/satellit1112_1026/coco_instances_results.json
[12/11 10:28:51 d2.evaluation.coco_evaluation]: Evaluating predictions with unofficial COCO API...
Loading and preparing results...
ERROR [12/11 10:28:51 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/train_loop.py", line 150, in train
    self.after_step()
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/train_loop.py", line 180, in after_step
    h.after_step()
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/hooks.py", line 555, in after_step
    self._do_eval()
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/hooks.py", line 528, in _do_eval
    results = self._func()
  File "tools/train_net_satellit_graffiti.py", line 194, in <lambda>
    hooks.EvalHook(cfg.train.eval_period, lambda: do_test(cfg, model)),
  File "tools/train_net_satellit_graffiti.py", line 135, in do_test
    ret = inference_on_dataset(
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/evaluation/evaluator.py", line 204, in inference_on_dataset
    results = evaluator.evaluate()
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/evaluation/coco_evaluation.py", line 206, in evaluate
    self._eval_predictions(predictions, img_ids=img_ids)
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/evaluation/coco_evaluation.py", line 266, in _eval_predictions
    _evaluate_predictions_on_coco(
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/evaluation/coco_evaluation.py", line 590, in _evaluate_predictions_on_coco
    coco_dt = coco_gt.loadRes(coco_results)
  File "/opt/conda/lib/python3.8/site-packages/pycocotools/coco.py", line 327, in loadRes
    assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())), \
AssertionError: Results do not correspond to current coco set
[12/11 10:28:51 d2.engine.hooks]: Overall training speed: 97 iterations in 0:02:16 (1.4051 s / it)
[12/11 10:28:51 d2.engine.hooks]: Total training time: 0:02:27 (0:00:11 on hooks)
[12/11 10:28:51 d2.utils.events]:  eta: 20:00:31  iter: 99  total_loss: 8.968  loss_class: 0.3145  loss_bbox: 0.07012  loss_giou: 0.3334  loss_class_0: 0.4652  loss_bbox_0: 0.08328  loss_giou_0: 0.3327  loss_class_1: 0.5017  loss_bbox_1: 0.06474  loss_giou_1: 0.2563  loss_class_2: 0.3895  loss_bbox_2: 0.07384  loss_giou_2: 0.3273  loss_class_3: 0.3095  loss_bbox_3: 0.06951  loss_giou_3: 0.329  loss_class_4: 0.3126  loss_bbox_4: 0.07025  loss_giou_4: 0.3316  loss_class_enc: 0.5665  loss_bbox_enc: 0.1357  loss_giou_enc: 0.4332  loss_class_dn: 0.002992  loss_bbox_dn: 0.07017  loss_giou_dn: 0.3381  loss_class_dn_0: 0.1288  loss_bbox_dn_0: 0.1546  loss_giou_dn_0: 0.5158  loss_class_dn_1: 0.02353  loss_bbox_dn_1: 0.08345  loss_giou_dn_1: 0.3674  loss_class_dn_2: 0.01641  loss_bbox_dn_2: 0.07573  loss_giou_dn_2: 0.3332  loss_class_dn_3: 0.009212  loss_bbox_dn_3: 0.07121  loss_giou_dn_3: 0.3396  loss_class_dn_4: 0.00324  loss_bbox_dn_4: 0.07031  loss_giou_dn_4: 0.3378  time: 1.3907  data_time: 0.0049  lr: 0.0001  max_mem: 35711M
Traceback (most recent call last):
  File "tools/train_net_satellit_graffiti.py", line 232, in <module>
    launch(
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "tools/train_net_satellit_graffiti.py", line 227, in main
    do_train(args, cfg)
  File "tools/train_net_satellit_graffiti.py", line 211, in do_train
    trainer.train(start_iter, cfg.train.max_iter)
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/train_loop.py", line 150, in train
    self.after_step()
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/train_loop.py", line 180, in after_step
    h.after_step()
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/hooks.py", line 555, in after_step
    self._do_eval()
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/engine/hooks.py", line 528, in _do_eval
    results = self._func()
  File "tools/train_net_satellit_graffiti.py", line 194, in <lambda>
    hooks.EvalHook(cfg.train.eval_period, lambda: do_test(cfg, model)),
  File "tools/train_net_satellit_graffiti.py", line 135, in do_test
    ret = inference_on_dataset(
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/evaluation/evaluator.py", line 204, in inference_on_dataset
    results = evaluator.evaluate()
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/evaluation/coco_evaluation.py", line 206, in evaluate
    self._eval_predictions(predictions, img_ids=img_ids)
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/evaluation/coco_evaluation.py", line 266, in _eval_predictions
    _evaluate_predictions_on_coco(
  File "/home/jovyan/data/kamila/detrex/detectron2/detectron2/evaluation/coco_evaluation.py", line 590, in _evaluate_predictions_on_coco
    coco_dt = coco_gt.loadRes(coco_results)
  File "/opt/conda/lib/python3.8/site-packages/pycocotools/coco.py", line 327, in loadRes
    assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())), \
AssertionError: Results do not correspond to current coco set
SlongLiu commented 1 year ago

Are you evaluating the COCO pre-trained models on your custom dataset? You need to train a new model for your own dataset.

alrightkami commented 1 year ago

@SlongLiu I'm not evaluating; I'm fine-tuning your pre-trained weights on a custom class with train_net.py

alrightkami commented 1 year ago

@SlongLiu Any updates? Were you able to reproduce the issue?

SlongLiu commented 1 year ago

As I do not know the custom dataset you used, I cannot reproduce the error.

However, the error is raised by cocoapi. There are some similar problems in other projects, which may be helpful. https://github.com/facebookresearch/detectron2/issues/1570 https://github.com/cocodataset/cocoapi/issues/497

One possible reason: https://github.com/facebookresearch/detectron2/issues/1570#issuecomment-641002199

If your data loading function get_kaist_dicts does not return the same data each time (seems to be the case in your code), the predicted image ids will not match the original image ids.
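If helpful, here is one quick way to check that in detectron2 (a minimal sketch; 'graffiti_test' is the dataset name from your config):

from detectron2.data import DatasetCatalog

# The registered loader should return the same image ids on every call;
# if the two lists differ, the predicted ids and the ground-truth ids
# will diverge and loadRes will raise the assertion above.
ids_a = [d["image_id"] for d in DatasetCatalog.get("graffiti_test")]
ids_b = [d["image_id"] for d in DatasetCatalog.get("graffiti_test")]
assert ids_a == ids_b, "dataset loading is not deterministic"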

alrightkami commented 1 year ago

@SlongLiu It does not matter which custom dataset you use; as long as it is custom and not COCO, the error is reproducible. Also, I do not understand why you are pointing at somebody else's code; I do not have a function get_kaist_dicts.

SlongLiu commented 1 year ago

We did not encounter this problem with our other custom datasets.

You seem to have other custom code, such as the coco_detr_graffiti.py file, which prevents us from reproducing the issue.

I do not think the problem is caused by our project, as our custom dataset design inherits from detectron2. Hence, I suggest you refer to similar issues in that project, which may be helpful.

alrightkami commented 1 year ago

@SlongLiu the custom code in coco_detr_graffiti.py is not altering anything; the issue is reproducible with coco_detr.py as well. I just tried it and I'm getting the exact same error.

SlongLiu commented 1 year ago

Thanks for the question; I will check it out soon.

Could you provide more details about the commands you used, as well as any other modified code, if there is any?

alrightkami commented 1 year ago

@SlongLiu sure, the command I used is !cd /home/jovyan/data/kamila/detrex && python tools/train_net_satellit_graffiti.py --config-file projects/dino/configs/dino_swin_large_384_4scale_12ep_graffiti_satellit.py

The config is the same as above, except for the dataloader: dataloader = get_config("common/data/coco_detr.py").dataloader. However, as I mentioned before, both the custom and the standard dataloader lead to the same issue.

And train_net_satellit_graffiti.py is a copy of train_net.py with a single change: I register my datasets with detectron2 in the main method:

import glob

from detectron2.config import LazyConfig
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import default_setup


def register_data():
    """
    Registers all of the graffiti data in Detectron's DatasetCatalog.
    """
    dirs = ['train', 'val', 'test']
    dataset_size = []

    for i in range(len(dirs)):
        folder = '/home/jovyan/data/kamila/data/{}/'.format(dirs[i])
        annotations = folder + 'annotations/instances_default.json'
        images = folder + 'images/'
        dataset_name = 'graffiti_{}'.format(dirs[i])

        # register the split as a COCO-format dataset
        register_coco_instances(
            dataset_name,
            {},
            annotations,
            images)

        dataset_size.append(len(glob.glob(images + '*.jpg')))

        print('[{}/{}] The following dataset was successfully registered: {} \npath: {} \ntotal_images: {}'.format(
            i + 1, len(dirs), dataset_name, folder, dataset_size[i]))


def main(args):
    register_data()
    cfg = LazyConfig.load(args.config_file)
    cfg = LazyConfig.apply_overrides(cfg, args.opts)
    default_setup(cfg, args)
    ...
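For reference, a quick sanity check that can be run right after register_data() to confirm the splits load (a standalone sketch, not part of train_net.py):

from detectron2.data import DatasetCatalog, MetadataCatalog

# Each split should load and expose the single 'graffiti' class.
for name in ['graffiti_train', 'graffiti_val', 'graffiti_test']:
    dicts = DatasetCatalog.get(name)
    print(name, len(dicts), MetadataCatalog.get(name).thing_classes)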
SlongLiu commented 1 year ago

Thanks for the details.

I tested the model with the modifications you provided and found that you are still running inference on your custom dataset in your latest experiments. Although you changed the line dataloader = get_config("common/data/coco_detr.py").dataloader, lines 57-59 of the config file still point the dataset at your custom data.

Lines 57-59:

dataloader.train.dataset.names = 'graffiti_train'
dataloader.test.dataset.names = 'graffiti_test'
dataloader.evaluator.dataset_name = 'graffiti_val'

When I directly ran the shell command you provided, it raised an error:

Traceback (most recent call last):
  File "tools/train_net_satellit_graffiti.py", line 262, in <module>
    args=(args,),
  File "/home/liushilong/code/maskdino_raw/detectron2/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "tools/train_net_satellit_graffiti.py", line 249, in main
    print(do_test(cfg, model))
  File "tools/train_net_satellit_graffiti.py", line 136, in do_test
    model, instantiate(cfg.dataloader.test), instantiate(cfg.dataloader.evaluator)
  File "/home/liushilong/code/maskdino_raw/detectron2/detectron2/config/instantiate.py", line 67, in instantiate
    cfg = {k: instantiate(v) for k, v in cfg.items()}
  File "/home/liushilong/code/maskdino_raw/detectron2/detectron2/config/instantiate.py", line 67, in <dictcomp>
    cfg = {k: instantiate(v) for k, v in cfg.items()}
  File "/home/liushilong/code/maskdino_raw/detectron2/detectron2/config/instantiate.py", line 83, in instantiate
    return cls(**cfg)
  File "/home/liushilong/code/maskdino_raw/detectron2/detectron2/data/build.py", line 241, in get_detection_dataset_dicts
    dataset_dicts = [DatasetCatalog.get(dataset_name) for dataset_name in names]
  File "/home/liushilong/code/maskdino_raw/detectron2/detectron2/data/build.py", line 241, in <listcomp>
    dataset_dicts = [DatasetCatalog.get(dataset_name) for dataset_name in names]
  File "/home/liushilong/code/maskdino_raw/detectron2/detectron2/data/catalog.py", line 58, in get
    return f()
  File "/home/liushilong/code/maskdino_raw/detectron2/detectron2/data/datasets/coco.py", line 500, in <lambda>
    DatasetCatalog.register(name, lambda: load_coco_json(json_file, image_root, name))
  File "/home/liushilong/code/maskdino_raw/detectron2/detectron2/data/datasets/coco.py", line 69, in load_coco_json
    coco_api = COCO(json_file)
  File "/home/liushilong/anaconda3/envs/q2x/lib/python3.7/site-packages/pycocotools/coco.py", line 84, in __init__
    with open(annotation_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/data/kamila/data/test/annotations/instances_default.json'

It means the code still runs on your custom dataset. (The dataset registration process ran successfully.)

When commenting out lines 57-59, the inference runs successfully on COCO without any error.

I suggest stepping into the error lines, i.e.,

  File "/opt/conda/lib/python3.8/site-packages/pycocotools/coco.py", line 327, in loadRes
    assert set(annsImgIds) == (set(annsImgIds) & set(self.getImgIds())), \

to compare the ground-truth image ids with the model-predicted image ids.
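A minimal version of that comparison, assuming the predictions were dumped to coco_instances_results.json in train.output_dir as configured above (both paths below are placeholders):

import json
from pycocotools.coco import COCO

# Ground truth the evaluator loads (the dataset_name annotation file).
coco_gt = COCO('/path/to/annotations/instances_default.json')

# Predictions dumped by COCOEvaluator into output_dir.
with open('/path/to/output_dir/coco_instances_results.json') as f:
    results = json.load(f)

pred_ids = {r['image_id'] for r in results}
gt_ids = set(coco_gt.getImgIds())

# Any id printed here is what trips the assertion in loadRes.
print('predicted ids not in ground truth:', sorted(pred_ids - gt_ids))

If the sets differ because dataloader.test and dataloader.evaluator point at different splits (graffiti_test vs. graffiti_val in your config), aligning the two names may already resolve the assertion.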

alrightkami commented 1 year ago

@SlongLiu It is pretty obvious my code crashes if you run it, because you don't have the data I have. That is what the error FileNotFoundError: [Errno 2] No such file or directory: '/home/jovyan/data/kamila/data/test/annotations/instances_default.json' means: the file is located on my machine, not yours.

Let me now repeat myself.

What I want to do: I want to use your pre-trained weights and fine-tune them on my custom data with one class, 'graffiti'. I do not want to evaluate an already-trained model, and I do not want to run inference on COCO. I want to see the evaluation of the model during training every n iterations, as set by train.eval_period in the config file, i.e. while the model is training I want to see every n iterations how/whether it is improving.

What I did: As stated in the tutorial, I added the init_checkpoint of the pre-trained model, my custom data, and model.num_classes = 1 to the config file and ran train_net.py.

What happened: Apparently, instead of being evaluated on the custom class I am currently training it on, the model is being evaluated on the COCO classes. That is not what I need: when fine-tuning on my custom class and custom data, I need to see the performance on my custom class.

Can you tell me what I am doing wrong? Is this not the way to achieve what I want?

@rentainhe it would be really nice if you could help out here, because I opened this issue a month ago and there has been no progress whatsoever.

SlongLiu commented 1 year ago

@alrightkami

Hello, I want to explain what my experiments show:

@SlongLiu the custom code in coco_detr_graffiti.py is not altering anything, the issue is reproducible with coco_detr.py as well. I just tried it and I'm getting the exact same error

You said the model crashed on COCO as well in your experiments. The reason is that you were still running inference on your custom dataset. The model works well on COCO.

Hence, the reason for your error is your custom dataset.

alrightkami commented 1 year ago

@SlongLiu Of course the model can work well on COCO, but that is not what I need. While training on the custom class 'graffiti', I want the model's performance on this custom class 'graffiti', not on people, cars, or birds. So here I am, asking what I am doing wrong and how I can achieve that. But all you tell me is that your model works fine on COCO.

alrightkami commented 1 year ago

Are there any updates? @rentainhe