facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

Training Mask RCNN on a new dataset #77

Open fatemeh-slh opened 6 years ago

fatemeh-slh commented 6 years ago

Hi,

I have a new dataset and have prepared COCO-like annotations for an instance-level semantic segmentation task. When I try to train Mask RCNN on my own data, I get the following error. What is TRAIN.PROPOSAL_FILES in the config file, and how should I fill it in for a new dataset?

Traceback (most recent call last):
  File "tools/train_net.py", line 358, in <module>
    main()
  File "tools/train_net.py", line 196, in main
    checkpoints = train_model()
  File "tools/train_net.py", line 210, in train_model
    setup_model_for_training(model, output_dir)
  File "tools/train_net.py", line 308, in setup_model_for_training
    add_model_training_inputs(model)
  File "tools/train_net.py", line 332, in add_model_training_inputs
    cfg.TRAIN.DATASETS, cfg.TRAIN.PROPOSAL_FILES
  File "/detectron/lib/datasets/roidb.py", line 60, in combined_roidb_for_training
    assert len(dataset_names) == len(proposal_files)
AssertionError
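
The assertion at the bottom of this traceback compares the number of entries in cfg.TRAIN.DATASETS with the number of entries in cfg.TRAIN.PROPOSAL_FILES. The sketch below illustrates that kind of length check in plain Python. It is not Detectron's code: the function name `pair_datasets_with_proposals` and the dataset name are hypothetical, and treating an empty proposal list as "no precomputed proposals for any dataset" is an assumption about how such a loader might reasonably behave.

```python
def pair_datasets_with_proposals(dataset_names, proposal_files):
    """Pair each dataset name with a proposal file, or None (hypothetical helper)."""
    # Accept a bare string as a one-element tuple.
    if isinstance(dataset_names, str):
        dataset_names = (dataset_names,)
    if isinstance(proposal_files, str):
        proposal_files = (proposal_files,)

    # Assumption: an empty proposal list means "no precomputed proposals",
    # so every dataset is paired with None.
    if len(proposal_files) == 0:
        proposal_files = (None,) * len(dataset_names)

    # The two sequences must line up one-to-one, otherwise the config is inconsistent.
    if len(dataset_names) != len(proposal_files):
        raise ValueError(
            "got %d dataset(s) but %d proposal file(s)"
            % (len(dataset_names), len(proposal_files))
        )
    return list(zip(dataset_names, proposal_files))


# One dataset, no proposal files: the pairing succeeds with None.
pairs = pair_datasets_with_proposals(("my_dataset_train",), ())
```

Under these assumptions, a mismatch only arises when proposal files are listed but their count differs from the number of datasets.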

I also tried to train on the COCO dataset and got the same error. When I comment out PROPOSAL_FILES in the config file, I get this output:

CRITICAL train_net.py: 236: Loss is NaN, exiting...
INFO loader.py: 126: Stopping enqueue thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread

What is your suggestion? Thanks.

nonstop1962 commented 6 years ago

My case is not exactly the same as yours, but my 'Loss is NaN' problem was solved by lowering the learning rate (e.g. 0.001 --> 0.0001).

Devincool commented 6 years ago

Could you teach me how to make a custom COCO-like dataset? Thanks!

fatemeh-slh commented 6 years ago

I finally figured out that if I use the end-to-end models, I don't need proposal files anymore. To produce the COCO-like format, I converted binary masks to polygons and then used the format described at http://cocodataset.org/#download. I don't have any problems with detection tasks on my own data, but I still get this error when I try to train Mask RCNN:


0201 18:37:39.300101 56408 context_gpu.cu:325] Total: 2638 MB
I0201 18:37:39.324462 56409 context_gpu.cu:321] GPU 0: 2767 MB
I0201 18:37:39.324494 56409 context_gpu.cu:325] Total: 2767 MB
I0201 18:37:39.354914 56410 context_gpu.cu:321] GPU 0: 2911 MB
I0201 18:37:39.354959 56410 context_gpu.cu:325] Total: 2911 MB
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at blob.h:94] IsType<T>(). wrong type for the Blob instance. Blob contains nullptr (uninitialized) while caller expects caffe2::Tensor<caffe2::CUDAContext> .
Offending Blob name: gpu_0/_[mask]_fcn1_w.
Error from operator: 
input: "gpu_0/_[mask]_roi_feat" input: "gpu_0/_[mask]_fcn1_w" input: "gpu_0/_[mask]_fcn1_b" output: "gpu_0/_[mask]_fcn1" name: "" type: "Conv" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
*** Aborted at 1517470659 (unix time) try "date -d @1517470659" if you are using GNU date ***
PC: @     0x7fc605f0c428 gsignal
*** SIGABRT (@0x8e0700000dc12) received by PID 56338 (TID 0x7fc4d3fff700) from PID 56338; stack trace: ***
    @     0x7fc6062b2390 (unknown)
    @     0x7fc605f0c428 gsignal
    @     0x7fc605f0e02a abort
    @     0x7fc60378684d __gnu_cxx::__verbose_terminate_handler()
    @     0x7fc6037846b6 (unknown)
    @     0x7fc603784701 std::terminate()
    @     0x7fc6037afd38 (unknown)
    @     0x7fc6062a86ba start_thread
    @     0x7fc605fde41d clone
    @                0x0 (unknown)
Aborted

I used the cv2.findContours function to obtain polygons from the binary masks. Do you have any idea what causes this error? Is it related to the JSON format?
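
For readers following the mask-to-polygon step, here is a minimal sketch of flattening contours into COCO-style polygon lists of the form [x1, y1, x2, y2, ...]. It assumes the contours have the nesting that cv2.findContours produces (each point wrapped as [[x, y]]), but it takes plain Python lists so that it runs without OpenCV. The function name `contours_to_coco_segmentation` is hypothetical. It drops contours with fewer than three points, since those cannot form a polygon; whether such degenerate polygons have anything to do with the error above is not established in this thread.

```python
def contours_to_coco_segmentation(contours, min_points=3):
    """Flatten contours into COCO polygons: [[x1, y1, x2, y2, ...], ...]."""
    polygons = []
    for contour in contours:
        flat = []
        for point in contour:
            # OpenCV wraps each contour point as [[x, y]]; unwrap it if so.
            if len(point) == 1:
                point = point[0]
            x, y = point
            flat.extend([float(x), float(y)])
        # Keep only contours that have enough points to enclose an area.
        if len(flat) >= 2 * min_points:
            polygons.append(flat)
    return polygons


# A 4-point square and a 2-point degenerate contour (the latter is dropped).
square = [[[0, 0]], [[0, 4]], [[4, 4]], [[4, 0]]]
degenerate = [[[1, 1]], [[2, 2]]]
segmentation = contours_to_coco_segmentation([square, degenerate])
```

The resulting list can be stored under the "segmentation" key of an annotation entry.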

Thanks.

xuhuaren commented 6 years ago

Hi, has your problem been solved? I'd like to ask how to make a COCO-like JSON file.

topcomma commented 6 years ago

Hi,

I have encountered the same issue when training on extended COCO-like JSON data. @faticom, @xuhuaren, did you fix it?

Thanks.

ycAlex11 commented 6 years ago

Hi, I have managed to create my own COCO-like JSON file. To create your own, you need to make sure the data structure in your coco-like.json is the same as in COCO's file, and you need to write your own code to create it. The original COCO JSON file is one big dict with a few keys: info (which you don't really need), licenses (I guess this is not necessary), images, annotations, and categories.

So in your coco-like.json file, make sure the top-level structure is a dict with at least three keys: images, annotations, and categories. Each key's value is a list, and each element of that list is another dict. The images list has one entry per picture, the annotations list has one entry per object instance, and the categories list has one entry per class.

If your JSON file has the same structure, I think you should be able to train the model with your own data. @xuhuaren @topcomma
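
The structure described above can be sketched as follows. This is a minimal illustration rather than a complete converter: the helper names `build_coco_like` and `check_references`, the file name, the class name, and all numeric values are made up for the example. The field names follow the public COCO instance annotation format (bbox is [x, y, width, height]).

```python
import json


def build_coco_like(images, annotations, categories):
    """Assemble the three required top-level lists into one dict."""
    return {
        "images": list(images),
        "annotations": list(annotations),
        "categories": list(categories),
    }


def check_references(dataset):
    """Return True if every annotation points at an existing image and category."""
    image_ids = {img["id"] for img in dataset["images"]}
    category_ids = {cat["id"] for cat in dataset["categories"]}
    return all(
        ann["image_id"] in image_ids and ann["category_id"] in category_ids
        for ann in dataset["annotations"]
    )


dataset = build_coco_like(
    images=[{"id": 1, "file_name": "example_0001.jpg", "width": 640, "height": 480}],
    annotations=[{
        "id": 1,
        "image_id": 1,
        "category_id": 1,
        # One polygon, flattened as [x1, y1, x2, y2, ...].
        "segmentation": [[10.0, 10.0, 10.0, 50.0, 50.0, 50.0, 50.0, 10.0]],
        "bbox": [10.0, 10.0, 40.0, 40.0],
        "area": 1600.0,
        "iscrowd": 0,
    }],
    categories=[{"id": 1, "name": "example_class", "supercategory": "none"}],
)

# Serialize to a string; use json.dump with an open file to write it to disk.
coco_json = json.dumps(dataset, indent=2)
```

Running `check_references(dataset)` before training is a cheap way to catch annotations that point at missing image or category ids.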

wangzhangup commented 6 years ago

@xuhuaren @Devincool Here is my code to create a COCO-style dataset.