Partially labelled dataset

Fabio-Arup-Panella commented 3 years ago

Hi Yen-Cheng, I am working on a project where, because of some issues, we were able to label only part of the dataset. Say that out of 500 images only 120 were labelled. Is it possible to use all 120 as labelled training data and the rest as unlabelled training data? If so, how do you recommend approaching this? Below is an example of the annotations (of course I can modify the format).

{"source-ref":"s3://bucketName/imgName1.png","Dataset_BB":{"annotations":[{"left":2726,"top":675,"width":92,"height":324,"class_id":2},{"left":2352,"top":799,"width":54,"height":193,"class_id":2},{"left":3473,"top":731,"width":68,"height":303,"class_id":2},{"left":3784,"top":869,"width":51,"height":178,"class_id":2},{"left":3900,"top":929,"width":33,"height":121,"class_id":2},{"left":2237,"top":868,"width":35,"height":125,"class_id":2},{"left":2184,"top":902,"width":27,"height":94,"class_id":2},{"left":1965,"top":898,"width":52,"height":12,"class_id":0},{"left":1939,"top":869,"width":66,"height":18,"class_id":0},{"left":1893,"top":823,"width":93,"height":21,"class_id":0},{"left":1790,"top":718,"width":153,"height":35,"class_id":0},{"left":1416,"top":411,"width":304,"height":145,"class_id":0},{"left":268,"top":510,"width":272,"height":112,"class_id":0},{"left":112,"top":798,"width":138,"height":32,"class_id":0},{"left":3637,"top":667,"width":33,"height":36,"class_id":4},{"left":2381,"top":756,"width":15,"height":15,"class_id":4}],"image_size":[{"width":4096,"height":2048,"depth":3}]},"Dataset_BB-metadata":{"job-name":"labeling-job/Dataset_BB","class-map":{"0":"Idler","2":"Pipe_Bracket","4":"Ring_Number"},"human-annotated":"yes","objects":[{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1}],"creation-date":"2020-10-05T20:42:53.052Z","type":"groundtruth/object-detection"}}
{"source-ref":"s3://bucketName/imgName2.png","Dataset_BB":{"annotations":[{"left":1353,"top":366,"width":306,"height":172,"class_id":0},{"left":235,"top":549,"width":263,"height":96,"class_id":0},{"left":103,"top":807,"width":133,"height":32,"class_id":0},{"left":1772,"top":710,"width":166,"height":32,"class_id":0},{"left":1884,"top":817,"width":102,"height":22,"class_id":0},{"left":1963,"top":899,"width":55,"height":10,"class_id":0},{"left":1934,"top":869,"width":71,"height":15,"class_id":0},{"left":3520,"top":745,"width":68,"height":295,"class_id":2},{"left":2783,"top":661,"width":95,"height":348,"class_id":2},{"left":3800,"top":874,"width":49,"height":173,"class_id":2},{"left":3903,"top":930,"width":34,"height":119,"class_id":2},{"left":2370,"top":788,"width":59,"height":205,"class_id":2},{"left":2243,"top":867,"width":36,"height":126,"class_id":2},{"left":2188,"top":900,"width":26,"height":94,"class_id":2},{"left":3666,"top":687,"width":30,"height":31,"class_id":4},{"left":2400,"top":746,"width":15,"height":14,"class_id":4}],"image_size":[{"width":4096,"height":2048,"depth":3}]},"Dataset_BB-metadata":{"job-name":"labeling-job/Dataset_BB","class-map":{"0":"Idler","2":"Pipe_Bracket","4":"Ring_Number"},"human-annotated":"yes","objects":[{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1},{"confidence":1}],"creation-date":"2020-10-05T20:39:16.127Z","type":"groundtruth/object-detection"}}
{"source-ref":"s3://bucketName/imgName3.png"}
{"source-ref":"s3://bucketName/imgName4.png"}

vlfom commented 3 years ago

If it helps, I would suggest that you:

check the official Detectron2's tutorial on custom datasets (and follow the Colab notebook) so you know how you can add & use your custom dataset
set up your dataset for supervised detection first and make sure the training works OK
finally, add the unsupervised training component

Regarding the last part: to pick the images for the supervised and unsupervised parts, the authors randomly split the images at the beginning by sampling a list of indices (see divide_label_unlabel here). In the current implementation, however, they read pre-generated indices to make the results reproducible. You can use exactly the same trick to distinguish between labeled and unlabeled images. E.g. you can order the images in your dataset so that the first 120 are labeled and the remaining 380 are not, and reflect that split in the seed indices (or just hardcode it).
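To make the ordering idea concrete, here is a minimal sketch in plain Python. It assumes a JSON Lines manifest like the one in the question, where an image counts as labeled if it carries the "Dataset_BB" key; split_manifest is a hypothetical helper, not part of unbiased-teacher or Detectron2.

```python
import json


def split_manifest(manifest_lines, label_key="Dataset_BB"):
    """Order entries labeled-first and return index lists for both groups.

    Assumption: an entry is "labeled" if it contains `label_key` at all.
    """
    labeled, unlabeled = [], []
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the manifest
        entry = json.loads(line)
        (labeled if label_key in entry else unlabeled).append(entry)

    ordered = labeled + unlabeled
    label_idx = list(range(len(labeled)))
    unlabel_idx = list(range(len(labeled), len(ordered)))
    return ordered, label_idx, unlabel_idx


# Shortened example entries in the style of the manifest above.
manifest_lines = [
    '{"source-ref": "s3://bucketName/imgName3.png"}',
    '{"source-ref": "s3://bucketName/imgName1.png", "Dataset_BB": {"annotations": '
    '[{"left": 2726, "top": 675, "width": 92, "height": 324, "class_id": 2}]}}',
    '{"source-ref": "s3://bucketName/imgName4.png"}',
    '{"source-ref": "s3://bucketName/imgName2.png", "Dataset_BB": {"annotations": '
    '[{"left": 1353, "top": 366, "width": 306, "height": 172, "class_id": 0}]}}',
]

ordered, label_idx, unlabel_idx = split_manifest(manifest_lines)
print(label_idx, unlabel_idx)
```

With the full dataset this would give indices 0 to 119 for the labeled images and 120 to 499 for the unlabeled ones, which could then be used in place of the randomly sampled indices.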

For the images picked for the "unsupervised part", the authors simply delete the labels inside the training loop (see run_step_full_semisup here).
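The idea of dropping labels inside the loop can be sketched as follows. This is an illustration only, not the repository's code: remove_labels is a hypothetical helper, and it assumes each sample in a batch is a dict whose ground truth sits under an "instances" key.

```python
def remove_labels(batch, label_key="instances"):
    """Delete the ground-truth entry from every sample dict in the batch.

    Assumption: labels live under `label_key`; samples without it are left as-is.
    """
    for sample in batch:
        sample.pop(label_key, None)
    return batch


# Toy batch: strings stand in for the image tensor and the label object.
batch = [
    {"file_name": "imgName3.png", "image": "tensor-placeholder", "instances": "dummy boxes"},
    {"file_name": "imgName4.png", "image": "tensor-placeholder"},
]
unlabeled_batch = remove_labels(batch)
print([sorted(sample.keys()) for sample in unlabeled_batch])
```

Because the labels are discarded at this point, whatever annotations the unlabeled images carried beforehand never reach the loss.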

At this point, I am not sure whether you can supply Detectron2 with your 380 images without labels (it may skip them). If you can, just put your images in a format similar to the one you mentioned. If at least one bbox per image is required, one idea is to add some random annotations for them, since those would be removed inside the training loop anyway.
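For reference, a manifest entry could be turned into a Detectron2-style dataset dict roughly like this. It is a sketch under several assumptions: to_record and class_id_map are hypothetical names, the s3:// URI is kept as file_name although in practice it would need to be a local path, and bbox_mode is passed in by the caller (with Detectron2 installed it would be BoxMode.XYWH_ABS, since the boxes are left/top/width/height). The class ids 0, 2, 4 are remapped to 0, 1, 2 because Detectron2 expects contiguous category ids.

```python
def to_record(entry, image_id, class_id_map, bbox_mode, label_key="Dataset_BB"):
    """Convert one manifest entry into a dataset dict; unlabeled entries get no boxes."""
    record = {"file_name": entry["source-ref"], "image_id": image_id, "annotations": []}
    label = entry.get(label_key)
    if label is None:
        return record  # unlabeled image: empty annotation list, size unknown

    size = label["image_size"][0]
    record["height"] = size["height"]
    record["width"] = size["width"]
    for box in label["annotations"]:
        record["annotations"].append({
            "bbox": [box["left"], box["top"], box["width"], box["height"]],
            "bbox_mode": bbox_mode,
            "category_id": class_id_map[box["class_id"]],
        })
    return record


class_id_map = {0: 0, 2: 1, 4: 2}  # Idler, Pipe_Bracket, Ring_Number
labeled_entry = {
    "source-ref": "s3://bucketName/imgName1.png",
    "Dataset_BB": {
        "annotations": [{"left": 2726, "top": 675, "width": 92, "height": 324, "class_id": 2}],
        "image_size": [{"width": 4096, "height": 2048, "depth": 3}],
    },
}
unlabeled_entry = {"source-ref": "s3://bucketName/imgName3.png"}

# "XYWH_ABS" is a placeholder string; real code would pass BoxMode.XYWH_ABS.
records = [
    to_record(labeled_entry, 0, class_id_map, "XYWH_ABS"),
    to_record(unlabeled_entry, 1, class_id_map, "XYWH_ABS"),
]
print(records[0]["annotations"], records[1]["annotations"])
```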

sarmientoj24 commented 2 years ago

@vlfom

if yes, you can just put your images in a format similar to what you mentioned, but if at least 1bbox per image is required, one idea could be to add some random annotations for them, as, anyway, those would be removed inside the training loop.

Does this mean all images (both labeled and unlabeled) should have annotations with them?

icrto commented 2 years ago

@sarmientoj24 I think unlabeled images do not need to have annotations with them. You just need to make sure that the filter_empty field is set to False, as is done here.
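To show why that flag matters, here is a simplified stand-in for what an "empty annotations" filter does. This is not Detectron2's implementation: drop_empty is a hypothetical helper that removes every record whose annotation list is empty, which is exactly what would happen to the unlabeled images if such filtering were left on.

```python
def drop_empty(records):
    """Keep only records that have at least one annotation."""
    return [record for record in records if record.get("annotations")]


records = [
    {"file_name": "imgName1.png", "annotations": [{"bbox": [2726, 675, 92, 324], "category_id": 1}]},
    {"file_name": "imgName3.png", "annotations": []},
    {"file_name": "imgName4.png", "annotations": []},
]

kept_with_filter = drop_empty(records)  # filtering on: unlabeled images vanish
kept_without_filter = list(records)     # filtering off: all images are kept
print(len(kept_with_filter), len(kept_without_filter))
```

In stock Detectron2 the corresponding config option is cfg.DATALOADER.FILTER_EMPTY_ANNOTATIONS; whether the unlabeled loader in this repository reads that option or passes filter_empty=False directly is best checked in the linked code.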

facebookresearch / unbiased-teacher

Partially labelled dataset #10