Error while using RLE for Instance Segmentation

StudioTV commented 2 years ago

Hello guys!

First of all, thank you to the Facebook team for this amazing tool. I have been using detectron2 for a lot of projects and have never faced a problem like this before. I have a COCO dataset that uses RLE for segmentation; here is an example annotation:

"annotations": [{
"id": 0, 
"bbox": [140, 167, 182, 206], 
"segmentation": {
          "size": [266, 266], 
          "counts": [37430, 6, 37694, 10, 37954, 17, 38218, 21, 38482, 23, 38747, 25, 39012, 27, 39277, 28, 39542, 29, 39807, 30, 40072, 32, 40338, 32, 40603, 33, 40868, 35, 41134, 35, 41399, 37, 41665, 37, 41931, 37, 42197, 38, 42462, 39, 42728, 39, 42994, 39, 43260, 39, 43526, 39, 43792, 39, 44058, 39, 44324, 39, 44591, 38, 44857, 38, 45123, 38, 45390, 37, 45657, 36, 45923, 36, 46190, 34, 46456, 34, 46723, 32, 46989, 32, 47256, 30, 47523, 28, 47790, 26, 48058, 22, 48327, 16, 48597, 10]}, 
"image_id": 42, 
"ignore": 0, 
"category_id": 3, 
"iscrowd": 0, 
"area": 1638.0}

Since the end of 2019, detectron2 has been natively able to read RLE, as seen in the official documentation.

But it doesn't actually work. I tried a lot of parameters and still get a really weird error message that I can't debug, which is why I am asking you guys. I tried with and without cfg.INPUT.MASK_FORMAT='bitmask', and neither works.

I tried plain COCO detection, and the model works fine. I also tried replacing the RLE with random polygons, and that works great too, so the problem seems to be specific to RLE.
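One thing that may be worth checking here: in COCO's uncompressed RLE, `counts` holds alternating run lengths (background first), so the values should sum to height × width. The sketch below is a hypothetical helper, not part of detectron2 or pycocotools; the name `is_valid_uncompressed_rle` is made up for illustration. The first four values of the annotation above already add up to more than 266 × 266 = 70756, which suggests (but does not prove) that the list is a sequence of (start, length) pairs rather than run lengths.

```python
def is_valid_uncompressed_rle(segmentation):
    """Return True if counts look like alternating run lengths covering the image.

    Hypothetical sanity check: it only verifies types and the total pixel count.
    """
    height, width = segmentation["size"]
    counts = segmentation["counts"]
    return (
        isinstance(counts, list)
        and all(isinstance(c, int) and c >= 0 for c in counts)
        and sum(counts) == height * width
    )


# First four values copied from the annotation above.
suspect = {"size": [266, 266], "counts": [37430, 6, 37694, 10]}
# A tiny well-formed example: 2x3 image with 2 zeros, 3 ones, 1 zero.
valid = {"size": [2, 3], "counts": [2, 3, 1]}

print(is_valid_uncompressed_rle(suspect))
print(is_valid_uncompressed_rle(valid))
```

Running a check like this over every annotation in train.json before training could help separate a malformed dataset from a loader problem.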

Instructions To Reproduce the 🐛 Bug:

Full runnable code or full changes you made:



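from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.utils.logger import setup_logger

setup_logger()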
register_coco_instances("medecine_train", {}, "XXXXXXXXXX/dataset/train.json", "XXXXXXXXXX/dataset/DATA_train/images")
register_coco_instances("medecine_val", {}, "XXXXXXXXXX/dataset/val.json", "XXXXXXXXXX/dataset/DATA_val/images")

cfg = get_cfg()

cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3  # Mask R-CNN reads ROI_HEADS.NUM_CLASSES; RETINANET.NUM_CLASSES has no effect on this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml") 

cfg.DATASETS.TRAIN = ("medecine_train",)
cfg.DATASETS.TEST = ("medecine_val",) 

cfg.DATALOADER.NUM_WORKERS = 4
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.0001

cfg.SOLVER.WARMUP_ITERS = 50
cfg.SOLVER.MAX_ITER = 1000
cfg.SOLVER.STEPS = []

cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 16
cfg.TEST.EVAL_PERIOD = 100
#cfg.INPUT.MASK_FORMAT='bitmask'

Full logs or other relevant observations:

[05/27 15:11:30 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[05/27 15:11:30 d2.data.build]: Using training sampler TrainingSampler
[05/27 15:11:30 d2.data.common]: Serializing 13297 elements to byte tensors and concatenating them all ...
[05/27 15:11:30 d2.data.common]: Serialized dataset takes 10.90 MiB
[05/27 15:11:35 d2.engine.train_loop]: Starting training from iteration 0
ERROR [05/27 15:12:10 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "E:\anaconda3\envs\detectron2\lib\site-packages\torch\utils\data\dataloader.py", line 986, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "E:\anaconda3\envs\detectron2\lib\multiprocessing\queues.py", line 108, in get
    raise Empty
_queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "e:\XXXXXXX\model\detectron2\detectron2-main\detectron2\engine\train_loop.py", line 149, in train
    self.run_step()
  File "e:\XXXXXXX\model\detectron2\detectron2-main\detectron2\engine\defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "e:\XXXXXXX\model\detectron2\detectron2-main\detectron2\engine\train_loop.py", line 267, in run_step
    data = next(self._data_loader_iter)
  File "e:\XXXXXXX\model\detectron2\detectron2-main\detectron2\data\common.py", line 234, in __iter__
    for d in self.dataset:
  File "E:\anaconda3\envs\detectron2\lib\site-packages\torch\utils\data\dataloader.py", line 517, in __next__
    data = self._next_data()
  File "E:\anaconda3\envs\detectron2\lib\site-packages\torch\utils\data\dataloader.py", line 1182, in _next_data
    idx, data = self._get_data()
  File "E:\anaconda3\envs\detectron2\lib\site-packages\torch\utils\data\dataloader.py", line 1148, in _get_data
    success, data = self._try_get_data()
  File "E:\anaconda3\envs\detectron2\lib\site-packages\torch\utils\data\dataloader.py", line 999, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 14736) exited unexpectedly
[05/27 15:12:10 d2.engine.hooks]: Total training time: 0:00:35 (0:00:00 on hooks)
[05/27 15:12:10 d2.utils.events]: iter: 0 lr: N/A max_mem: 173M
Traceback (most recent call last):
  File "E:\anaconda3\envs\detectron2\lib\site-packages\torch\utils\data\dataloader.py", line 986, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "E:\anaconda3\envs\detectron2\lib\multiprocessing\queues.py", line 108, in get
    raise Empty
_queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 115, in <module>
    train()
  File "train.py", line 112, in train
    trainer.train()
  File "e:\XXXXXXX\model\detectron2\detectron2-main\detectron2\engine\defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "e:\XXXXXXX\model\detectron2\detectron2-main\detectron2\engine\train_loop.py", line 149, in train
    self.run_step()
  File "e:\XXXXXXX\model\detectron2\detectron2-main\detectron2\engine\defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "e:\XXXXXXX\model\detectron2\detectron2-main\detectron2\engine\train_loop.py", line 267, in run_step
    data = next(self._data_loader_iter)
  File "e:\XXXXXXX\model\detectron2\detectron2-main\detectron2\data\common.py", line 234, in __iter__
    for d in self.dataset:
  File "E:\anaconda3\envs\detectron2\lib\site-packages\torch\utils\data\dataloader.py", line 517, in __next__
    data = self._next_data()
  File "E:\anaconda3\envs\detectron2\lib\site-packages\torch\utils\data\dataloader.py", line 1182, in _next_data
    idx, data = self._get_data()
  File "E:\anaconda3\envs\detectron2\lib\site-packages\torch\utils\data\dataloader.py", line 1148, in _get_data
    success, data = self._try_get_data()
  File "E:\anaconda3\envs\detectron2\lib\site-packages\torch\utils\data\dataloader.py", line 999, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 14736) exited unexpectedly


## Expected behavior:

The training should run without any error.

## Environment:

sys.platform            win32
Python                  3.8.13 (default, Mar 28 2022, 06:59:08) [MSC v.1916 64 bit (AMD64)]
numpy                   1.19.5
detectron2              0.6 @e:\XXXXXX\model\detectron2\detectron2-main\detectron2
detectron2._C           not built correctly: DLL load failed while importing _C: The specified procedure could not be found.
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.8.2+cu102 @E:\anaconda3\envs\detectron2\lib\site-packages\torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   NVIDIA GeForce GTX 1080 Ti (arch=6.1)
Driver version          512.95
CUDA_HOME               C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2
Pillow                  8.1.2
torchvision             0.9.2+cu102 @E:\anaconda3\envs\detectron2\lib\site-packages\torchvision
torchvision arch flags  E:\anaconda3\envs\detectron2\lib\site-packages\torchvision\_C.pyd; cannot find cuobjdump
fvcore                  0.1.5.post20220512
iopath                  0.1.9
cv2                     4.5.1

PyTorch built with:

C++ Version: 199711
MSVC 192930040
Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
OpenMP 2019
CPU capability usage: NO AVX
CUDA Runtime 10.2
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
CuDNN 7.6.5
Magma 2.5.4
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=C:/w/b/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -DNDEBUG -DUSE_FBGEMM -DUSE_XNNPACK, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON,

github-actions[bot] commented 2 years ago

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template. The following information is missing: "Your Environment";

StudioTV commented 2 years ago

It seems like Detectron2 can't work with uncompressed RLE, so I had to convert it to the compressed version. It was a real struggle; this is how I did it:

Convert the RLE to a bitmask with this doc

Here is the code, in case it helps anyone:

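import numpy as np
import pycocotools.mask

# rle2mask: RLE-to-bitmask helper taken from the doc mentioned above (not defined here)
bitmask = rle2mask(string_RLE, shape=(img_h, img_w))
bitmask = np.asfortranarray(bitmask, dtype=np.uint8)  # encode() expects a Fortran-ordered uint8 array
encoded_ground_truth = pycocotools.mask.encode(bitmask)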

And then

                      "segmentation": 
                      {
                          "size": [img_h, img_w],
                          "counts" : encoded_ground_truth["counts"].decode('ascii')
                      },
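For anyone who would rather stay with uncompressed RLE, here is a minimal sketch of another route. It assumes the original `counts` are 1-indexed (start, length) pairs in column-major pixel order, which the sample annotation earlier in the thread suggests but nothing here confirms. `pairs_to_coco_counts` is a hypothetical helper written for illustration, not a detectron2 or pycocotools function. It rewrites the pairs as COCO-style alternating run lengths (background first) that sum to height × width.

```python
def pairs_to_coco_counts(pairs, height, width, one_indexed=True):
    """Convert flat [start, length, start, length, ...] into alternating run lengths.

    Assumes the pairs are sorted, non-overlapping, and use the same pixel
    order (column-major) that COCO expects.
    """
    total = height * width
    counts = []
    cursor = 0  # number of pixels already accounted for
    for start, length in zip(pairs[0::2], pairs[1::2]):
        begin = start - 1 if one_indexed else start
        if begin < cursor or begin + length > total:
            raise ValueError("runs overlap, are unsorted, or exceed the image")
        gap = begin - cursor
        if gap == 0 and counts:
            counts[-1] += length  # touches the previous foreground run: merge
        else:
            counts.extend([gap, length])  # background run, then foreground run
        cursor = begin + length
    if cursor < total:
        counts.append(total - cursor)  # trailing background
    return counts


# 2x3 image, one foreground run covering pixels 3..5 (1-indexed).
print(pairs_to_coco_counts([3, 3], height=2, width=3))
```

The resulting list could then be stored as `"counts"` next to `"size": [img_h, img_w]`. Whether a given detectron2 version accepts list-valued counts directly is worth verifying on a single annotation first.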

facebookresearch / detectron2

Error while using RLE for Instance Segmentation #4279
