It seems to be a (CPU) memory overflow. Can you try allocating more memory?
Thank you for the answer. Do you know how much memory I need at least?
We tested this code on a machine with 188 GB of memory.
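A quick way to check how much RAM the runtime actually exposes before starting a run is sketched below; it uses psutil, which is an assumption and not part of the BPR tooling.

import psutil

# Report the runtime's total and currently available RAM in GB.
mem = psutil.virtual_memory()
print(f"total: {mem.total / 1024**3:.1f} GB, available: {mem.available / 1024**3:.1f} GB")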
Is it possible to get around the RAM requirement, maybe by saving the patches to disk instead of keeping them in RAM?
I think so, but we don't have that code. You can give it a try.
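One possible direction, sketched under the assumption that each patch is a plain NumPy array (the helper names below are hypothetical and not from the BPR code): write every patch to its own .npy file and load it back only when it is actually merged.

import os
import numpy as np

def save_patch(patch, out_dir, idx):
    # Persist one patch to disk so it does not have to stay in RAM.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"patch_{idx:08d}.npy")
    np.save(path, patch)
    return path

def load_patch(path):
    # Read a single patch back only when it is needed.
    return np.load(path)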
I tried to run inference with a smaller sample of images now and got the following error:
""" Traceback (most recent call last): File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker result = (True, func(*args, **kwds)) File "./tools/merge_patches.py", line 47, in run_inst newmask_refined[y:y+h, x:x+w] += patch_mask ValueError: operands could not be broadcast together with shapes (11,64) (64,64) (11,64) """
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./tools/merge_patches.py", line 111, in
Can you help me with that? How is it possible that the shapes differ?
You can check whether the index is out of bounds, e.g. y < 0 or y + h > the height of the image. It may be because the patch does not correspond to the image, or because the image is smaller than 64x64.
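For reference, a minimal sketch of such a bounds check, assuming newmask_refined and patch_mask are 2-D NumPy arrays as in the traceback; the clipping itself is an assumption, not the repository's fix.

import numpy as np

def paste_patch(newmask_refined, patch_mask, x, y):
    # Clip the target window to the image so both operands always have the same shape.
    H, W = newmask_refined.shape[:2]
    h, w = patch_mask.shape[:2]
    y0, y1 = max(y, 0), min(y + h, H)
    x0, x1 = max(x, 0), min(x + w, W)
    if y0 >= y1 or x0 >= x1:
        return  # the patch lies entirely outside the image
    # Crop the patch by the same offsets before adding it.
    newmask_refined[y0:y1, x0:x1] += patch_mask[y0 - y:y1 - y, x0 - x:x1 - x]

With shapes (11,64) vs (64,64), this would crop the patch to its first 11 rows instead of raising the broadcast error.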
Thank you very much!
Hello, when I try to run inference with the pre-trained model, I get an error but I don't understand the reason. The code runs in a Colab notebook with 25 GB of RAM; could this be the problem? Thanks in advance!
command:
IOU_THRESH=0.25 \
IMG_DIR=/content/drive/MyDrive/datasets/coco/val2017 \
GT_JSON=/content/drive/MyDrive/datasets/coco/annotations/instances_val2017.json \
GPUS=1 \
sh tools/inference_coco.sh \
  configs/bpr/hrnet18s_128.py \
  /content/drive/MyDrive/BPR/hrnet18s_coco-c172955f.pth \
  /content/drive/MyDrive/Bmask_coco_instances_json_results/coco_instances_results.json \
  bmask_coco_instances_results_refined
output:
DATA_ROOT=bmask_coco_instances_results_refined/patches bash ./tools/dist_test_float.sh configs/bpr/hrnet18s_128.py /content/drive/MyDrive/BPR/hrnet18s_coco-c172955f.pth 1 --out bmask_coco_instances_results_refined/refined.pkl
/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py:186: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects `--local_rank` argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  FutureWarning,
2021-11-14 16:00:38,626 - mmseg - INFO - Loaded 265813 images
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
load checkpoint from local path: /content/drive/MyDrive/BPR/hrnet18s_coco-c172955f.pth
[>>] 265813/265813, 23.2 task/s, elapsed: 11473s, ETA: 0s
tcmalloc: large alloc 1453056000 bytes == 0x55764ff10000 @ 0x7fcd6cfe02a4 0x5573dd2994cc 0x5573dd3551a2 0x5573dd34d7df 0x5573dd34d8a8 0x5573dd34f21b 0x5573dd34e307 0x5573dd23c255 0x5573dd34f3ac 0x5573dd34dfda 0x5573dd34dce7 0x5573dd34cf3c 0x5573dd23e992 0x5573dd3b1838 0x5573dd29e7da 0x5573dd31118e 0x5573dd30a9ee 0x5573dd29e271 0x5573dd29e698 0x5573dd30cfe4 0x5573dd30a9ee 0x5573dd29dbda 0x5573dd30fd00 0x5573dd30a9ee 0x5573dd29dbda 0x5573dd30b915 0x5573dd30a9ee 0x5573dd29dbda 0x5573dd30b915 0x5573dd29dafa 0x5573dd30b915
tcmalloc: large alloc 2179604480 bytes == 0x5576c476c000 @ 0x7fcd6cfe02a4 0x5573dd2994cc 0x5573dd3551a2 0x5573dd34d7df 0x5573dd34d8a8 0x5573dd34f21b 0x5573dd34e307 0x5573dd23c255 0x5573dd34f3ac 0x5573dd34dfda 0x5573dd34dcc2 0x5573dd34cf3c 0x5573dd23e992 0x5573dd3b1838 0x5573dd29e7da 0x5573dd31118e 0x5573dd30a9ee 0x5573dd29e271 0x5573dd29e698 0x5573dd30cfe4 0x5573dd30a9ee 0x5573dd29dbda 0x5573dd30fd00 0x5573dd30a9ee 0x5573dd29dbda 0x5573dd30b915 0x5573dd30a9ee 0x5573dd29dbda 0x5573dd30b915 0x5573dd29dafa 0x5573dd30b915
tcmalloc: large alloc 3269427200 bytes == 0x55777400c000 @ 0x7fcd6cfe02a4 0x5573dd2994cc 0x5573dd3551a2 0x5573dd34d7df 0x5573dd34d8a8 0x5573dd34f21b 0x5573dd34e307 0x5573dd23c255 0x5573dd34f3ac 0x5573dd34dfda 0x5573dd34dd31 0x5573dd34cf3c 0x5573dd23e992 0x5573dd3b1838 0x5573dd29e7da 0x5573dd31118e 0x5573dd30a9ee 0x5573dd29e271 0x5573dd29e698 0x5573dd30cfe4 0x5573dd30a9ee 0x5573dd29dbda 0x5573dd30fd00 0x5573dd30a9ee 0x5573dd29dbda 0x5573dd30b915 0x5573dd30a9ee 0x5573dd29dbda 0x5573dd30b915 0x5573dd29dafa 0x5573dd30b915
tcmalloc: large alloc 4904157184 bytes == 0x5578798a8000 @ 0x7fcd6cfe02a4 0x5573dd2994cc 0x5573dd3551a2 0x5573dd34d7df 0x5573dd34d8a8 0x5573dd34f21b 0x5573dd34e307 0x5573dd23c255 0x5573dd34f3ac 0x5573dd34dfda 0x5573dd34dd0c 0x5573dd34cf3c 0x5573dd23e992 0x5573dd3b1838 0x5573dd29e7da 0x5573dd31118e 0x5573dd30a9ee 0x5573dd29e271 0x5573dd29e698 0x5573dd30cfe4 0x5573dd30a9ee 0x5573dd29dbda 0x5573dd30fd00 0x5573dd30a9ee 0x5573dd29dbda 0x5573dd30b915 0x5573dd30a9ee 0x5573dd29dbda 0x5573dd30b915 0x5573dd29dafa 0x5573dd30b915
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 3020) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./tools/test_float.py FAILED
Failures:
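Exit code -9 usually means the worker process was killed by the Linux OOM killer, which matches the memory-overflow diagnosis above. One possible workaround (a hypothetical sketch, not part of the BPR repository) is to split the detection results JSON into smaller chunks and run the refinement on each chunk separately, so far fewer patches are held in memory at once.

import json

def split_results(results_json, n_chunks, prefix="coco_instances_results_part"):
    # COCO-style results files are a flat list of per-instance detection dicts.
    with open(results_json) as f:
        results = json.load(f)
    chunk_size = (len(results) + n_chunks - 1) // n_chunks
    paths = []
    for i in range(n_chunks):
        chunk = results[i * chunk_size:(i + 1) * chunk_size]
        path = f"{prefix}_{i}.json"
        with open(path, "w") as f:
            json.dump(chunk, f)
        paths.append(path)
    return paths

The refined outputs of the individual chunks would then need to be merged back into a single results file afterwards.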