broadinstitute / superurop-log


Issue running yolact minimal implementation #17

Open · eddogola opened this issue 2 months ago

eddogola commented 2 months ago

Environment: Colab, L4 GPU, High RAM. Batch sizes tried: 8 and 100. Using torchrun to run torch.distributed on CUDA devices.

With each batch size, I get a warning that os.fork() was called when it shouldn't be and that it could cause a deadlock. I'm guessing this is the source of the memory issue I'm facing.
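One mitigation I'm considering for the fork warning, just a minimal sketch and not what train.py currently does: have the DataLoader start its workers with spawn instead of fork, or drop to num_workers=0 so no child processes are forked at all. The tiny dataset below is a stand-in for the real COCO loader.

```
import torch
from torch.utils.data import DataLoader, TensorDataset


def main():
    # Stand-in dataset; in train.py this would be the COCO detection dataset.
    dataset = TensorDataset(torch.randn(32, 3, 64, 64), torch.zeros(32, dtype=torch.long))

    loader = DataLoader(
        dataset,
        batch_size=8,
        shuffle=True,
        num_workers=2,                    # or 0 to avoid worker processes entirely
        multiprocessing_context="spawn",  # workers start via spawn rather than os.fork()
    )

    for images, labels in loader:
        pass  # training step would go here


if __name__ == "__main__":
    main()  # the guard is required when workers use the spawn start method
```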

Stack trace:

```
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
...
    return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.77 GiB. GPU 0 has a total capacity of 22.17 GiB of which 1.53 GiB is free. Process 711401 has 20.63 GiB memory in use. Of the allocated memory 20.05 GiB is allocated by PyTorch, and 41.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-05-02 01:50:09,326] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 60980) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-02_01:50:09
  host      : fa61ee49c1aa
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 60980)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
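For the OOM itself, the error message points at fragmentation and at the sheer allocation size. A sketch of what I plan to try; the numbers are guesses, the model/optimizer below are toys just to show the shape of gradient accumulation, and the allocator setting only takes effect if it is in the environment before the first CUDA allocation (top of train.py, or exported in the shell before launching torchrun):

```
import os

# Allocator hint from the error message; set it before any CUDA allocation happens.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch
import torch.nn as nn

# Toy model/optimizer; the real script would use yolact's net and optimizer.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Halving the per-step batch (8 -> 4) and accumulating over 2 steps keeps the
# effective batch size at 8 while lowering peak GPU memory.
accum_steps = 2
for step in range(4):
    x, y = torch.randn(4, 16), torch.randint(0, 2, (4,))
    loss = nn.functional.cross_entropy(model(x), y) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```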

eddogola commented 2 months ago

I tried to get the original author's implementation to work, but to no avail -- I got the same os.fork() warning but a different error:

Stack trace:

```
/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py:313: UserWarning: 'downsample_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py:313: UserWarning: 'pred_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py:313: UserWarning: 'lat_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
  warnings.warn(
Initializing weights...
Begin training!
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
Traceback (most recent call last):
  File "/content/yolact/train.py", line 504, in <module>
    train()
  File "/content/yolact/train.py", line 270, in train
    for datum in data_loader:
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils.py", line 722, in reraise
    raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/yolact/data/coco.py", line 94, in __getitem__
    im, gt, masks, h, w, num_crowds = self.pull_item(index)
  File "/content/yolact/data/coco.py", line 158, in pull_item
    img, masks, boxes, labels = self.transform(img, masks, target[:, :4],
  File "/content/yolact/utils/augmentations.py", line 688, in __call__
    return self.augment(img, masks, boxes, labels)
  File "/content/yolact/utils/augmentations.py", line 55, in __call__
    img, masks, boxes, labels = t(img, masks, boxes, labels)
  File "/content/yolact/utils/augmentations.py", line 309, in __call__
    mode = random.choice(self.sample_options)
  File "mtrand.pyx", line 936, in numpy.random.mtrand.RandomState.choice
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (6,) + inhomogeneous part.
```
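This last ValueError looks like the NumPy >= 1.24 behavior change where a ragged sequence can no longer be implicitly packed into a regular array, which is exactly what np.random.choice attempts on the mixed None/tuple options in augmentations.py. A sketch of the usual workaround; the options below are only my approximation of what the crop transform defines, and the point is simply to draw an index instead of the element:

```
import numpy as np

# Approximation of the ragged options (shape (6,), mixing None and tuples),
# which NumPy >= 1.24 refuses to coerce into a regular array.
sample_options = (
    None,
    (0.1, None),
    (0.3, None),
    (0.7, None),
    (0.9, None),
    (None, None),
)

# np.random.choice(sample_options) raises the "inhomogeneous shape" ValueError,
# so draw a random index and look the option up instead.
mode = sample_options[np.random.randint(len(sample_options))]
print(mode)
```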

eddogola commented 2 months ago

I'm at my wits' end, but I think this is a Colab-specific problem. I tried running locally on the MPS device, but it takes forever.
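For reference, this is the kind of device selection I mean for the local run: a standard sketch that prefers CUDA, then MPS, then CPU, with nothing yolact-specific about it.

```
import torch

# Prefer CUDA, then Apple's MPS backend, then plain CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"training on {device}")
```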