eddogola opened this issue 2 months ago
I tried to get the original author's implementation to work, but to no avail -- I got the same os.fork() warning but a different error:
```
/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py:313: UserWarning: 'downsample_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py:313: UserWarning: 'pred_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/jit/_recursive.py:313: UserWarning: 'lat_layers' was found in ScriptModule constants, but it is a non-constant submodule. Consider removing it.
warnings.warn(
Initializing weights...
Begin training!
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
self.pid = os.fork()
Traceback (most recent call last):
File "/content/yolact/train.py", line 504, in
I'm at my wits' end, but I think it's a Colab-specific problem. I tried running locally with the `mps` device, but it takes forever.
- Environment: Colab, L4 GPU, High RAM
- Batch sizes: 8, 100
- Using torchrun to launch torch.distributed on CUDA devices.
For each batch size, I get a warning that os.fork() was called (when it shouldn't be) and that it could lead to a deadlock. I'm guessing this is the source of the memory issue I'm facing.
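The workaround I'm planning to try for the fork warning is forcing the spawn start method for the DataLoader workers, so that os.fork() never gets called from the multithreaded parent process. This is just a sketch of what I mean (the dummy dataset is a placeholder, not yolact's loader):

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

# Force worker processes to be spawned instead of forked.
# force=True in case a start method has already been set elsewhere.
mp.set_start_method("spawn", force=True)

# Dummy dataset just to illustrate the DataLoader-level option;
# the real training dataset would go here instead.
dataset = TensorDataset(torch.zeros(16, 3, 8, 8), torch.zeros(16))

loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=2,
    multiprocessing_context="spawn",  # per-loader alternative to set_start_method
)
```

I haven't confirmed yet whether that also makes the memory error go away.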
Stack trace:
```
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
...
    return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.77 GiB. GPU 0 has a total capacity of 22.17 GiB of which 1.53 GiB is free. Process 711401 has 20.63 GiB memory in use. Of the allocated memory 20.05 GiB is allocated by PyTorch, and 41.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[2024-05-02 01:50:09,326] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 60980) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-02_01:50:09
host : fa61ee49c1aa
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 60980)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
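Following the hint in the OOM message, the other thing I plan to try is enabling expandable segments before anything touches the GPU, plus dropping the batch size. Roughly like this at the very top of the training script (the only real requirement is that it runs before the first CUDA allocation):

```python
import os

# Enable expandable segments in the caching allocator, as suggested by the
# OOM message, to reduce fragmentation-related out-of-memory errors.
# Setting it before importing torch is the safest ordering.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch
print(torch.cuda.is_available())
```

If that isn't enough, I'll just keep reducing the batch size.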