antiHUMANDesigns / dall-e-mega


Backend restart loop when using container #4

Open soulctcher opened 2 years ago

soulctcher commented 2 years ago

Interface comes up fine, but the backend just keeps restarting. The following is what is displayed:


  File "app.py", line 8, in <module>
    import jax
  File "/usr/local/lib/python3.8/dist-packages/jax/__init__.py", line 116, in <module>
    from .experimental.maps import soft_pmap as soft_pmap
  File "/usr/local/lib/python3.8/dist-packages/jax/experimental/maps.py", line 26, in <module>
    from .. import numpy as jnp
  File "/usr/local/lib/python3.8/dist-packages/jax/numpy/__init__.py", line 19, in <module>
    from . import fft as fft
  File "/usr/local/lib/python3.8/dist-packages/jax/numpy/fft.py", line 17, in <module>
    from jax._src.numpy.fft import (
  File "/usr/local/lib/python3.8/dist-packages/jax/_src/numpy/fft.py", line 19, in <module>
    from jax import lax
  File "/usr/local/lib/python3.8/dist-packages/jax/lax/__init__.py", line 332, in <module>
    from jax._src.lax.fft import (
  File "/usr/local/lib/python3.8/dist-packages/jax/_src/lax/fft.py", line 145, in <module>
    xla.backend_specific_translations['cpu'][fft_p] = pocketfft.pocketfft
AttributeError: module 'jaxlib.pocketfft' has no attribute 'pocketfft'```
soulctcher commented 2 years ago

To add more to this for those who don't already know: it looks like the container needs to match some aspects of your host system. I got past the jaxlib error itself simply by changing the FROM entry in backend/Dockerfile to a CUDA base image that matches my driver revision. I've got a 3080 on driver version 516.59, so the FROM entry now looks like this: FROM nvidia/cuda:11.7.0-devel-ubuntu22.04

Additionally, the jax install line needs to match. In this case I'm set to: RUN pip3 install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
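Putting the two changes together, the relevant part of backend/Dockerfile ends up looking roughly like this (only the FROM and pip3 lines are the actual changes; treat it as a sketch and pick the nvidia/cuda tag that matches your own driver):

```
# Base image chosen to match the host NVIDIA driver (516.59 -> CUDA 11.7 here);
# use whichever nvidia/cuda tag corresponds to your driver revision.
FROM nvidia/cuda:11.7.0-devel-ubuntu22.04

# Install a CUDA-enabled jax/jaxlib pair from the official release index so the
# jax and jaxlib versions stay in sync (the pocketfft AttributeError above is
# typically a symptom of a jax/jaxlib version mismatch).
RUN pip3 install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
```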

All of the above fixed what I was running into, but now I've hit the next issue in the chain: an OOM error:

```
wandb: Downloading large artifact mega-1:latest, 9873.84MB. 7 files... Done. 0:0:15.9
2022-07-15 01:47:35.718287: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 384.00MiB (rounded to 402653184)requested by op 
2022-07-15 01:47:35.720285: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:491] ****************************************************_*********************************************__
2022-07-15 01:47:35.721037: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2129] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 402653184 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:    16.4KiB
              constant allocation:        64B
        maybe_live_out allocation:    5.25GiB
     preallocated temp allocation:  304.04MiB
  preallocated temp fragmentation:       372B (0.00%)
                 total allocation:    5.55GiB
```

Since all of this is new to me, I'm not really sure what the deal is. As mentioned, I've got a 3080, so there should be plenty of memory on the graphics card. My system also has 32GB of RAM, so it would be odd for it to be unable to allocate memory on that side either. In any case, if anyone has any ideas, I'm down to try stuff. Thanks in advance.
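One knob that may be relevant here (untested for this setup, so just a sketch) is XLA's GPU memory preallocation: by default JAX claims a large fraction of GPU memory up front, and that behaviour can be adjusted with environment variables, e.g. in backend/Dockerfile:

```
# Untested sketch: tune how JAX/XLA claims GPU memory.
# Disable the default up-front preallocation so memory is allocated on demand...
ENV XLA_PYTHON_CLIENT_PREALLOCATE=false
# ...or alternatively cap the preallocated fraction of GPU memory
# (0.8 is an arbitrary example value; tune as needed).
# ENV XLA_PYTHON_CLIENT_MEM_FRACTION=0.8
```

Whether that actually avoids this particular RESOURCE_EXHAUSTED error I can't say yet.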