Open soulctcher opened 2 years ago
To add some context for anyone who doesn't already know: it looks like the CUDA base image has to match some aspects of your system. I got past the jaxlib error itself simply by changing the FROM entry in backend>Dockerfile to one that matches my driver version. I've got a 3080 on driver version 516.59. The FROM entry now looks like this:
FROM nvidia/cuda:11.7.0-devel-ubuntu22.04
Additionally, the jax entry needs to match. In this case, I'm set to:
RUN pip3 install --upgrade "jax[cuda]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
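One way to confirm the rebuilt image actually picked up a GPU-enabled jax is to ask jax what it sees from inside the backend container. This is only a minimal sketch: the helper name `describe_jax_backend` is made up for illustration, and it assumes nothing beyond `jax.default_backend()` and `jax.devices()`.

```python
import importlib.util


def describe_jax_backend():
    """Return a short description of what jax can see (hypothetical helper)."""
    # Avoid a hard failure if this is run somewhere jax is not installed.
    if importlib.util.find_spec("jax") is None:
        return "jax is not installed in this environment"
    import jax

    # default_backend() names the platform jax picked (for example cpu or gpu);
    # devices() lists the devices available on that platform.
    return f"backend={jax.default_backend()} devices={jax.devices()}"


print(describe_jax_backend())
```

If the reported backend is the CPU, the CUDA image and the jax wheel probably still don't line up with the driver.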
All of the above fixed what I was running into, but now I've hit another issue further down the chain, an OOM error:
wandb: Downloading large artifact mega-1:latest, 9873.84MB. 7 files... Done. 0:0:15.9
2022-07-15 01:47:35.718287: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 384.00MiB (rounded to 402653184)requested by op
2022-07-15 01:47:35.720285: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:491] ****************************************************_*********************************************__
2022-07-15 01:47:35.721037: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2129] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 402653184 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 16.4KiB
constant allocation: 64B
maybe_live_out allocation: 5.25GiB
preallocated temp allocation: 304.04MiB
preallocated temp fragmentation: 372B (0.00%)
total allocation: 5.55GiB
All of this is new to me, so I'm not really sure what the deal is. As mentioned, I've got a 3080, so there should be plenty of memory on the graphics card. My system has 32GB of RAM, so it would be odd for it to fail to allocate on that side either. In any case, if anyone has any ideas, I'm down to try stuff. Thanks in advance.
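As a sanity check on the numbers in the log above, the sizes can be converted to common units. This is just arithmetic on the logged values and assumes nothing about the card or the model. The parameter and constant allocations (16.4KiB and 64B) are left out because they are too small to affect the rounded total.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

# Values copied from the log output above.
failed_request_bytes = 402_653_184   # RESOURCE_EXHAUSTED line
maybe_live_out_gib = 5.25            # BufferAssignment stats
preallocated_temp_mib = 304.04       # BufferAssignment stats

# The failed request expressed in MiB.
failed_request_mib = failed_request_bytes / MIB

# Sum of the two large allocations, expressed in GiB.
total_gib = maybe_live_out_gib + preallocated_temp_mib * MIB / GIB

print(f"failed request: {failed_request_mib:.2f} MiB")
print(f"total allocation: {total_gib:.2f} GiB")
```

The results agree with the log: the failed request is the 384.00MiB the allocator reports, and the two large allocations add up to the 5.55GiB total.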
Interface comes up fine, but the backend just keeps restarting. The following is what is displayed: