devforth / gpt-j-6b-gpu-docker


Model tokenizer created 5 secs Killed #2

Open pedrombmachado opened 1 year ago

pedrombmachado commented 1 year ago

Dear Devforth, I am currently getting the following error:

$ docker run -p 8081:8080 --gpus all --rm -it devforth/gpt-j-6b-gpu
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 2.09MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.05MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 3.12MB/s]
⌚ Model tokenizer created    5 secs
Killed
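A bare `Killed` line like this usually means the kernel OOM killer terminated the process. One generic way to confirm (not from this thread) is to check the kernel log right after the container dies:

```shell
# Look for OOM-killer records in the kernel log; if the container's
# python process appears here, the host ran out of memory.
sudo dmesg | grep -iE 'out of memory|oom-kill' | tail -n 5
```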

These are the details of my GPU

$ nvidia-smi
Fri Jan 27 09:23:26 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:13:00.0 Off |                  Off |
| N/A   16C    P8    14W / 150W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

These are the details of the card:

13:00.0 VGA compatible controller: NVIDIA Corporation GM204GL [Tesla M60] (rev a1)

OS:

Ubuntu 22.04 LTS
Taiiwo commented 1 year ago

Getting the same issue. It looks like it's running out of RAM: the process is being killed by something, maybe an OS config issue. Interestingly, it does the same thing if you run it without any GPUs, so maybe it isn't finding the GPU, falls back to the CPU, and then runs out of RAM? Not sure how to fix this.

We are both using different CUDA versions, both newer than the one in the readme — could that be it? I'm on CUDA 11.7.

Could try downgrading to 11.6 and see.
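Before blaming CUDA, it may be worth checking whether the host simply has too little memory to load the model. GPT-J-6B is roughly 24 GB of weights in fp32 (6B parameters × 4 bytes) — an estimate, not a figure from this thread — and loading can transiently need that much host RAM. A rough check:

```shell
# Estimate whether free RAM plus free swap covers the ~24 GB that
# loading GPT-J-6B in fp32 can transiently require (rough estimate).
avail_kb=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)
swap_kb=$(awk '/^SwapFree/ {print $2}' /proc/meminfo)
total_gb=$(( (avail_kb + swap_kb) / 1024 / 1024 ))
echo "Available RAM+swap: ${total_gb} GiB"
if [ "$total_gb" -lt 24 ]; then
  echo "Probably not enough memory: the OOM killer will SIGKILL the process (shown as 'Killed')."
fi
```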

CoWayger commented 1 year ago

Same issue. Killed means you ran out of free RAM — add more RAM or use a swapfile on fast storage. Adding swap solved the Killed issue for me, but then I ran into a PyTorch compatibility problem with the A4000.
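For anyone else hitting this: one standard way to add a swapfile on Ubuntu (the 32G size is a guess to cover model loading, not a figure from this thread) is:

```shell
# Create and enable a 32 GiB swapfile (run as root).
# fallocate is fast on ext4/xfs; fall back to dd on filesystems
# that don't support it.
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Persist across reboots:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```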