Describe the bug
When a Resource Provider (RP) is offered a job to compute and does not have enough resources available, the job fails. The problem is that the CLI does not notify the user that the job was unsuccessful, even though it is obvious the job failed, since no output is produced.
Reproduction
Run a job on an RP that has most of its VRAM allocated to other tasks. A sketch for simulating this condition follows.
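One way to simulate a busy RP, assuming PyTorch is available on the host and GPU 0 is the target device, is to pin most of the free VRAM from a separate process before submitting the job. The occupy_vram helper and the 0.9 fraction below are illustrative choices for this sketch, not part of Lilypad:

# occupy_vram.py - hypothetical helper to simulate a busy Resource Provider.
# Pins most of the GPU's free VRAM so a subsequently submitted job hits OOM.
import torch

def occupy_vram(fraction: float = 0.9) -> torch.Tensor:
    free, _total = torch.cuda.mem_get_info()  # free/total VRAM in bytes on the current device
    n_bytes = int(free * fraction)
    # One large uint8 tensor is enough; it stays allocated while referenced.
    return torch.empty(n_bytes, dtype=torch.uint8, device="cuda")

if __name__ == "__main__":
    blocker = occupy_vram()
    print(f"Holding {blocker.numel() / 2**30:.2f} GiB of VRAM; submit the job now.")
    input("Press Enter to release the VRAM and exit...")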
Logs
2024-12-02 23:15:32,414 - INFO - Starting SDXL lightweight script
2024-12-02 23:15:32,414 - INFO - Using prompt: "CLASSIFIED"
2024-12-02 23:15:32,414 - INFO - Loading SDXL-Turbo pipeline
2024-12-02 23:15:32,414 - INFO - Using pre-downloaded model
Couldn't connect to the Hub: (MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /api/models/stabilityai/sdxl-turbo (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x77c580c88550>: Failed to resolve \'huggingface.co\' ([Errno -3] Temporary failure in name resolution)"))'), '(Request ID: 5da4d690-df80-49b9-b404-c1cd694a4da6)').
Will try to load from local cache.
Loading pipeline components...: 0%| | 0/7 [00:00<?, ?it/s]
Loading pipeline components...: 29%|██▊ | 2/7 [00:00<00:00, 7.36it/s]
Loading pipeline components...: 57%|█████▋ | 4/7 [00:00<00:00, 4.78it/s]
Loading pipeline components...: 71%|███████▏ | 5/7 [00:01<00:00, 3.95it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:01<00:00, 5.86it/s]
2024-12-02 23:15:33,808 - INFO - Using device: cuda
2024-12-02 23:15:42,058 - ERROR - An error occurred: CUDA out of memory. Tried to allocate 14.00 MiB. GPU 0 has a total capacity of 11.60 GiB of which 126.50 MiB is free. Process 261350 has 8.66 MiB memory in use. Process 2850306 has 485.57 MiB memory in use. Process 3710576 has 6.74 GiB memory in use. Of the allocated memory 6.30 GiB is allocated by PyTorch, and 253.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Traceback (most recent call last):
File "/workspace/run_sdxl.py", line 37, in main
pipe = pipe.to(device)
File "/usr/local/lib/python3.9/site-packages/diffusers/pipelines/pipeline_utils.py", line 431, in to
module.to(device, dtype)
File "/usr/local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2905, in to
return super().to(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1174, in to
return self._apply(convert)
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 780, in _apply
module._apply(fn)
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 780, in _apply
module._apply(fn)
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 780, in _apply
module._apply(fn)
[Previous line repeated 3 more times]
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 805, in _apply
param_applied = fn(param)
File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1160, in convert
return t.to(
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB. GPU 0 has a total capacity of 11.60 GiB of which 126.50 MiB is free. Process 261350 has 8.66 MiB memory in use. Process 2850306 has 485.57 MiB memory in use. Process 3710576 has 6.74 GiB memory in use. Of the allocated memory 6.30 GiB is allocated by PyTorch, and 253.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
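The traceback shows the job dying inside pipe.to(device) before any output is produced, so the clearest failure signal available to the CLI is the script's exit status. Below is a minimal sketch of a pre-flight check that a job script like run_sdxl.py could run so the job fails fast with a nonzero exit code the CLI can surface; the 8 GiB threshold is an assumption for SDXL-Turbo, not a measured requirement:

# Hypothetical pre-flight check: abort with a nonzero exit status before
# loading the pipeline if free VRAM is clearly insufficient, so the caller
# (e.g. the lilypad CLI) has an explicit failure it can report to the user.
import sys
import torch

REQUIRED_GIB = 8.0  # assumed minimum for SDXL-Turbo inference, not a measured figure

def check_vram(required_gib: float = REQUIRED_GIB) -> None:
    if not torch.cuda.is_available():
        sys.exit("No CUDA device available; aborting before pipeline load.")
    free, _total = torch.cuda.mem_get_info()
    free_gib = free / 2**30
    if free_gib < required_gib:
        # sys.exit with a string prints it to stderr and exits with status 1.
        sys.exit(f"Only {free_gib:.2f} GiB of VRAM free; need ~{required_gib} GiB.")

check_vram()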
Screenshots
No response
System Info
Resource Providers are running Lilypad built from source (main branch, Dec 2nd).
Severity
Annoyance