@jaychia In that link I get a GPU OOM, but I can't find any evidence that multiple tasks are being run in parallel. I see a single model init and a single UDF call before the OOM.
GenerateImageFromTextGPU.__init__()
using device cuda
downloading tokenizer params
intializing TextTokenizer
downloading encoder params
initializing DalleBartEncoder
downloading decoder params
initializing DalleBartDecoder
downloading detokenizer params
initializing VQGanDetokenizer
GenerateImageFromTextGPU.__call__(['Photo pour Japanese pagoda and old house in Kyoto at twilight - image libre de droit'])
ERROR:daft.udf:Encountered error when running user-defined function GenerateImageFromTextGPU
---------------------------------------------------------------------------
OutOfMemoryError Traceback (most recent call last)
<ipython-input-9-f70ecca7159b> in <module>
34
35 resource_request = ResourceRequest(num_gpus=1) if USE_GPU else None
---> 36 images_df.with_column(
37 "generated_image",
38 GenerateImageFromTextGPU(images_df["TEXT"]),
(... 30 intermediate frames omitted ...)
/usr/local/lib/python3.8/dist-packages/torch/functional.py in einsum(*args)
376 # the path for contracting 0 or 1 time(s) is already optimized
377 # or the user has disabled using opt_einsum
--> 378 return _VF.einsum(equation, operands) # type: ignore[attr-defined]
379
380 path = None
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 13.96 GiB already allocated; 3.88 MiB free; 13.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
The dataframe also only has a single partition.
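For context, the notebook applies the UDF roughly like this (reconstructed from the truncated traceback above; the import path and the keyword for passing the resource request are from memory and may differ):

```python
from daft.resource_request import ResourceRequest

# Ask Daft for a single GPU when running the stateful UDF.
resource_request = ResourceRequest(num_gpus=1) if USE_GPU else None

images_df = images_df.with_column(
    "generated_image",
    GenerateImageFromTextGPU(images_df["TEXT"]),
    resource_request=resource_request,
)
```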
Do you have more info about the resource request violation?
Ah, I could be wrong then - I had assumed that was the cause, since nothing else changed that would plausibly trigger this issue. Feel free to assign the issue back to me and I can investigate further.
@jaychia I'll run with it for a bit longer :)
It does seem to be the multithreading PyRunner PR, though. The notebook works on e73ddcd and breaks on the next nightly, 342535e.
Filed here:
For now we'll just modify the notebook to do a weight predownload.
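The predownload would look something like this (a sketch; it assumes the min_dalle package caches weights under models_root when the model is constructed, and the exact constructor arguments may differ):

```python
from min_dalle import MinDalle  # assumed package layout

# Construct the model once up front so the tokenizer/encoder/decoder/
# detokenizer weights get downloaded and cached before the Daft UDF runs.
MinDalle(
    models_root="./pretrained",  # hypothetical cache directory
    is_mega=False,
    is_reusable=False,
)
```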
@jaychia I couldn't get the predownloading to work within a single notebook run. Could you try giving it a shot?
The only times I've gotten it working involve refreshing the notebook (:thinking:).
Things I've tried for predownloading that didn't work: running del MinDalle; del torch; import gc; gc.collect() in the middle of the notebook.
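Roughly, that cleanup attempt looks like the following; torch.cuda.empty_cache() is included here as an extra step worth trying (whether it was part of the original attempt is an assumption), since garbage collection alone does not return PyTorch's cached blocks to the driver:

```python
import gc
import torch

# `model` is a placeholder for however the MinDalle instance was named.
del model
gc.collect()

# gc only frees the Python objects; PyTorch's caching allocator still
# holds the GPU memory until explicitly asked to release it.
torch.cuda.empty_cache()
```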
Describe the bug
See notebook: https://colab.research.google.com/github/Eventual-Inc/Daft/blob/main/tutorials/text_to_image/text_to_image_generation.ipynb#scrollTo=b500e7f5
Our multithreaded PyRunner may not be respecting GPU/CPU resource requests, and may be running multiple tasks in parallel when it should not be able to.
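As a purely illustrative sketch (not Daft's actual scheduler code), this is the kind of admission control a resource-respecting runner would apply, so that two tasks each requesting num_gpus=1 can never run concurrently on a single-GPU machine:

```python
import threading

TOTAL_GPUS = 1  # e.g. a single-GPU Colab instance
_gpu_slots = threading.BoundedSemaphore(TOTAL_GPUS)

def run_task(task_fn, num_gpus: int = 0):
    # Simplified: assumes a GPU task needs exactly one GPU. A task that
    # requests a GPU must hold a slot for its entire duration, so GPU
    # tasks serialize; CPU-only tasks are not gated here.
    if num_gpus > 0:
        with _gpu_slots:
            return task_fn()
    return task_fn()
```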