deep-floyd / IF


vram requirements #66

Open kanttouchthis opened 1 year ago

kanttouchthis commented 1 year ago

The readme lists a minimum of 16GB of VRAM without the stable-x4 upscaler and 24GB with it. However, you can run it with the stable-x4 upscaler on as little as 6GB of VRAM by using sequential offload on the first stage/text encoder (in fp16) and CPU offload on the second and third stages. You can also run all three stages with CPU offload on 16GB (maybe less). You do need sufficient system RAM, though.

  import torch
  from diffusers import DiffusionPipeline, IFPipeline, IFSuperResolutionPipeline

  # stage 1: 64x64 base model (also holds the T5 text encoder)
  stage_1 = IFPipeline.from_pretrained(
      "DeepFloyd/IF-I-XL-v1.0",
      variant="fp16",
      torch_dtype=torch.float16,
  )
  # stage 2: IF super-resolution model (no text encoder; prompt embeds come from stage 1)
  stage_2 = IFSuperResolutionPipeline.from_pretrained(
      "DeepFloyd/IF-II-L-v1.0",
      text_encoder=None,
      variant="fp16",
      torch_dtype=torch.float16,
  )
  # stage 3: Stable Diffusion x4 upscaler
  stage_3 = DiffusionPipeline.from_pretrained(
      "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
  )

  # pick one of the two offload configurations below

  # ~16 GB of VRAM: model-level CPU offload on all three stages
  stage_1.enable_model_cpu_offload()
  stage_2.enable_model_cpu_offload()
  stage_3.enable_model_cpu_offload()

  # ~6 GB of VRAM: sequential offload on stage 1, model-level offload on stages 2/3
  stage_1.enable_sequential_cpu_offload()
  stage_2.enable_model_cpu_offload()
  stage_3.enable_model_cpu_offload()
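
For reference, generation with these pipelines then looks roughly like the standard three-stage example from the diffusers docs; the prompt, seed, and output filename below are just placeholders:

  prompt = "a photo of a red panda wearing a top hat"  # placeholder prompt
  generator = torch.manual_seed(0)

  # encode the prompt once with stage 1's T5 text encoder
  prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

  # stage 1: 64x64 base image
  image = stage_1(
      prompt_embeds=prompt_embeds,
      negative_prompt_embeds=negative_embeds,
      generator=generator,
      output_type="pt",
  ).images

  # stage 2: upscale to 256x256
  image = stage_2(
      image=image,
      prompt_embeds=prompt_embeds,
      negative_prompt_embeds=negative_embeds,
      generator=generator,
      output_type="pt",
  ).images

  # stage 3: x4 upscale and save
  image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images
  image[0].save("if_result.png")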

I tested this on PyTorch 2.0.0+cu118, using torch.cuda.set_per_process_memory_fraction() to limit the amount of VRAM torch can use. Sequential offload significantly slows down the first stage, but that's better than not being able to run it at all.
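
In case anyone wants to reproduce the VRAM cap, the limiting call is roughly this; the fraction here is only an example (simulating a ~6 GB budget on a 24 GB card):

  # cap the CUDA caching allocator at a fraction of total VRAM on device 0
  # (example value: ~6 GB out of 24 GB; adjust for your own GPU)
  torch.cuda.set_per_process_memory_fraction(6 / 24, device=0)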

Anatoly03 commented 1 year ago

I bought PC components two days ago (with the plan of going for SD, now that this is out...), and now that the minimum requirement grew to 16GB I regret sticking with the RTX 3060 instead of going for an Intel Arc 😂

You are a lifesaver. I will surely try this out when the components arrive!

neonsecret commented 1 year ago

see https://github.com/deep-floyd/IF/pull/61

tildebyte commented 1 year ago

@kanttouchthis: What is inference speed like when running it this way (and what are the hardware specs)?

Trimad commented 1 year ago

This didn't work on an RTX 4080 with 16GB of VRAM.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 GiB. GPU 0 has a total capacty of 15.99 GiB of which 10.82 GiB is free. Of the allocated memory 2.11 GiB is allocated by PyTorch, and 729.21 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

tildebyte commented 1 year ago

I know that this will probably sound a certain way, but: is this even English? Personally, I'm sick of Torch's horrible technical writing...

torch.cuda.set_per_process_memory_fraction

Set memory fraction for a process. The fraction is used to limit an caching allocator to allocated memory on a CUDA device. The allowed value equals the total visible memory multiplied fraction. If trying to allocate more than the allowed value in a process, will raise an out of memory error in allocator.

Gitterman69 commented 1 year ago

I'd love to see a full script, not just some random snippets...