Open phalexo opened 1 year ago
Your best options are torch.compile, which trades compile time for faster inference (PyTorch >= 2.0.0, Linux only, though I haven't noticed significant improvements), or reducing the number of steps per stage at the cost of image quality. Because the stages run one after another, spreading them across multiple GPUs doesn't speed up inference unless there is a way to parallelize them. The Titan X Maxwell only gets about 6 TFLOPS, which is not a lot by today's standards. I'm getting about 6.3 it/s on stage 1 with an RTX 3090 (~36 TFLOPS), so scaling by the TFLOPS ratio you can probably expect about 1 it/s. These are large, computationally expensive models compared to Stable Diffusion: the first-stage UNet alone is about 4x larger than the entire SD model.
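For reference, the "about 1 it/s" figure is just the RTX 3090 number scaled by the ratio of peak TFLOPS. A rough sketch of that arithmetic follows; `estimate_its` is a made-up helper name, and the linear scaling is an assumption that ignores memory bandwidth, half-precision support, and other architectural differences, so treat the result as a ballpark only.

```python
def estimate_its(ref_its: float, ref_tflops: float, target_tflops: float) -> float:
    """Scale a measured it/s figure linearly by peak TFLOPS (a crude assumption)."""
    return ref_its * target_tflops / ref_tflops


# Figures quoted in this thread: 6.3 it/s on an RTX 3090 (~36 TFLOPS),
# and roughly 6 TFLOPS for a Titan X Maxwell.
titan_x_its = estimate_its(ref_its=6.3, ref_tflops=36.0, target_tflops=6.0)
print(f"estimated stage 1 speed: {titan_x_its:.2f} it/s")
```

Plugging in a different TFLOPS value for an overclocked card gives a correspondingly higher estimate, still far below the 3090.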
Mine are overclocked and get about 9 TFLOPS. That said, I am comparing Floyd with SD V1 running on the same GPU, at the same clock rate. I got some really spectacular pictures out of SD V1, but the results from the several prompts I have tried with Floyd are unimpressive. I also thought it was supposed to be good at putting text elements into a photo. I tried to describe a futuristic house with a realtor sign with some text in front, and the model just totally ignored my instructions regarding the sign.
I am running the dream pipeline on 4 water-cooled Maxwell Titan X, with each stage on its own GPU. It is slow as molasses. It is painful to watch.
There are no OOMs; the stages do fit into the roughly 12 GB of memory that each Titan has.
Any suggestions are welcome.