JCBrouwer / maua-stylegan2

This is the repo for my experiments with StyleGAN2. There are many like it, but this one is mine. Contains code for the paper Audio-reactive Latent Interpolations with StyleGAN.
https://wavefunk.xyz/audio-reactive-stylegan

Real time capabilities #12

pietrobolcato opened this issue 3 years ago

pietrobolcato commented 3 years ago

Hello! :) First of all, I wanted to congratulate you on your work, it's incredible! 🥇

I was wondering if, in your opinion, it would be possible to extend your work to generate the visuals in real time. This would mean using streamed audio data rather than pre-rendered files. Do you think that would be at all realistic?

Thank you in advance! Keep up the great work! :)

JCBrouwer commented 3 years ago

Hi there, thanks!

Yes, realtime is definitely on my todo list. I think it is feasible, although you'll need a big fat GPU.

Currently, 1920x1080 rendering for my 1024px networks runs at ~15 fps on my 2x1080Ti setup. Including calculating all the latents, noise, and modulations at the start, the total render time is roughly 2x the duration of the rendered video.

However, there are quite a few speedups that should be fairly easy to pick up, plus a couple further on the horizon that should help a lot as well:

  1. swapping render.py over to using DistributedDataParallel instead of DataParallel (should be ~1.25-1.5x faster)
  2. training networks directly on 1920x1080 material squashed to 1024x1024, then interpolating output frames back out, instead of doubling the width at the constant layer (roughly halves the number of pixels pushed through the network)
  3. using FP16 will give a big speedup if you have a 20 or 30 series Nvidia GPU (see the sketch after this list)
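
For the FP16 point, PyTorch's automatic mixed precision can simply be wrapped around the generator call. A minimal sketch, assuming a standard PyTorch generator (the generator and latents below are toy stand-ins, not names from this repo):

```python
# Minimal sketch: mixed-precision inference for a StyleGAN-style generator.
# The generator and latents are toy stand-ins for whatever the render loop provides.
import torch

generator = torch.nn.Sequential(torch.nn.Linear(512, 3 * 64 * 64)).eval().cuda()
latents = torch.randn(1, 512, device="cuda")

with torch.no_grad(), torch.cuda.amp.autocast():
    frames = generator(latents)  # matmuls/convs run in FP16 on Turing/Ampere tensor cores
```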

(more involved)

  1. quantization, pruning, network compression: these are all typical ways to get inference running faster, although they will all trade off some generation quality and require quite a bit of work to implement, as I haven't seen this done for StyleGAN models anywhere
  2. TorchScript / TorchJIT, ONNX, or TensorRT: these provide faster inference runtimes, although I have no idea how easy it would be to get the rendering pipeline converted for any of these (rough sketch below)
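
As an illustration of the TorchScript route: tracing the generator gives a serialized module with a faster, Python-free forward pass. Whether the full rendering pipeline traces cleanly is exactly the open question; the generator below is a toy stand-in.

```python
# Minimal sketch: tracing a generator with TorchScript. A real StyleGAN generator has extra
# inputs (noise, truncation) that may need to be fixed or folded in before tracing.
import torch

generator = torch.nn.Sequential(torch.nn.Linear(512, 3 * 64 * 64)).eval()
example_latent = torch.randn(1, 512)

traced = torch.jit.trace(generator, example_latent)
traced.save("generator_traced.pt")  # reload later with torch.jit.load("generator_traced.pt")
```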

In any case, in terms of StyleGAN inference speed, I think there's more than enough headroom to get realtime framerates even when calculating latents and noise on the fly.

What I'm a little less certain of are the options in terms of realtime onsets, chroma, and RMS. These are such standard use cases I'd assume that there are nice algorithms for them, although the question remains as to how good they look.
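
For a sense of what per-buffer feature extraction could look like with librosa (untested for latency; the buffer here is just random noise standing in for a live audio block):

```python
# Minimal sketch: onset strength, chroma, and RMS on a single short audio buffer.
import librosa
import numpy as np

sr = 44100
buffer = np.random.randn(sr // 10).astype(np.float32)  # ~100 ms stand-in for a live block

onset = librosa.onset.onset_strength(y=buffer, sr=sr).mean()        # how percussive the block is
chroma = librosa.feature.chroma_stft(y=buffer, sr=sr).mean(axis=1)  # 12-bin pitch-class profile
rms = librosa.feature.rms(y=buffer).mean()                          # loudness envelope
```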

I also know that spleeter is more than fast enough to split in realtime, so I think that would be a great option to alleviate any quality issues of calculating musical features in realtime (it's much easier to get nice drum onsets if the only thing in the audio is the drums!)
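
For reference, a stem split with Spleeter's Python API is roughly this simple (whether it keeps up in a streaming loop depends on block size and hardware; the array below is a stand-in for a real buffer):

```python
# Minimal sketch: separating a buffer into stems with Spleeter, then using only the drums.
import numpy as np
from spleeter.separator import Separator

separator = Separator("spleeter:4stems")                  # vocals / drums / bass / other
audio = np.random.randn(44100 * 5, 2).astype(np.float32)  # 5 s stereo stand-in
stems = separator.separate(audio)                         # dict: stem name -> waveform
drums = stems["drums"]                                    # feed this to the onset detector
```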

I'm not sure when I'll have time to work on things (as I'm doing this in my free time alongside a full-time master's + a part-time job), but hopefully I'll slowly knock things off the list over the coming months/year.

notlittlebrain commented 2 years ago

I've been looking everywhere for this exact concept, and this is the only mention of it that I've found. Stoked that it's actually possible. Any chance it'll become compatible with StyleGAN3?

My new PC with a 3060 shows up next week, so I have yet to play around with it, but stoked to hear that this is on the way.

Is there a way to map a MIDI controller to parameters within the software, or run it within Ableton as an M4L device? If so, I feel like this is a true hidden gem and could open the door to the next era of A/V sets.

lowlypalace commented 2 years ago

Interested too, will keep an eye on this thread!

JCBrouwer commented 2 years ago

I've developed a real-time version using Apache TVM for a client's project.

It's possible to run 3 screens of 1920x1080 interactively on a single RTX 3090. So running a single screen should be possible on a 3060.

I'll be integrating similar capabilities into Maua soon. Barebones real-time support (for StyleGAN1, 2, and 3) should be available within a few months, although I'm not sure how to go about TVM support.

@lowlypalace @notlittlebrain @seicaratteri

lowlypalace commented 1 year ago

Hi @JCBrouwer, was wondering if there is any update on this? The project looks amazing, by the way! Thanks!

JCBrouwer commented 1 year ago

Hey, thanks for your interest haha. It's been harder for me to work on this lately as I no longer have local access to my GPU machine, and doing real-time stuff over a web connection isn't the greatest.

I do have the main script for rendering straight to the screen available here. It's just doing a random interpolation, but the idea is to add live audio analysis into the forward function.

I'm nearing the end of a big overhaul of the audio-reactive framework based on my master's thesis. This will include the real-time audio analysis part as well. If you're feeling adventurous though, you can try and add it into the script above yourself.
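
Purely as an illustration of what wiring audio analysis into the forward function might look like (all names here are hypothetical, not the script's actual API): a loudness value could scale how fast the interpolation travels through a bank of latents.

```python
# Hypothetical sketch: audio-reactive latent interpolation for a real-time forward pass.
# `latents` is a precomputed bank of W-space latents, `rms` the loudness of the current block.
import torch

def audio_reactive_latent(latents, t, rms):
    t = t + 0.01 * (1.0 + 4.0 * rms)               # louder audio -> faster travel through the bank
    i = int(t) % (latents.shape[0] - 1)
    frac = t - int(t)
    latent = (1 - frac) * latents[i] + frac * latents[i + 1]  # linear blend between neighbours
    return latent.unsqueeze(0), t                  # batch of 1 for the generator, updated time
```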

With a 30 series GPU, TVM shouldn't be needed to get decent framerates (maybe at slightly smaller resolutions).

lowlypalace commented 1 year ago

Hey, @JCBrouwer I wanted to follow up to see if you've had a chance to implement any of these. Thanks!

JCBrouwer commented 1 year ago

Have you given the gpu2gl script a try? What kind of frame-rates were you getting for the random interpolation?

A lot of the changes I've made are tied up in a private branch for a client project I'm working on at the moment. These are mainly just API things though, all the underlying functionality is already available.

For example, GPU-accelerated audio feature calculation can be found here and audio-reactive latent sequence functions can be found here. These functions can be called directly in the forward function of the real-time module in gpu2gl.

This only leaves loading an audio stream, which is really dependent on your specific audio setup.
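
For completeness, one common way to pull a live audio stream in Python is the sounddevice library (not part of this repo; the block size and sample rate below are arbitrary):

```python
# Minimal sketch: reading live audio blocks with sounddevice and handing them to the render loop.
import sounddevice as sd

sr, block = 44100, 2048

def callback(indata, frames, time, status):
    mono = indata[:, 0].copy()   # one block of mono samples, shape (block,)
    # compute onsets / chroma / RMS here and pass them to the real-time module's forward()

with sd.InputStream(samplerate=sr, blocksize=block, channels=1, callback=callback):
    sd.sleep(10 * 1000)          # keep the stream open for ten seconds
```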