Borketh / hardqoi

High-performance vectorized implementation of the Quite OK Image Format, written in Rust and inline assembly.
BSD 3-Clause "New" or "Revised" License

Optimized decoder for WebAssembly #4

Open thelamer opened 1 year ago

thelamer commented 1 year ago

Feel free to simply close out this issue if you are not interested, but we just implemented the QOI image format for VNC to deliver lossless remote desktops using Rust WASM client-side, here: https://github.com/kasmtech/noVNC/tree/master/core/decoders/qoi Some docs here: https://www.kasmweb.com/docs/latest/how_to/lossless.html

I have been wondering for some time whether SIMD optimizations were even possible on the server side. I tried out the stable branch with ssse3 and saw roughly ±10% in encoding speed versus rapid-qoi, depending on what image you feed it. Offloading the hashing looks promising, especially once the AVX stuff is implemented. What I am specifically reaching out about, though, is whether you think decoding could be sped up in a web browser. The compiled blob linked earlier in noVNC is a modified version of this implementation: https://github.com/lukeflima/qoi-viewer
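(For context, the "hashing" discussed here is QOI's 64-entry color index: the spec stores every seen pixel at position (3r + 5g + 7b + 11a) mod 64, so vectorizing the encoder largely means computing this hash for many pixels at once. A minimal scalar sketch in Rust, taken from the public spec rather than hardqoi's code:)

```rust
/// QOI's index-table hash, as defined in the QOI v1 specification:
/// each seen pixel is stored at (3r + 5g + 7b + 11a) % 64.
fn qoi_hash(r: u8, g: u8, b: u8, a: u8) -> usize {
    (r as usize * 3 + g as usize * 5 + b as usize * 7 + a as usize * 11) % 64
}

fn main() {
    // Opaque white lands at (3 + 5 + 7 + 11) * 255 mod 64 = 6630 mod 64 = 38.
    let h = qoi_hash(255, 255, 255, 255);
    assert_eq!(h, 38);
    println!("hash = {h}");
}
```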

This is all functional, but under high-load scenarios you need a pretty beefy client to maintain FPS at a gigabit. Even a small improvement on the WebAssembly side would have a large impact on the overall smoothness of desktop delivery. Anything we do for desktop delivery is open source, including these changes if possible.

Essentially I am wondering if you would be interested in some side work to put together a highly optimized, open source WASM QOI decoder that takes a Uint8Array as input and spits back ImageData as a Uint8ClampedArray plus size information. We do 24-bit QOI without the alpha channel.

Borketh commented 1 year ago

Hi @thelamer! WASM is a planned target for optimizations, but I haven't looked into it much yet. I was working on some further optimizations and restructuring of the ssse3 and x86-64 parts in general, but I may have lost my work (currently trying to recover it from a potentially borked disk image as we speak, oh boy). My road map was basically to work my way up the feature sets of x86 before moving on to ARM, and then potentially WASM.

I don't know all that much about the latter two platforms, but I went into x86 without knowing anything either, so I can learn the same way. What I am aware of, however, is that Rust only targets wasm32 at the moment (correct me if I'm wrong). Some of the optimizations (including the single-pixel hash function inspired by rapid-qoi) depend on fitting within a 64-bit integer, so those would have to be stripped. I also don't know the extent of the SIMD options available in WASM; I assume they have to be more general to stay platform-independent, so some of the instructions I would need may not be there, but I can definitely try. The potential is potentially there, I think (lol).
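(To illustrate the kind of 64-bit trick that breaks on wasm32: the single-pixel hash can be done with one 64-bit multiply by spreading the channels into 16-bit lanes, so the top lane of the product accumulates 3r + 5g + 7b + 11a. This is a sketch of the idea, not rapid-qoi's or hardqoi's exact code:)

```rust
/// One-multiply QOI hash: place r,g,b,a in 16-bit lanes of a u64 and
/// multiply by a constant holding the weights in the opposite order.
/// Lane sums max out at 6630 < 65536, so no carry leaks between lanes,
/// and the top lane of the product holds exactly 3r + 5g + 7b + 11a.
fn qoi_hash_u64(r: u8, g: u8, b: u8, a: u8) -> usize {
    let v = (r as u64) | ((g as u64) << 16) | ((b as u64) << 32) | ((a as u64) << 48);
    const M: u64 = 11 | (7 << 16) | (5 << 32) | (3 << 48);
    ((v.wrapping_mul(M) >> 48) & 63) as usize
}

fn main() {
    // Cross-check against the straightforward formula for a few pixels.
    for &(r, g, b, a) in &[(255u8, 255u8, 255u8, 255u8), (1, 2, 3, 4), (200, 13, 77, 255)] {
        let direct = (r as usize * 3 + g as usize * 5 + b as usize * 7 + a as usize * 11) % 64;
        assert_eq!(qoi_hash_u64(r, g, b, a), direct);
    }
    println!("ok");
}
```

On a 32-bit target the same multiply needs a widening 64-bit operation or a fallback to the four scalar multiplies, which is exactly why it would be stripped for wasm32.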

If you want, I can try WASM after I finish x86 (once I'm done with the base stuff and ssse3, the rest of the instruction sets won't take very long).

thelamer commented 1 year ago

@AstroFloof sorry for the delay, I missed this ping. Yes, SIMD support in WebAssembly is very limited. I am not very low-level; my programming experience has mostly revolved around my work at Linuxserver.io, building out web apps in my spare time. When it comes to making something new, that generally means plugging off-the-shelf components into each other, as was done with the existing WASM QOI decode logic and the noVNC project.

So in this case my focus is on any optimizations, even small ones, that could be made to the decoding reference implementation I used, which translates directly to lower client-side CPU usage and higher FPS. I thought I would reach out to anyone trying to improve QOI v1 decoding/encoding, which is a very small list. Right now performance is pretty good on higher-end modern CPUs:

https://user-images.githubusercontent.com/1852688/211933613-602be012-fedf-44a2-b4b4-77a032e312d0.mp4

An easy way to see this first-hand would be to run:

docker run --rm -it --shm-size=512m -p 6901:6901 -e VNC_PW=password kasmweb/ubuntu-focal-desktop:1.12.0

Then open https://localhost:6901 (user: kasm_user, pass: password) and swap to lossless under Settings > Stream Quality. (Use a Chromium-based browser for best results.)

From a development standpoint it would just involve building and swapping out the WASM blob and function names in https://github.com/kasmtech/noVNC/blob/master/core/decoders/qoi/decoder.js#L256-L277 and seeing if it can eke out any more FPS. (Inside the Docker image this lives at /usr/share/kasmvnc/www/core/decoders/qoi/.)

If you set up a GitHub Sponsorship on your account I would be happy to toss you some money just for looking into it. I'm interested in whether it is even possible.

Borketh commented 1 year ago

Hello again @thelamer !

I'm honoured that you'd consider sponsoring this project, and I would love to work on a WASM decoder (and later an encoder). I should warn you not to expect anything, however. I don't know anything about WASM yet (although I knew nothing about x86 before starting this project either, so I will learn, of course), and I'm not sure how long it'll take me to make an MVP to start optimizing. Additionally, most if not all of the SIMD-related optimizations focus on hashing every pixel before beginning the encoding process. I do know of techniques I've used to speed up decoding that I can attempt to apply, though. Whatever the final product may be, I'll certainly do my best!
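(One example of the kind of decode technique being alluded to; a hedged sketch, not hardqoi's actual code: a QOI_OP_RUN chunk repeats the previous pixel, so working on packed u32 RGBA pixels, a whole run can be emitted with one bulk resize instead of a per-pixel loop, which compilers lower to a memset-style fill.)

```rust
/// Expand a QOI_OP_RUN tag byte (0b11xxxxxx, run length = (byte & 0x3f) + 1)
/// into `out` by bulk-filling with the previous pixel, packed as one u32.
fn expand_run(out: &mut Vec<u32>, tag: u8, prev: u32) {
    debug_assert_eq!(tag & 0xc0, 0xc0, "not a QOI_OP_RUN tag");
    let run = (tag & 0x3f) as usize + 1; // the spec stores length with a bias of -1
    out.resize(out.len() + run, prev);   // one bulk fill instead of `run` pushes
}

fn main() {
    let mut pixels = Vec::new();
    let prev = u32::from_le_bytes([10, 20, 30, 255]);
    expand_run(&mut pixels, 0xc0 | 4, prev); // run length 5
    assert_eq!(pixels.len(), 5);
    assert!(pixels.iter().all(|&p| p == prev));
    println!("expanded {} pixels", pixels.len());
}
```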

I intend to release a new version of the x86-64 encoder/decoder very soon, so I'll start after that.

Feyko commented 1 year ago

Hey @thelamer! I'm a friend of Floof's who introduced him to QOI and have followed his hardqoi development since. At one point I wondered if a QOI GPU endec was possible, but abandoned the idea after learning about the relatively high latency of GPU API initialisation. This wouldn't be a problem, however, if the API context is reused, as it would be for something like remote desktop streams. There are (many) other challenges with QOI on the GPU, but it seems like something worth looking into. I'll admit I got nerd-sniped and may try to revive the idea, though I am confused why you'd use an image format like QOI instead of a video format, which grants higher compression. Could you explain why you made that choice? I'm available on Discord (Feyko#7953) if you want a more interactive chat.

Feyko commented 1 year ago

Welp, quick update on this. Did some more investigating, and I think QOI on GPU really is a dead end :P

thelamer commented 1 year ago

@AstroFloof so I have been pumping decode code through different AI language models, and some of them seem to think the hashing is not needed and that it can be more efficiently performed with an array. This is all Greek to me, but does this make any sense to you?

use std::io::Read;

fn decode_qoi(reader: impl Read) -> Result<Vec<u8>, std::io::Error> {
    let mut buf = vec![0; 16];
    reader.read_exact(&mut buf)?;

    let magic = u32::from_be_bytes(buf[0..4].to_vec());
    let width = u32::from_be_bytes(buf[4..8].to_vec());
    let height = u32::from_be_bytes(buf[8..12].to_vec());
    let channels = buf[12] as u8;
    let colorspace = buf[13] as u8;

    if magic != 0x716f6966 {
        return Err(std::io::Error::new(std::io::ErrorKind::InvalidData, "Invalid QOI magic"));
    }

    if width == 0 || height == 0 || channels < 3 || channels > 4 || colorspace > 1 {
        return Err(std::io::Error::new(std::io::ErrorKind::InvalidData, "Invalid QOI header"));
    }

    let mut pixels = vec![0; width * height * channels];

    let mut run = 0;
    let mut prev_color = [0; 4];
    for i in 0..width * height {
        if run > 0 {
            run -= 1;
            continue;
        }

        let op = reader.read_u8()?;
        match op {
            0xfe => {
                prev_color[0] = reader.read_u8()?;
                prev_color[1] = reader.read_u8()?;
                prev_color[2] = reader.read_u8()?;
            }
            0xff => {
                prev_color[0] = reader.read_u8()?;
                prev_color[1] = reader.read_u8()?;
                prev_color[2] = reader.read_u8()?;
                prev_color[3] = reader.read_u8()?;
            }
            0x00..=0x3f => {
                let index = op as usize;
                for j in 0..channels {
                    pixels[i * channels + j] = prev_color[j] + index;
                }
            }
            0x40..=0x7f => {
                let index = (op & 0x3f) as usize;
                for j in 0..channels {
                    pixels[i * channels + j] = prev_color[j] + ((op >> 4) & 0x03) - 2;
                }
            }
            0x80..=0xbf => {
                let value = (op & 0x3f) - 32;
                for j in 0..channels {
                    pixels[i * channels + j] = value;
                }
            }
            0xc0..=0xff => {
                let runs = op & 0x3f;
                run = runs;
            }
            _ => {
                return Err(std::io::Error::new(std::io::ErrorKind::InvalidData, "Invalid QOI op"));
            }
        }
    }

    Ok(pixels)
}

Borketh commented 1 year ago

This doesn't make any sort of sense. LLMs just predict the next token and have no idea what they're doing.
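(For the record, the QOI v1 spec's two-bit tags work quite differently from what the generated code above assumes: QOI_OP_INDEX looks up a 64-entry table of previously seen pixels, so the hash table is required, not optional; QOI_OP_DIFF and QOI_OP_LUMA add small biased deltas per channel; and QOI_OP_RUN repeats the previous pixel (byte & 0x3f) + 1 times. A minimal sketch of correct tag dispatch, based on the public spec rather than hardqoi's code:)

```rust
/// Decode one QOI chunk per the QOI v1 spec. `extra` holds any bytes that
/// follow the tag byte; `index` is the required 64-entry table of seen
/// pixels. Returns the decoded pixel and how many times to emit it.
fn decode_op(op: u8, extra: &[u8], prev: [u8; 4], index: &[[u8; 4]; 64]) -> ([u8; 4], usize) {
    match op {
        0xfe => ([extra[0], extra[1], extra[2], prev[3]], 1),  // QOI_OP_RGB
        0xff => ([extra[0], extra[1], extra[2], extra[3]], 1), // QOI_OP_RGBA
        0x00..=0x3f => (index[op as usize], 1),                // QOI_OP_INDEX: table lookup
        0x40..=0x7f => {                                       // QOI_OP_DIFF: 2-bit deltas, bias 2
            let (dr, dg, db) = ((op >> 4) & 3, (op >> 2) & 3, op & 3);
            ([prev[0].wrapping_add(dr).wrapping_sub(2),
              prev[1].wrapping_add(dg).wrapping_sub(2),
              prev[2].wrapping_add(db).wrapping_sub(2),
              prev[3]], 1)
        }
        0x80..=0xbf => {                                       // QOI_OP_LUMA: 6-bit dg (bias 32),
            let dg = (op & 0x3f).wrapping_sub(32);             // nibble deltas relative to dg (bias 8)
            let dr = dg.wrapping_add(extra[0] >> 4).wrapping_sub(8);
            let db = dg.wrapping_add(extra[0] & 0x0f).wrapping_sub(8);
            ([prev[0].wrapping_add(dr), prev[1].wrapping_add(dg),
              prev[2].wrapping_add(db), prev[3]], 1)
        }
        0xc0..=0xfd => (prev, (op & 0x3f) as usize + 1),       // QOI_OP_RUN: length bias -1
    }
}

fn main() {
    let index = [[0u8; 4]; 64];
    let prev = [10, 20, 30, 255];
    // QOI_OP_DIFF encoded as 0b01_11_10_01: dr = +1, dg = 0, db = -1.
    assert_eq!(decode_op(0x79, &[], prev, &index), ([11, 20, 29, 255], 1));
    // QOI_OP_RUN with length field 4 means a run of 5 pixels.
    assert_eq!(decode_op(0xc4, &[], prev, &index), (prev, 5));
    println!("ok");
}
```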

thelamer commented 1 year ago

> This doesn't make any sort of sense. LLMs just predict the next token and have no idea what they're doing.

Thanks for taking a look. I figured as much; I could not get this to run.