AssemblyScript / assemblyscript

A TypeScript-like language for WebAssembly.
https://www.assemblyscript.org
Apache License 2.0

Just showing what I'm doing with AssemblyScript #2862

Open Mudloop opened 1 month ago

Mudloop commented 1 month ago

Question

Not sure if this is allowed here, but I wanted to show one of my AssemblyScript-based projects :

[Screenshot: synth main window]

The main brain of the synth is handled with AssemblyScript. It handles parameter management, oscillators (wavetables), modulation and routing, and the filters and effects are done with Faust (with a little help from the host).

It's currently running in the browser (actually an Electron app atm), but the plan is to use iPlug2 and Wasmer to turn it into a VST/AU/... plugin. It's all set up with that in mind, and I've done some tests to make sure that would work.

The main challenge has been performance. AS is pretty good at that, but there's a ton going on: 44100 samples per second with 12 voices, 2 engines with 7-voice unison, and 2 generators each that have multiple stages. It adds up. But after a couple of iterations of the audio engine, it's in a good place. My first attempt took just under a second to generate a second of audio, which isn't acceptable, but the current version manages it in about 100ms, so that's good. And I haven't even vectorized anything yet.
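To put those numbers in perspective, here is a back-of-the-envelope sketch. It assumes the worst case multiplies out as voices × engines × unison voices × generators, which is only one reading of the description above, and all constant and function names are made up for illustration.

```typescript
// Back-of-the-envelope load estimate. The multiplication below is an
// ASSUMPTION about how the counts in the post combine at max polyphony.
const SAMPLE_RATE = 44100;
const VOICES = 12;
const ENGINES = 2;
const UNISON = 7;
const GENERATORS = 2;

// Generator evaluations needed for a single output sample.
const generatorsPerSample = VOICES * ENGINES * UNISON * GENERATORS;

// Generator evaluations needed for one second of audio.
const generatorsPerSecond = generatorsPerSample * SAMPLE_RATE;

// Fraction of the real-time budget used when `ms` milliseconds of
// processing produce one second of audio.
function realTimeLoad(ms: number): number {
  return ms / 1000;
}

console.log(generatorsPerSample); // 336
console.log(generatorsPerSecond); // 14817600
console.log(realTimeLoad(100));   // 0.1
```

Under that reading, 100ms per second of audio means roughly 10% of the real-time budget goes to the engine, before counting the multiple stages inside each generator.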

I used Lit to make the UI. It still needs some polish, and there are still a lot of placeholders. I'm not a designer, but I'm pretty happy with what I managed to come up with.

bitnom commented 1 month ago

my gods you've done it, haven't you

JairusSW commented 1 month ago

@bitnom, looks like he killed it 👏

Mudloop commented 1 month ago

Thanks for the responses!

I spent some time updating the design :

[Screenshot: updated synth design]

This is all CSS; the only image used is the X.

Processing time was increasing towards 180ms / 1s of audio (that's with max polyphony and max unison), so it was time to vectorize some stuff, and now it's ~70ms again.

I now also do the filters in AssemblyScript, which saves a lot of time. Converting compiled Faust C++ filters to AssemblyScript is trivial, and I plan to vectorize those as well.

So yeah, AssemblyScript is a real champ, if you know how to (ab)use it. Lots of pointer arithmetic going on etc.

Here's some SIMD util code I wrote. I find the interpolateTowards method especially nifty for things like phasors and for interpolating params across a vector. The lerp3 method could potentially be done more efficiently, but it does the trick and isn't bottlenecking me at the moment.

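// Lane offsets 0, 1/4, 2/4, 3/4, used to spread a ramp across the four f32 lanes.
export const interpolator: v128 = f32x4(0, 0.25, 0.5, 0.75);
// Frequently used constants, splatted to all four f32 lanes.
export const splat0_5 = f32x4.splat(0.5);
export const splat0 = f32x4.splat(0);
export const splatMinusOne = f32x4.splat(-1);
export const splat1 = f32x4.splat(1);
export const splat2 = f32x4.splat(2);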

export class SimdUtil {
    @inline static lerp(a: v128, b: v128, t: v128): v128 {
        return f32x4.add(f32x4.mul(t, f32x4.sub(b, a)), a);
    }
    @inline static interpolateTowards(from: f32, to: f32): v128 {
        return f32x4.add(f32x4.mul(interpolator, f32x4.splat(to - from)), f32x4.splat(from));
    }
    @inline static lerp3(a: v128, b: v128, c: v128, t: v128): v128 {
        const ratio = f32x4.mul(t, splat2);
        const firstRatio = f32x4.max(splat0, f32x4.min(splat1, ratio));
        const secondRatio = f32x4.max(splat0, f32x4.min(splat1, f32x4.sub(ratio, splat1)));
        const first = this.lerp(a, b, firstRatio);
        const second = this.lerp(b, c, secondRatio);
        const firstMul = f32x4.mul(first, f32x4.sub(splat1, secondRatio));
        const secondMul = f32x4.mul(second, secondRatio);
        return f32x4.add(firstMul, secondMul);
    }
    @inline static makeBipolar(v: v128): v128 {
        return f32x4.sub(f32x4.mul(v, splat2), splat1);
    }
    @inline static makeUnipolar(v: v128): v128 {
        return f32x4.mul(f32x4.add(v, splat1), splat0_5);
    }
    @inline static clamp(v: v128, min: v128, max: v128): v128 {
        return f32x4.min(f32x4.max(min, v), max);
    }
    @inline static normalize(v: v128): v128 {
        return f32x4.sub(v, f32x4.floor(v));
    }
    @inline static gather_v<T>(pointers: v128): v128 {
        let ret: v128 = v128.load_zero<T>(i32x4.extract_lane(pointers, 0));
        ret = v128.load_lane<T>(i32x4.extract_lane(pointers, 1), ret, 1);
        ret = v128.load_lane<T>(i32x4.extract_lane(pointers, 2), ret, 2);
        ret = v128.load_lane<T>(i32x4.extract_lane(pointers, 3), ret, 3);
        return ret;
    }
    @inline static gather<T>(ptr1: usize, ptr2: usize, ptr3: usize, ptr4: usize): v128 {
        let ret: v128 = v128.load_zero<T>(ptr1);
        ret = v128.load_lane<T>(ptr2, ret, 1);
        ret = v128.load_lane<T>(ptr3, ret, 2);
        ret = v128.load_lane<T>(ptr4, ret, 3);
        return ret;
    }
    @inline static scatter_v<T>(pointers: v128, value: v128): void {
        v128.store_lane<T>(i32x4.extract_lane(pointers, 0), value, 0);
        v128.store_lane<T>(i32x4.extract_lane(pointers, 1), value, 1);
        v128.store_lane<T>(i32x4.extract_lane(pointers, 2), value, 2);
        v128.store_lane<T>(i32x4.extract_lane(pointers, 3), value, 3);
    }
    @inline static scatter<T>(ptr1: usize, ptr2: usize, ptr3: usize, ptr4: usize, value: v128): void {
        v128.store_lane<T>(ptr1, value, 0);
        v128.store_lane<T>(ptr2, value, 1);
        v128.store_lane<T>(ptr3, value, 2);
        v128.store_lane<T>(ptr4, value, 3);
    }

}
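As a sanity check on two of the helpers above, here is a scalar reference model in plain TypeScript, with each v128 replaced by a 4-element array or a single lane. It is a sketch for reasoning about the math, not part of the posted code, and every name in it is illustrative. Working through lerp3 as posted: in the second half (where the first ratio has saturated at 1) the result comes out as b + s²(c − b), with s being the second ratio, so it is not piecewise linear there. That may well be intended; `lerp3Chained` is a hypothetical variant that stays linear and uses fewer operations.

```typescript
// Scalar reference models of two helpers from the class above.
// All names here are illustrative, not from the original post.
type Vec4 = [number, number, number, number];

// Same lane offsets as the `interpolator` constant.
const LANE_OFFSETS: Vec4 = [0, 0.25, 0.5, 0.75];

// Models interpolateTowards: four evenly spaced steps from `from`
// towards `to`; `to` itself would be lane 0 of the next vector.
function interpolateTowards(from: number, to: number): Vec4 {
  return LANE_OFFSETS.map((k) => k * (to - from) + from) as Vec4;
}

function lerp(a: number, b: number, t: number): number {
  return t * (b - a) + a;
}

function clamp01(x: number): number {
  return Math.max(0, Math.min(1, x));
}

// Models one lane of lerp3 exactly as posted.
function lerp3Posted(a: number, b: number, c: number, t: number): number {
  const ratio = t * 2;
  const firstRatio = clamp01(ratio);
  const secondRatio = clamp01(ratio - 1);
  const first = lerp(a, b, firstRatio);
  const second = lerp(b, c, secondRatio);
  return first * (1 - secondRatio) + second * secondRatio;
}

// Hypothetical alternative: chain the two lerps instead of cross-fading.
function lerp3Chained(a: number, b: number, c: number, t: number): number {
  const ratio = t * 2;
  return lerp(lerp(a, b, clamp01(ratio)), c, clamp01(ratio - 1));
}

console.log(interpolateTowards(2, 4));    // [2, 2.5, 3, 3.5]
console.log(lerp3Posted(0, 1, 3, 0.25));  // 0.5
console.log(lerp3Posted(0, 1, 3, 0.75));  // 1.5
console.log(lerp3Chained(0, 1, 3, 0.75)); // 2
```

Both versions agree for t ≤ 0.5 and at t = 1; they differ only strictly between 0.5 and 1.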

The main issue now is that the UI has become a bit sluggish. There are a lot of filters and drop shadows, so I'll probably need to make some images to make re-rendering cheaper. Everything you see there is done with CSS (the only image currently is the X in the master section), so there are lots of opportunities to optimize. The catch is that it will be harder to tweak once I replace styled elements with images.

Mudloop commented 1 month ago

Interesting find, performance greatly improved by replacing this (which is used in wavetable lookups) :

return i32x4(
    load<i32>(i32x4.extract_lane(pointers, 0)),
    load<i32>(i32x4.extract_lane(pointers, 1)),
    load<i32>(i32x4.extract_lane(pointers, 2)),
    load<i32>(i32x4.extract_lane(pointers, 3))
);

with this:

let ret: v128 = f32x4.splat(0);
ret = v128.load_lane<f32>(i32x4.extract_lane(pointers, 0), ret, 0);
ret = v128.load_lane<f32>(i32x4.extract_lane(pointers, 1), ret, 1);
ret = v128.load_lane<f32>(i32x4.extract_lane(pointers, 2), ret, 2);
ret = v128.load_lane<f32>(i32x4.extract_lane(pointers, 3), ret, 3);
return ret;

A bit less pretty, but way faster - made my entire thing about 20% more performant.

This might still not be the optimal way to load a vector from pointers stored in another vector, due to the extraction of the 4 lanes. I tried extracting them all at once into a memory.data slot and using that, but that made things worse.
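For anyone unfamiliar with the pattern, this is what the gather does, modelled in plain TypeScript with an ArrayBuffer standing in for WebAssembly linear memory. It is only a scalar illustration of the semantics (four independent 32-bit loads packed into one 4-lane result); `gatherF32` and the sample data are made up for the example, and nothing here says anything about performance.

```typescript
// Scalar model of the gather pattern: four independent 32-bit float loads
// from byte offsets into linear memory, packed into one 4-lane result.
// `memory` stands in for WebAssembly linear memory; names are illustrative.
function gatherF32(memory: ArrayBuffer, pointers: number[]): number[] {
  const view = new DataView(memory);
  // WebAssembly memory is little-endian, hence the `true` flag.
  return pointers.map((ptr) => view.getFloat32(ptr, true));
}

// Build a tiny "wavetable" of five floats at byte offsets 0, 4, 8, 12, 16.
const memory = new ArrayBuffer(20);
const writer = new DataView(memory);
[10, 20, 30, 40, 50].forEach((v, i) => writer.setFloat32(i * 4, v, true));

console.log(gatherF32(memory, [16, 0, 8, 4])); // [50, 10, 30, 20]
```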

There's not much info out there on SIMD in AssemblyScript, so I'll share what I find in case it helps someone looking for SIMD tips.

EDIT : I added scatter / gather methods to the above simd util class, which do this.

Mudloop commented 2 weeks ago

Here are some more simd methods that might be useful to someone :

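    // Assumed module-level constant (not shown in the original post),
    // declared alongside the other splat constants:
    //   export const splat31_i = i32x4.splat(31);

    // Per i32 lane: smear the highest set bit downwards, then keep only that bit.
    @inline static getPreviousPowerOfTwo(n: v128): v128 {
        n = v128.or(n, v128.shr<i32>(n, 1));
        n = v128.or(n, v128.shr<i32>(n, 2));
        n = v128.or(n, v128.shr<i32>(n, 4));
        n = v128.or(n, v128.shr<i32>(n, 8));
        n = v128.or(n, v128.shr<i32>(n, 16));
        return i32x4.sub(n, v128.shr<i32>(n, 1));
    }
    // Count leading zeros per i32 lane (scalar clz on each extracted lane).
    @inline static clz_i32(v: v128): v128 {
        return i32x4(
            clz<i32>(i32x4.extract_lane(v, 0)),
            clz<i32>(i32x4.extract_lane(v, 1)),
            clz<i32>(i32x4.extract_lane(v, 2)),
            clz<i32>(i32x4.extract_lane(v, 3))
        );
    }
    // floor(log2(v)) per i32 lane, computed as 31 - clz(v).
    @inline static log2_i32(v: v128): v128 {
        return i32x4.sub(splat31_i, this.clz_i32(v));
    }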
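A scalar model of these two helpers, one lane at a time, makes the bit tricks easier to check by hand. This is a plain TypeScript sketch with illustrative names, assuming positive 32-bit inputs; it uses a logical shift (`>>>`) where the posted code uses `shr<i32>`, which behaves the same for non-negative values.

```typescript
// Scalar model of the two integer helpers, one lane at a time.
// ASSUMES inputs are positive 32-bit integers.
function previousPowerOfTwo(n: number): number {
  // Smear the highest set bit into every lower bit...
  n |= n >>> 1;
  n |= n >>> 2;
  n |= n >>> 4;
  n |= n >>> 8;
  n |= n >>> 16;
  // ...then subtract everything below the highest bit.
  return (n - (n >>> 1)) >>> 0;
}

// floor(log2(n)) as 31 minus the count of leading zero bits.
function log2Int(n: number): number {
  return 31 - Math.clz32(n);
}

console.log(previousPowerOfTwo(5));    // 4
console.log(previousPowerOfTwo(8));    // 8
console.log(previousPowerOfTwo(1000)); // 512
console.log(log2Int(8));               // 3
```

The two are linked: for positive n, the previous power of two equals `1 << log2Int(n)`.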

If anyone has a better idea on how to handle the clz operation without extracting the lanes, I'd love to hear it. Or the getPreviousPowerOfTwo method for that matter. It's pretty fast, but I suck at bitwise shenanigans, so it might be doable in some better way.

I use these for a wavetable anti-aliasing algorithm, to find "mipmaps" with a max frequency closest to the current frequency.
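On the question of avoiding lane extraction for clz: one possible approach (an untested idea, not something from this thread, and not benchmarked) is to convert each integer to a 32-bit float and read its exponent bits. Modelled in plain TypeScript below, with illustrative names. The catch is precision: it is only exact for 1 ≤ n < 2^24, because larger integers can round up to the next power of two during conversion. In WebAssembly SIMD terms the steps would presumably map to `f32x4.convert_i32x4_s`, a 23-bit i32x4 shift and a subtract for the log2, and a `v128.and` plus `i32x4.trunc_sat_f32x4_s` for the power of two.

```typescript
// Sketch: floor(log2(n)) and previous power of two from the float32
// representation of n. ASSUMPTION: 1 <= n < 2^24 (larger values may round up).
const f32 = new Float32Array(1);
const bits = new Uint32Array(f32.buffer); // same 4 bytes, viewed as an integer

function log2ViaFloatExponent(n: number): number {
  f32[0] = n;                    // int -> float32 conversion
  return (bits[0] >>> 23) - 127; // biased exponent minus the bias
}

function previousPowerOfTwoViaFloat(n: number): number {
  f32[0] = n;
  bits[0] &= 0x7f800000;         // keep the exponent, clear the mantissa
  return f32[0];                 // exactly 2^floor(log2(n))
}

console.log(log2ViaFloatExponent(5));       // 2
console.log(log2ViaFloatExponent(1024));    // 10
console.log(previousPowerOfTwoViaFloat(5)); // 4
// Outside the safe range: 2^25 - 1 rounds up to 2^25 in float32.
console.log(log2ViaFloatExponent(33554431)); // 25, but the true answer is 24
```

Whether this beats four scalar clz calls would need measuring; for wavetable sizes and mipmap indices the 2^24 limit is unlikely to matter.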

Edit : might as well share a recent screenshot :

[Screenshot: recent synth UI]