Open Mudloop opened 1 month ago
my gods you've done it, haven't you
@bitnom, looks like he killed it 👏
Thanks for the responses!
I spent some time updating the design :
This is all CSS; the only image used is the X.
Processing time had crept up towards 180 ms per second of audio (that's with max polyphony and max unison), so it was time to vectorize some stuff, and now it's back down to ~70 ms.
I now also do the filters in AssemblyScript, which saves a lot of time. Converting compiled Faust C++ filters to AssemblyScript is trivial, and I plan to vectorize those as well.
So yeah, AssemblyScript is a real champ, if you know how to (ab)use it. Lots of pointer arithmetic going on etc.
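To give an idea of what such a port can look like, here is a minimal sketch of a one-pole lowpass, written in the shape Faust's generated C++ usually takes (a small recursion buffer that is updated once per sample). The filter itself, the class name and the coefficient are illustrative assumptions, not code from the synth, and it is plain TypeScript so it runs anywhere; in AssemblyScript the values would be f32 and the buffers could be raw pointers.

```typescript
// Hypothetical example: a one-pole lowpass ported in the style of Faust's C++ output.
// y[n] = (1 - coeff) * x[n] + coeff * y[n - 1]
class OnePoleLowpass {
  // rec0[0] is the current output, rec0[1] the previous one (Faust-style naming).
  rec0: Float32Array = new Float32Array(2);
  coeff: number;

  constructor(coeff: number) {
    this.coeff = coeff;
  }

  compute(input: Float32Array, output: Float32Array): void {
    const b0 = 1 - this.coeff;
    for (let i = 0; i < input.length; i++) {
      this.rec0[0] = b0 * input[i] + this.coeff * this.rec0[1];
      output[i] = this.rec0[0];
      // Shift the recursion state for the next sample.
      this.rec0[1] = this.rec0[0];
    }
  }
}

const lowpass = new OnePoleLowpass(0.5);
const filtered = new Float32Array(4);
lowpass.compute(new Float32Array([1, 1, 1, 1]), filtered);
console.log(Array.from(filtered));
```

The per-sample state dependency is what makes filters harder to vectorize than oscillators; one common approach is to run four independent voices in the four lanes instead of four consecutive samples.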
Here's some SIMD util code I wrote. I find the interpolateTowards method especially nifty for things like phasors and for interpolating params across a vector. The lerp3 method could potentially be done more efficiently, but it does the trick and isn't bottlenecking me at the moment.
export const interpolator: v128 = f32x4(0, 0.25, .5, .75);
export const splat0_5 = f32x4.splat(0.5);
export const splat0 = f32x4.splat(0);
export const splatMinusOne = f32x4.splat(-1);
export const splat1 = f32x4.splat(1);
export const splat2 = f32x4.splat(2);
export class SimdUtil {
  // Per-lane linear interpolation from a to b by t.
  @inline static lerp(a: v128, b: v128, t: v128): v128 {
    return f32x4.add(f32x4.mul(t, f32x4.sub(b, a)), a);
  }
  // Ramp across the 4 lanes: from, from + 1/4 d, from + 2/4 d, from + 3/4 d (d = to - from).
  @inline static interpolateTowards(from: f32, to: f32): v128 {
    return f32x4.add(f32x4.mul(interpolator, f32x4.splat(to - from)), f32x4.splat(from));
  }
  // Three-point blend a -> b -> c for t in [0, 1].
  @inline static lerp3(a: v128, b: v128, c: v128, t: v128): v128 {
    const ratio = f32x4.mul(t, splat2);
    const firstRatio = f32x4.max(splat0, f32x4.min(splat1, ratio));
    const secondRatio = f32x4.max(splat0, f32x4.min(splat1, f32x4.sub(ratio, splat1)));
    const first = this.lerp(a, b, firstRatio);
    const second = this.lerp(b, c, secondRatio);
    const firstMul = f32x4.mul(first, f32x4.sub(splat1, secondRatio));
    const secondMul = f32x4.mul(second, secondRatio);
    return f32x4.add(firstMul, secondMul);
  }
  // Map [0, 1] to [-1, 1].
  @inline static makeBipolar(v: v128): v128 {
    return f32x4.sub(f32x4.mul(v, splat2), splat1);
  }
  // Map [-1, 1] to [0, 1].
  @inline static makeUnipolar(v: v128): v128 {
    return f32x4.mul(f32x4.add(v, splat1), splat0_5);
  }
  @inline static clamp(v: v128, min: v128, max: v128): v128 {
    return f32x4.min(f32x4.max(min, v), max);
  }
  // Keep only the fractional part, i.e. wrap into [0, 1).
  @inline static normalize(v: v128): v128 {
    return f32x4.sub(v, f32x4.floor(v));
  }
  // Load one T per lane from the 4 addresses stored in the lanes of `pointers`.
  @inline static gather_v<T>(pointers: v128): v128 {
    let ret: v128 = v128.load_zero<T>(i32x4.extract_lane(pointers, 0));
    ret = v128.load_lane<T>(i32x4.extract_lane(pointers, 1), ret, 1);
    ret = v128.load_lane<T>(i32x4.extract_lane(pointers, 2), ret, 2);
    ret = v128.load_lane<T>(i32x4.extract_lane(pointers, 3), ret, 3);
    return ret;
  }
  // Same as gather_v, but with the 4 addresses passed as scalars.
  @inline static gather<T>(ptr1: usize, ptr2: usize, ptr3: usize, ptr4: usize): v128 {
    let ret: v128 = v128.load_zero<T>(ptr1);
    ret = v128.load_lane<T>(ptr2, ret, 1);
    ret = v128.load_lane<T>(ptr3, ret, 2);
    ret = v128.load_lane<T>(ptr4, ret, 3);
    return ret;
  }
  // Store each lane of `value` to the address in the matching lane of `pointers`.
  @inline static scatter_v<T>(pointers: v128, value: v128): void {
    v128.store_lane<T>(i32x4.extract_lane(pointers, 0), value, 0);
    v128.store_lane<T>(i32x4.extract_lane(pointers, 1), value, 1);
    v128.store_lane<T>(i32x4.extract_lane(pointers, 2), value, 2);
    v128.store_lane<T>(i32x4.extract_lane(pointers, 3), value, 3);
  }
  // Same as scatter_v, but with the 4 addresses passed as scalars.
  @inline static scatter<T>(ptr1: usize, ptr2: usize, ptr3: usize, ptr4: usize, value: v128): void {
    v128.store_lane<T>(ptr1, value, 0);
    v128.store_lane<T>(ptr2, value, 1);
    v128.store_lane<T>(ptr3, value, 2);
    v128.store_lane<T>(ptr4, value, 3);
  }
}
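For anyone who wants to check the lane math without a Wasm build, here is a scalar sketch of interpolateTowards and lerp3 in plain TypeScript. The function names with suffixes (lerp3AsPosted, lerp3PiecewiseLinear) and the Vec4 type are made up for this illustration. Tracing the posted lerp3 by hand suggests that its second half falls off with the square of the ratio (b + s² (c - b)) because the result is blended twice; whether that is intended isn't stated, so the piecewise-linear variant is shown only as a possible alternative.

```typescript
type Vec4 = [number, number, number, number];

// Scalar model of SimdUtil.interpolateTowards: a 4-lane ramp that stops one step short of `to`.
function interpolateTowards(from: number, to: number): Vec4 {
  const d = to - from;
  return [from, from + 0.25 * d, from + 0.5 * d, from + 0.75 * d];
}

function lerp(a: number, b: number, t: number): number {
  return t * (b - a) + a;
}

function clamp01(x: number): number {
  return Math.max(0, Math.min(1, x));
}

// Scalar model of the posted lerp3, one lane at a time.
function lerp3AsPosted(a: number, b: number, c: number, t: number): number {
  const ratio = t * 2;
  const firstRatio = clamp01(ratio);
  const secondRatio = clamp01(ratio - 1);
  const first = lerp(a, b, firstRatio);
  const second = lerp(b, c, secondRatio);
  return first * (1 - secondRatio) + second * secondRatio;
}

// Hypothetical alternative: plain piecewise-linear a -> b -> c.
function lerp3PiecewiseLinear(a: number, b: number, c: number, t: number): number {
  const ratio = clamp01(t) * 2;
  return ratio < 1 ? lerp(a, b, ratio) : lerp(b, c, ratio - 1);
}

// Phasor use: advance the phase by 4 increments per vector.
const increment = 0.125;
const phaseRamp = interpolateTowards(0.5, 0.5 + 4 * increment);
console.log(phaseRamp);
console.log(lerp3AsPosted(0, 1, 2, 0.75), lerp3PiecewiseLinear(0, 1, 2, 0.75));
```

Because the ramp ends one step before `to`, the next vector can start exactly at `to` and the phasor stays continuous across vectors. In SIMD, the piecewise-linear variant would presumably need a lane-wise compare plus a select, so it is not obviously cheaper than the posted version.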
The main issue now is that the UI has become a bit sluggish. There are a lot of filters and drop shadows, so I will probably need to make some images to make re-rendering cheaper. Everything you see there is done with CSS, and the only image currently is the X in the master section, so there are lots of opportunities to optimize - the issue being that it will be harder to tweak once I replace styled elements with images.
Interesting find: performance improved greatly after replacing this (which is used in wavetable lookups):
return i32x4(
  load<i32>(i32x4.extract_lane(pointers, 0)),
  load<i32>(i32x4.extract_lane(pointers, 1)),
  load<i32>(i32x4.extract_lane(pointers, 2)),
  load<i32>(i32x4.extract_lane(pointers, 3))
);
with this:
let ret: v128 = f32x4.splat(0);
ret = v128.load_lane<f32>(i32x4.extract_lane(pointers, 0), ret, 0);
ret = v128.load_lane<f32>(i32x4.extract_lane(pointers, 1), ret, 1);
ret = v128.load_lane<f32>(i32x4.extract_lane(pointers, 2), ret, 2);
ret = v128.load_lane<f32>(i32x4.extract_lane(pointers, 3), ret, 3);
return ret;
A bit less pretty, but way faster - it made my entire thing about 20% more performant.
This might still not be the optimal way to load a vector from pointers stored in another vector, due to the extraction of the 4 lanes. I tried extracting them all at once into a memory.data slot and using that, but that made things worse.
There's not much info out there on SIMD in AssemblyScript, so I'll share what I find in case it helps someone looking for SIMD tips.
EDIT : I added scatter / gather methods to the above simd util class, which do this.
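For context on why a gather shows up in wavetable lookups, here is a minimal scalar sketch in plain TypeScript. It assumes a single-cycle table read with linear interpolation for four phases at once; gather4 and lookup4 are hypothetical names, and the real engine works on raw pointers rather than array indices.

```typescript
// Scalar model of a 4-lane gather: read one value per lane from arbitrary indices.
function gather4(table: Float32Array, indices: Int32Array): Float32Array {
  const out = new Float32Array(4);
  for (let lane = 0; lane < 4; lane++) {
    out[lane] = table[indices[lane]];
  }
  return out;
}

// Hypothetical wavetable read: four phases in [0, 1), linear interpolation, wrap at the end.
function lookup4(table: Float32Array, phases: Float32Array): Float32Array {
  const size = table.length;
  const lower = new Int32Array(4);
  const upper = new Int32Array(4);
  const frac = new Float32Array(4);
  for (let lane = 0; lane < 4; lane++) {
    const pos = phases[lane] * size;
    const idx = Math.floor(pos);
    lower[lane] = idx % size;
    upper[lane] = (idx + 1) % size;
    frac[lane] = pos - idx;
  }
  // Each lane reads from a different position, so a contiguous vector load cannot be used.
  const a = gather4(table, lower);
  const b = gather4(table, upper);
  const out = new Float32Array(4);
  for (let lane = 0; lane < 4; lane++) {
    out[lane] = a[lane] + frac[lane] * (b[lane] - a[lane]);
  }
  return out;
}

const table = new Float32Array([0, 1, 2, 3]);
console.log(Array.from(lookup4(table, new Float32Array([0, 0.125, 0.5, 0.875]))));
```

Two gathers per lookup is why the cost of the gather itself ends up dominating, which fits the roughly 20% difference reported above.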
Here are some more SIMD methods that might be useful to someone:
// Largest power of two <= n, per i32 lane (assumes n > 0).
@inline static getPreviousPowerOfTwo(n: v128): v128 {
  n = v128.or(n, v128.shr<i32>(n, 1));
  n = v128.or(n, v128.shr<i32>(n, 2));
  n = v128.or(n, v128.shr<i32>(n, 4));
  n = v128.or(n, v128.shr<i32>(n, 8));
  n = v128.or(n, v128.shr<i32>(n, 16));
  return i32x4.sub(n, v128.shr<i32>(n, 1));
}
// Count leading zeros per i32 lane (done lane by lane).
@inline static clz_i32(v: v128): v128 {
  return i32x4(
    clz<i32>(i32x4.extract_lane(v, 0)),
    clz<i32>(i32x4.extract_lane(v, 1)),
    clz<i32>(i32x4.extract_lane(v, 2)),
    clz<i32>(i32x4.extract_lane(v, 3))
  );
}
// floor(log2(v)) per i32 lane. splat31_i is not shown here; presumably i32x4.splat(31).
@inline static log2_i32(v: v128): v128 {
  return i32x4.sub(splat31_i, this.clz_i32(v));
}
If anyone has a better idea on how to handle the clz operation without extracting the lanes, I'd love to hear it. Or the getPreviousPowerOfTwo method for that matter. It's pretty fast, but I suck at bitwise shenanigans, so it might be doable in some better way.
I use these for a wavetable anti-aliasing algorithm, to find "mipmaps" with a max frequency closest to the current frequency.
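One possible answer to the clz question is the float-exponent trick: convert the integer to f32 and read the biased exponent out of the bit pattern, which gives floor(log2(n)) without touching individual lanes. Below is a scalar sketch in plain TypeScript, checked against the bit-smearing version above. The function names are made up for this example, and the trick is only exact while the integer fits in the 24-bit float mantissa (1 <= n <= 2^24), which seems plausible for wavetable sizes but is an assumption.

```typescript
const f32View = new Float32Array(1);
const u32View = new Uint32Array(f32View.buffer);

// floor(log2(n)) via the f32 exponent field. Valid for 1 <= n <= 2^24.
function log2ViaFloatExponent(n: number): number {
  f32View[0] = n; // integer -> f32 conversion
  return (u32View[0] >>> 23) - 127; // remove the exponent bias
}

// Largest power of two <= n: clear the mantissa, keep the exponent.
function previousPowerOfTwoViaFloat(n: number): number {
  f32View[0] = n;
  u32View[0] &= 0x7f800000; // the remaining bits encode exactly 2^k
  return f32View[0];
}

// Scalar model of the posted getPreviousPowerOfTwo, for comparison.
function previousPowerOfTwoBitSmear(n: number): number {
  n |= n >> 1;
  n |= n >> 2;
  n |= n >> 4;
  n |= n >> 8;
  n |= n >> 16;
  return n - (n >> 1);
}

console.log(log2ViaFloatExponent(1000), previousPowerOfTwoViaFloat(1000));
```

In AssemblyScript this would presumably map to f32x4.convert_i32x4_s, a 23-bit unsigned right shift and a subtraction of 127 per vector, with i32x4.trunc_sat_f32x4_s to get the power of two back as an integer; I have not benchmarked it against the lane-by-lane clz.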
Edit : might as well share a recent screenshot :
Question
Not sure if this is allowed here, but I wanted to show one of my AssemblyScript-based projects :
The main brain of the synth is handled with AssemblyScript. It handles parameter management, oscillators (wavetables), modulation and routing, and the filters and effects are done with Faust (with a little help from the host).
It's currently running in the browser (actually an Electron app atm), but the plan is to use iPlug2 and wasmer to turn it into a VST/AU/... plugin. It's all set up with that in mind, and I've done some tests to make sure that would work.
The main challenge has been performance - AS is pretty good at that, but there's a ton going on. At 44100 samples per second with 12 voices, 2 engines with 7-voice unison, and 2 generators each with multiple stages, it adds up. But after a couple of iterations of the audio engine, it's in a good place. The first time I tried, it took just under a second to generate a second of audio, which isn't acceptable, but the current version manages this in about 100ms, so that's good. And I haven't even vectorized anything yet.
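To put those numbers in perspective, here is a rough back-of-the-envelope calculation in TypeScript. It assumes the counts simply multiply (12 voices x 2 engines x 7 unison voices x 2 generators), which is one reading of the description above and ignores the per-generator stages, filters and effects.

```typescript
// Rough per-sample time budget, assuming the voice counts multiply at max load.
const sampleRate = 44100;
const voices = 12;
const enginesPerVoice = 2;
const unisonVoices = 7;
const generatorsPerEngine = 2;

const generatorInstances = voices * enginesPerVoice * unisonVoices * generatorsPerEngine;
const generatorSamplesPerSecond = generatorInstances * sampleRate;

// About 100 ms of processing per second of audio, as stated above.
const processingSeconds = 0.1;
const nanosecondsPerGeneratorSample = (processingSeconds / generatorSamplesPerSecond) * 1e9;

console.log(generatorInstances, generatorSamplesPerSecond, nanosecondsPerGeneratorSample.toFixed(2));
```

Under that assumption there are 336 generator instances and roughly 14.8 million generator samples per second, so 100 ms per second of audio works out to a bit under 7 ns per generator sample.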
I used Lit to make the UI. It still needs some polish, and there are still a lot of placeholders. I'm not a designer, but I'm pretty happy with what I managed to come up with.