GBA: Micro-optimization, and new ADPCM format for tracks

Aikku93 commented 1 year ago

Hi, really like your project!

On the GBA port, the music is using IMA-ADPCM as its encoding format. While this seems to work, the format has pretty poor quality in some cases due to high-frequency noise, which becomes very audible noise when playing at 10kHz.

Since the music has to be resampled to the native sampling rate before building anyway, I figured that it couldn't be harmful to outright change the encoding format to something better. So what I've done is to use is a variation of linear-predictive coding, combined with a pre-/post-filter pair for noise shaping. In testing, this gave much better perceptual quality, and decoding performance should be faster (0.6% faster if the decoding loop is not unrolled, or 17.6% faster if unrolled), plus a lot less RAM usage due to not needing to store the step table. If necessary, it's also possible to speed up decoding even more by removing the clamping step (9.1% faster than IMA-ADPCM when not unrolled, 26.0% faster when unrolled), at the cost of lower volume, or needing to apply a limiter to the original music prior to encoding. While I realize that sound really isn't the bottleneck here, I figure that every little bit helps.

Because I don't have access to the uncompressed resampled tracks, I've simply transcoded the data from TRACKS.IMA into TRACKS.AD4 for the time being, so the quality improvements aren't there yet (and in fact, become a little bit worse due to this process); if this proposal goes through, then the uncompressed music would need to be re-encoded (I'm happy to provide an encoder implementation; I just wasn't sure how it hooks into the packer).

I've also included a commit for slightly faster clamping in sndPCM_mix() that can be cherry-picked if need be. And as mentioned in b617e12, I've found a possible race condition: music.data is set before the state has been set up, which might cause issues if music.fill() is invoked in between there. The fix is simple enough (store music.data=NULL before setup, and store the final value after storing the rest of the state), so I haven't set it up as a separate commit.

XProger commented 1 year ago

hi, it looks GREAT, can you please check your method on proper converted tracks? I use convertTracks call from gba/packer to generate TRACKS.IMA package, the packer works on the ready to use data, all dirty job is done by ffmpeg and ima tools. Do you have the algorithm description link or wiki?

Aikku93 commented 1 year ago

Just got home from work, and will process the tracks soon.

The algorithm is one I came up with myself while testing variations on traditional ADPCM, but I can explain it briefly:

The input data is treated with the pre-filter: Hpre(z) = 1 - (7/8)z^-1 (a relatively strong high-pass filter; x'[n] = x[n] - x[n-1]*(7/8)). This emphasizes the frequencies that are most prone to audible distortion, so they can be coded more accurately.
For every pre-filtered sample x'[n]:
- A prediction is made by extracting the gradient of the last two encoded samples (pred[n] = y[n-1] - y[n-2]).
- A residue is formed from the real (pre-filtered) sample and the prediction (res[n] = x'[n] - pred[n]).
- The residue is quantized by dividing it by the current quantization level, and rounding to the nearest integer. This is then clamped to the range of -8..+7, and stored to the output stream, with the first sample in the lowest 4 bits, the next sample in the highest 4 bits (X[n]= clamp(round(res[n] / quant[n]), -8, +7)).
- To decode, the residue is multiplied by the quantizer, and added to the prediction (y[n] = X[n]*quant + pred[n]).
- The current quantization level is adjusted based on the absolute level of the residue (quant[n+1] = ceilDiv(quant[n] * AdaptationTable[abs(X[n])], 128), where AdaptationTable[] = {112, 128, 128, 128, 128, 136, 136, 192, 192}).
For the final output, the decoded samples y[n] have the post-filter applied, which is just the inverse of the pre-filter: Hpost(z) = 1 / Hpre(z) = 1 / (1 - (7/8)z^-1) (a relatively strong low-pass filter; y'[n] = y[n] + y[n-1]*(7/8)). This smooths out the signal (and "cancels" the pre-filter), reducing the high-frequency noise introduced by quantization.

None of the procedures are at all obscure for linear-prediction codecs, but I did come up with the values of the adaptation table myself, based on a lot of testing of different sounds and music a few years ago. As a side-note, the predictor can actually be adjusted to be something different than just the gradient (eg. it can be tuned for other frequency responses than just H(z) = 1 / (1 - z^-1 + z^-2)), but in testing, I found that the gradient method gives the best trade-off for all types of music, and is also one of the fastest to implement (other than straight delta-coding, but that one did sound pretty bad).

Kroc commented 1 year ago

For those of us following Git activity, could you post a video comparison of the audio quality? I'm interested to hear!

Aikku93 commented 1 year ago

Ok, so I had to basically strip out the music-packing part of the packer (I don't have all the requisite libraries installed...), and then modify it further to get it working (specifically: The part where it reads the track index needs to stop before the file extension, otherwise the number 4 in the extension "ad4" will be interpreted as being part of the track index; I'll see about adding this change to the source)... But I got it done. :D

I first passed the original audio through an experimental stereo->mono converter I'm working on (to hopefully better preserve the perceptual qualities of the original music), then passed through ffmpeg to resample and convert to raw PCM16, which was then fed into a small program I wrote to handle the encoding.

For the sake of testing, I've padded all the music to 176 sample chunks to avoid glitching on overrun (my ADPCM format will go /very/ wrong when it's given garbage data, and make /very/ loud noises). This should really be done inside music::fill(), but it's getting pretty late, so I will work on that tomorrow.

For now, here's a comparison of the title screen captured using VisualBoyAdvance (like I said: it's late, so I really can't be bothered to hook up hardware at the moment haha): IMA-ADPCM and New ADPCM (encoded with 128kbps Opus). (If I'm missing something when you said "post a video comparison", please let me know...)

The main difference in the noise/distortion is mostly that it's been pushed to lower frequencies, where it's less likely to be objectionable. It's still there (and can be obvious if you're looking for it), but I'm hoping it's a lot less objectionable than "plain" IMA-ADPCM.

XProger commented 1 year ago

I like the result, waiting for the packer fixes to commit 8)

Kroc commented 1 year ago

Much, much cleaner with the strings! There's still a lot of "wind noise" in the low range during the opening bars, but I wonder if that could be filtered out either before or after conversion

Aikku93 commented 1 year ago

Yeah, that "wind noise" is the trade-off of the noise shaping. What it's doing is basically grabbing the broadband noise that would normally happen without noise shaping, and pushes all of it into lower frequencies. This is generally a lot more "pleasing", because most of the noise would otherwise happen at around 3..5kHz (due to the 10kHz sampling rate), where the ear is most sensitive to distortion. As far as I'm aware, the only way to make it better is to use different prediction filters based on context (eg. changing the predictor to whatever works best for a given 128-sample block, or something like that), or dynamically changing the pre-/post-filter so that it falls on frequency bands that have low simultaneous masking levels, but this adds extra complexity to decoding (extra complexity = slower), so I've just left it as-is with the "all-purpose" predictor.

I've also gone ahead and scanned all the tracks for overflow, and adjusted the volume by -1.1dB so as to avoid it (88.1% of the original volume, so it shouldn't be noticeable; this is set in the conversion batch script, so can be easily changed when/if needed), meaning that the clamping step can now be removed, bringing the speed improvement to 9.1% faster than IMA-ADPCM (I'm not sure that unrolling the loop is worth it at the moment - even if it does make the improvement 26.0% faster -, because it does add an extra 300 bytes of code in IWRAM...).

The encoder source has been placed in the packer/ad4 folder for posterity, and the tweaked batch scripts can be found here (including a build of the encoder tool, and a prototype build of the stereo->mono tool I'm working on, both of which I used to create the new TRACKS.AD4). The way to use it is to run convertToMono.bat first to create the mono tracks, and then use convert.bat to actually handle conversion (it currently spits out the peak level of each decoded track, which I was using to detect for overflow potentials).

Finally... I wasn't actually able to get the packer tool built. So the changes I made to the out_GBA.h file have been done blindly, and need testing before merging.

Aikku93 commented 1 year ago

Actually, I did the maths: Even if the loop is fully unrolled, it's still better than IMA-ADPCM. Originally, IMA took 516 bytes of IWRAM (160 bytes of code + 356 bytes for the step table). This version of ADPCM uses 152 bytes of IWRAM for the looped version (136 bytes of code + 16 bytes adaptation table), and 408 bytes of IWRAM for the fully unrolled version (392 bytes of code + 16 bytes adaptation table). Meaning that even fully-unrolled, it will still save 108 bytes of IWRAM.

Ultimately, I guess it just depends on whether it's better to have an extra 108 bytes or 364 bytes to play with.

Aikku93 commented 1 year ago

Alright, I managed to remove the 'wind' noise: https://www.mediafire.com/file/9l75pqizptye1af

It only involves a change to the encoder (ie. packer/ad4/AD4.h), and the conversion batch script (the gain should be set to -0.71dB; there might be other values that don't cause overflow, but that's the one I tested with). I'm not sure what the etiquette is here, so I'll ask before I do anything: Should I add this change as another commit, or re-base at ace1855 and force-push?

XProger commented 1 year ago

@Aikku93 great! Just apply it to the PR, I'll test and merge it asap

Aikku93 commented 1 year ago

>_< I thought I'd fixed the issue where it generates a limit oscillation on silence (some of the voice samples are doing this, I just noticed). It would be unnoticeable if the output was 16bit, but un-rounded 8-bit output amplifies it. I'll try working on it some more, sorry about that.

Aikku93 commented 1 year ago

I've gone through every single track, and made sure that they end in silence rather than an annoying buzzing sound. The conversion script needs to have the gain set to -0.8dB (again, there might be different values that work, but that's what I built with).

Hopefully that should be the last of the issues, so feel free to refactor the commits however you like. :]

XProger commented 1 year ago

perf: 68 (old) vs 62 (new) with less noise, thanks!

Aikku93 commented 1 year ago

You're welcome! And thanks for the cleanup, too - writing assembly using explicitly named registers is very uncomfortable for me, which explains why I missed those issues.

XProger commented 1 year ago

but it's very useful when you need to re-arrange registers for relatively large chunk of the code, for fiq or stm/ldm optimization as example 8)

Aikku93 commented 1 year ago

Haha yeah, I can definitely see that, though (my go-to solution of three Find+Replace cycles is not very friendly towards commit history...).

Aikku93 commented 1 year ago

Small issue: It looks like 76b0348 overwrote data/TRACKS.AD4, replacing it with an older version.

XProger commented 1 year ago

@Aikku93 it replaced by freshly converted TRACKS.AD4

Aikku93 commented 1 year ago

Hm, you're sure that the ad4.exe file was generated from the most recent source code? Its output is different to a freshly-compiled version (it still has that 'wind noise'). And convert.bat also needs to be updated (ad4 temp.raw %OUT_FOLDER%%%~ni.ad4 -1.1 to ad4 temp.raw %OUT_FOLDER%%%~ni.ad4 -0.8; that last parameter is the gain control in dB).

Sorry if that explanation got buried in all the other comments (ie. that the encoder was updated + needs recompilation, and that the batch script needed updating to compensate and ensure no overflows).

XProger commented 1 year ago

ah, my bad, sorry

Aikku93 commented 1 year ago

All good! I'll try to remember to highlight important points in some way, in future.

Aikku93 commented 1 year ago

Uh, sorry to bring another issue up: The ad4.exe included in the 'tracks' folder has a compiler bug, leading to the gain parameter being ignored (which results in overflows in the output).

The problem is: atof() is returning a value in xmm0, but the rest of the code expects the return value to be in eax for some reason (specifically: call atof; movsd xmm0, [Literal 10.0]; pxor xmm1, xmm1; cvtsi2sd xmm1, eax; divsd xmm1, [Literal 20.0]; call pow;).

XProger / OpenLara

GBA: Micro-optimization, and new ADPCM format for tracks #452