[SUGGESTIONS] floating point, dequantize and more

MarcoRavich commented 4 months ago

Hi there, since I've some experiences in this field (audio delossify/upscale) I'd like to share what I have learned:

During a lossy audio treating, the best approach is to carefully decode and preserve it as much as possible: that's why a floating point internal computing and output grants better fidelity;
Operations like normalization (that alters the level of the original signal) should be optional and not part of the process;
Since your software relies on FFMPEG, I strongly suggest you to always use the -drc_scale 0 parameter when decoding lossy sources (even if it's AC3-specific feature);
Always seek for new scientific papers/code - such as @zawi01's Audio Dequantization Using (Co)Sparse (Non)Convex Methods - that may help to achieve better results;
While you've probably already done this, I recommend you to check how Izotope' Spectral Recovery and Stereotool's Delossifier functions works, for possible inspirations.

Last but not least, I've added your project to HyMPS project one (under AUDIO section \ Treatments page \ Enhancing subsection) so check out other interesting open projects to collaborate with.

Hope that helps/inspires !

EDIT I've also opened a 3ad @ hydrogenaudio about Fat Llama, enjoy.

bkraad47 commented 4 months ago

Much appreciated will get to it when I can

MarcoRavich commented 4 months ago

I also suggest you to deeply check this paper (that describes many interesting audio restoration techniques) and, of course, try to collaborate with @abreuwallace that unofficially implemented it here.

bkraad47 commented 3 months ago

Changes made to v-1.1.0 to address some of the topics...cheers

MarcoRavich commented 3 months ago

Bump.

I recalled an old - but still valid - comparison about MP3 decoders quality:

mp3_quality

...and this explains better why fp matters (google-translated from here):

There is another solution for decoding mp3 files that has been around for less than a year and whose technical specifications are quite impressive, and which should offer a quality that is significantly higher than that of MAD. This software is foobar2000, and uses the mpglib library as a decoder, largely modified by Peter Pawlowski. This decoding solution has the disadvantage of not supporting freeformat encodings, but as advantages, both theoretical and practical, we find the following characteristics:

decoding performed in 64 floating bits

integration of a disengageable dithering algorithm, with three noise shaping modes to limit its negative impact

full consideration of offsets, allowing gapless chaining of lame encodings (automatic) and any other mp3 encoder (manual)

full support for replaygain

management of several tag formats (ID3v1, ID3v2.x and APEv2)

All of these qualities are applicable to formats other than mp3, and - this is the most important aspect - are perfectly reproducible in the form of a test PCM file thanks to the complete diskwriter available with the software. Integrating foobar2000 should allow us to measure the contribution of the noise shaping technique, an essential complement to dithering that MAD strangely ignores. This technique should push back the nuisances linked to dithering (background noise) while preserving its contribution (higher resolution). From a theoretical point of view at least, foobar2000 should offer qualitatively superior decoding to that of MAD.

It's unfortunally no longer used/developed (a 2006-source code version shared from Peter here), but someone like @buhadram (that recently forked mpglibdll) can be interested in implement such optimizations.

I also suggest to check project like Audio Delossifier by @kroll-software that may help even more.

Last but not least, the approach you're appying seems very similar to the well-known Spectral band replication but without the encoded-datas for recontruction. This could be similar to what stereotool does in its Delossifier function, but you have to investigate better/deeper. In this field somone like @Abhipray - that some time ago developed an audio codec with SBR logic - may explain you better if it's feasible.

Anyway I believe that these types of alterations shouldn't be mandatory but, just like in stereotool, optionally activated by users.

Hope that helps.

bkraad47 commented 3 months ago

Hi the code was updated in version 1.1.0 to use fp 64 cheers

MarcoRavich commented 3 months ago

Hi the code was updated in version 1.1.0 to use fp 64 cheers

Nice.

Please join the 3ad I've started @ hydrogenaudio 'cause others are asking me things that I sincerely don't know (note that there are many audio and codecs experts there).

Thanks.

MarcoRavich commented 3 months ago

Here are some other interesting projects to collaborate with:

@matthewmcq's upscalemp3;
@WaveGenAI's AudioEnhancer;
@ivandustin's learnaudio.

It would be also interesting to adopt better quality measurement method, such as the well-known Perceptual Evaluation of Audio Quality.

You can evaluate to integrate it in fatllama too: https://github.com/search?q=PEAQ+audio&type=repositories&s=updated

Last but not least, what about a "Jupyter Notebook" to let anyone test fatllama easily ? Check out Google Colab/HF Spaces/Paperspace/Kaggle/Jupiter/Deepnote/etc.

Hope that inspires !

MarcoRavich commented 3 months ago

Look and you shall find

Another interesting - even if speech-oriented - project with many cool resources (= papers) linked in: EnglishSpeechUpsampler by @jhetherly.

ivandustin commented 3 months ago

You can experiment with using transformer models to reconstruct frequencies for upscaling audio. For example, breathing sounds that are audible in CD quality but lost in MP3 could potentially be restored using a transformer model. While the model might introduce sounds that weren't present in the original recording, it can artificially enhance the details of the music. However, this approach requires significant effort and time to develop and fine-tune.

MarcoRavich commented 3 months ago

Thanks for your knowledge sharing @ivandustin !

Seems that well-known @IAHispano choosed @haoheliu's AudioSR for their Audio Upscaler:

At the moment the best approach seems to be the Audio Delossifier by @kroll-software one, which is trainable.

Anyway I would love to see a "scientific" (= spectral ?) quality comparison between all contenders...

...so I've decided to reorganize (and expand) the HyMPS' collection: AUDIO section \ AI-based category \ Enhancers page \ Upscalers

Enjoy !

MarcoRavich commented 3 months ago

Well the HA discussion is going on and - as always - something interesting comes out:

DeltaWave Audio Null Comparator: a free experimental software built to allow detailed comparison of two different captures of the same audio signal.

It performs a wide range of measurements and comparisons, which could be very useful in testing the effectiveness and accuracy of fat llama: please check it out.

EDIT Here's a DeltaWave report - with some graphs - on experiment files you shared:

DeltaWave v2.0.13, 2024-08-20T12:23:57.2062764+02:00 Reference: test_original_flac.flac[L] 7278467 samples 192000Hz 24bits, stereo, MD5=00 Comparison: test_converted_flac.flac[L] 8491546 samples 224000Hz 24bits, stereo, MD5=00 Settings: Gain:True, Remove DC:True Non-linear Gain EQ:False Non-linear Phase EQ: False EQ FFT Size:65536, EQ Frequency Cut: 0Hz - 0Hz, EQ Threshold: -500dB Correct Non-linearity: False Correct Drift:True, Precision:30, Subsample Align:True Non-Linear drift Correction:False Upsample:True, Window:Kaiser Spectrum Window:Kaiser, Spectrum Size:32768 Spectrogram Window:Hann, Spectrogram Size:4096, Spectrogram Steps:2048 Filter Type:FIR, window:Kaiser, taps:262144, minimum phase=False Dither:False bits=0 Trim Silence:True Enable Simple Waveform Measurement: False Resampled Reference to 224000Hz Discarding Reference: Start=0s, End=0s Discarding Comparison: Start=0s, End=0s Initial peak values Reference: -3,358dB Comparison: 0dB Initial RMS values Reference: -20,25dB Comparison: -16,675dB Null Depth=101,926dB Trimming 0 samples at start and 0 samples at the end that are below -90,31dB level X-Correlation offset: -3 samples Trimming 0 samples at start and 0 samples at the end that are below -90,31dB level Drift computation quality, #1: Excellent (0,3μs) Trimmed 171244 samples ( 764,482143ms) front, 224860 samples ( 1003,839286ms end) Final peak values Reference: -3,358dB Comparison: -3,801dB Final RMS values Reference: -20,505dB Comparison: -20,765dB Gain= 3,822dB (1,5528x) DC=0,00001 Phase offset=-0,013447ms (-3,012 samples) Difference (rms) = -32,77dB [-39,68dBA] Correlated Null Depth=42,37dB [41,89dBA] Clock drift: -0,01 ppm Files are NOT a bit-perfect match (match=0,13%) at 16 bits Files are NOT a bit-perfect match (match=0%) at 24 bits Files match @ 49,998% when reduced to 6,5 bits ---- Phase difference (full bandwidth): 143,874834932871° 0-10kHz: 23,92° 0-20kHz: 56,12° 0-24kHz: 66,07° Timing error (rms jitter): 514ns PK Metric (step=400ms, overlap=50%): RMS=-37,5dBFS Median=-46,7 Max=-29,4 99%: -30,89 75%: -34,7 50%: -46,7 25%: -50,74 1%: -55,33 gn=0,644017646514286, dc=1,40872942458951E-05, dr=-8,96573621136896E-09, of=-3,01221162486093 DONE! Signature: 85634bda22fde7ebef2c2e8a0c038e7a RMS of the difference of spectra: -78,1160326779584dB DF Metric (step=400ms, overlap=0%): Median=-18,9dB Max=-10,6dB Min=-30,8dB 1% > -30,78dB 10% > -27,18dB 25% > -24,67dB 50% > -18,88dB 75% > -13,32dB 90% > -12,28dB 99% > -2,19dB Linearity 2,1bits @ 0.5dB error

As already asked, a jupiter notebook (that let users run tests on Colab, for example) would be great for those - like me too - who don't own CUDA-enabled GPUs.

Last but not least, I've also found this interesting repo by @franciscorafart that may help too.

Hope that helps.

bkraad47 / fat_llama

[SUGGESTIONS] floating point, dequantize and more #18