bkraad47 / fat_llama

fat_llama is a Python package for upscaling audio files to FLAC or WAV formats using advanced audio processing techniques. It utilizes CUDA-accelerated calculations to enhance audio quality by upsampling and adding missing frequencies through FFT, resulting in richer and more detailed audio.
https://pypi.org/project/fat-llama/
BSD 3-Clause "New" or "Revised" License
7 stars 0 forks source link

[SUGGESTIONS] floating point, dequantize and more #18

Open MarcoRavich opened 4 months ago

MarcoRavich commented 4 months ago

Hi there, since I've some experiences in this field (audio delossify/upscale) I'd like to share what I have learned:

Last but not least, I've added your project to HyMPS project one (under AUDIO section \ Treatments page \ Enhancing subsection) so check out other interesting open projects to collaborate with.

Hope that helps/inspires !

EDIT I've also opened a 3ad @ hydrogenaudio about Fat Llama, enjoy.

bkraad47 commented 4 months ago

Much appreciated will get to it when I can

MarcoRavich commented 4 months ago

I also suggest you to deeply check this paper (that describes many interesting audio restoration techniques) and, of course, try to collaborate with @abreuwallace that unofficially implemented it here.

bkraad47 commented 3 months ago

Changes made to v-1.1.0 to address some of the topics...cheers

MarcoRavich commented 3 months ago

Bump.

I recalled an old - but still valid - comparison about MP3 decoders quality:

mp3_quality

...and this explains better why fp matters (google-translated from here):

There is another solution for decoding mp3 files that has been around for less than a year and whose technical specifications are quite impressive, and which should offer a quality that is significantly higher than that of MAD. This software is foobar2000, and uses the mpglib library as a decoder, largely modified by Peter Pawlowski. This decoding solution has the disadvantage of not supporting freeformat encodings, but as advantages, both theoretical and practical, we find the following characteristics:

  • decoding performed in 64 floating bits
  • integration of a disengageable dithering algorithm, with three noise shaping modes to limit its negative impact
  • full consideration of offsets, allowing gapless chaining of lame encodings (automatic) and any other mp3 encoder (manual)
  • full support for replaygain
  • management of several tag formats (ID3v1, ID3v2.x and APEv2)

All of these qualities are applicable to formats other than mp3, and - this is the most important aspect - are perfectly reproducible in the form of a test PCM file thanks to the complete diskwriter available with the software. Integrating foobar2000 should allow us to measure the contribution of the noise shaping technique, an essential complement to dithering that MAD strangely ignores. This technique should push back the nuisances linked to dithering (background noise) while preserving its contribution (higher resolution). From a theoretical point of view at least, foobar2000 should offer qualitatively superior decoding to that of MAD.

It's unfortunally no longer used/developed (a 2006-source code version shared from Peter here), but someone like @buhadram (that recently forked mpglibdll) can be interested in implement such optimizations.

I also suggest to check project like Audio Delossifier by @kroll-software that may help even more.

Last but not least, the approach you're appying seems very similar to the well-known Spectral band replication but without the encoded-datas for recontruction. This could be similar to what stereotool does in its Delossifier function, but you have to investigate better/deeper. In this field somone like @Abhipray - that some time ago developed an audio codec with SBR logic - may explain you better if it's feasible.

Anyway I believe that these types of alterations shouldn't be mandatory but, just like in stereotool, optionally activated by users.

Hope that helps.

bkraad47 commented 3 months ago

Hi the code was updated in version 1.1.0 to use fp 64 cheers

MarcoRavich commented 3 months ago

Hi the code was updated in version 1.1.0 to use fp 64 cheers

Nice.

Please join the 3ad I've started @ hydrogenaudio 'cause others are asking me things that I sincerely don't know (note that there are many audio and codecs experts there).

Thanks.

MarcoRavich commented 3 months ago

Here are some other interesting projects to collaborate with:

It would be also interesting to adopt better quality measurement method, such as the well-known Perceptual Evaluation of Audio Quality.

You can evaluate to integrate it in fatllama too: https://github.com/search?q=PEAQ+audio&type=repositories&s=updated

Last but not least, what about a "Jupyter Notebook" to let anyone test fatllama easily ? Check out Google Colab/HF Spaces/Paperspace/Kaggle/Jupiter/Deepnote/etc.

Hope that inspires !

MarcoRavich commented 3 months ago

Look and you shall find

Another interesting - even if speech-oriented - project with many cool resources (= papers) linked in: EnglishSpeechUpsampler by @jhetherly.

ivandustin commented 3 months ago

You can experiment with using transformer models to reconstruct frequencies for upscaling audio. For example, breathing sounds that are audible in CD quality but lost in MP3 could potentially be restored using a transformer model. While the model might introduce sounds that weren't present in the original recording, it can artificially enhance the details of the music. However, this approach requires significant effort and time to develop and fine-tune.

MarcoRavich commented 3 months ago

Thanks for your knowledge sharing @ivandustin !

Seems that well-known @IAHispano choosed @haoheliu's AudioSR for their Audio Upscaler:

At the moment the best approach seems to be the Audio Delossifier by @kroll-software one, which is trainable.

Anyway I would love to see a "scientific" (= spectral ?) quality comparison between all contenders...

...so I've decided to reorganize (and expand) the HyMPS' collection: AUDIO section \ AI-based category \ Enhancers page \ Upscalers

Enjoy !

MarcoRavich commented 3 months ago

Well the HA discussion is going on and - as always - something interesting comes out:

DeltaWave Audio Null Comparator: a free experimental software built to allow detailed comparison of two different captures of the same audio signal.

It performs a wide range of measurements and comparisons, which could be very useful in testing the effectiveness and accuracy of fat llama: please check it out.

EDIT Here's a DeltaWave report - with some graphs - on experiment files you shared:

DeltaWave v2.0.13, 2024-08-20T12:23:57.2062764+02:00 Reference: test_original_flac.flac[L] 7278467 samples 192000Hz 24bits, stereo, MD5=00 Comparison: test_converted_flac.flac[L] 8491546 samples 224000Hz 24bits, stereo, MD5=00 Settings: Gain:True, Remove DC:True Non-linear Gain EQ:False Non-linear Phase EQ: False EQ FFT Size:65536, EQ Frequency Cut: 0Hz - 0Hz, EQ Threshold: -500dB Correct Non-linearity: False Correct Drift:True, Precision:30, Subsample Align:True Non-Linear drift Correction:False Upsample:True, Window:Kaiser Spectrum Window:Kaiser, Spectrum Size:32768 Spectrogram Window:Hann, Spectrogram Size:4096, Spectrogram Steps:2048 Filter Type:FIR, window:Kaiser, taps:262144, minimum phase=False Dither:False bits=0 Trim Silence:True Enable Simple Waveform Measurement: False Resampled Reference to 224000Hz Discarding Reference: Start=0s, End=0s Discarding Comparison: Start=0s, End=0s Initial peak values Reference: -3,358dB Comparison: 0dB Initial RMS values Reference: -20,25dB Comparison: -16,675dB Null Depth=101,926dB Trimming 0 samples at start and 0 samples at the end that are below -90,31dB level X-Correlation offset: -3 samples Trimming 0 samples at start and 0 samples at the end that are below -90,31dB level Drift computation quality, #1: Excellent (0,3μs) Trimmed 171244 samples ( 764,482143ms) front, 224860 samples ( 1003,839286ms end) Final peak values Reference: -3,358dB Comparison: -3,801dB Final RMS values Reference: -20,505dB Comparison: -20,765dB Gain= 3,822dB (1,5528x) DC=0,00001 Phase offset=-0,013447ms (-3,012 samples) Difference (rms) = -32,77dB [-39,68dBA] Correlated Null Depth=42,37dB [41,89dBA] Clock drift: -0,01 ppm Files are NOT a bit-perfect match (match=0,13%) at 16 bits Files are NOT a bit-perfect match (match=0%) at 24 bits Files match @ 49,998% when reduced to 6,5 bits ---- Phase difference (full bandwidth): 143,874834932871° 0-10kHz: 23,92° 0-20kHz: 56,12° 0-24kHz: 66,07° Timing error (rms jitter): 514ns PK Metric (step=400ms, overlap=50%): RMS=-37,5dBFS Median=-46,7 Max=-29,4 99%: -30,89 75%: -34,7 50%: -46,7 25%: -50,74 1%: -55,33 gn=0,644017646514286, dc=1,40872942458951E-05, dr=-8,96573621136896E-09, of=-3,01221162486093 DONE! Signature: 85634bda22fde7ebef2c2e8a0c038e7a RMS of the difference of spectra: -78,1160326779584dB DF Metric (step=400ms, overlap=0%): Median=-18,9dB Max=-10,6dB Min=-30,8dB 1% > -30,78dB 10% > -27,18dB 25% > -24,67dB 50% > -18,88dB 75% > -13,32dB 90% > -12,28dB 99% > -2,19dB Linearity 2,1bits @ 0.5dB error

As already asked, a jupiter notebook (that let users run tests on Colab, for example) would be great for those - like me too - who don't own CUDA-enabled GPUs.

Last but not least, I've also found this interesting repo by @franciscorafart that may help too.

Hope that helps.