Rikorose / DeepFilterNet

Noise suppression using deep filtering
https://huggingface.co/spaces/hshr/DeepFilterNet2

Is there a native compile for the df binary? #190

Closed StuartIanNaylor closed 1 year ago

StuartIanNaylor commented 1 year ago

Is there any info on compiling the binary with -march=native?

Just started looking, so apologies if this is obvious.

Rikorose commented 1 year ago

I tested this on a Raspberry Pi 4 and it did not make any difference, mostly because all processing-intensive parts (FFT, neural network via tract) enable SIMD optimizations at runtime after CPU detection.
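(Editor's note: a minimal sketch of the runtime-dispatch pattern such libraries use; this is illustrative, not tract's or DeepFilterNet's actual code. Because the feature check happens at runtime, a `-C target-cpu=native` build gains little for these hot paths.)

```rust
// Illustrative runtime SIMD dispatch: the fast path is chosen per CPU
// at runtime, so compile-time -march / -C target-cpu settings do not
// change which kernel runs.
fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // an AVX2 kernel would be dispatched here
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // a NEON kernel would be dispatched here
        }
    }
    // portable fallback (also what the hypothetical kernels would compute)
    xs.iter().sum()
}

fn main() {
    println!("sum = {}", sum(&[1.0, 2.0, 3.0]));
}
```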

StuartIanNaylor commented 1 year ago

Yeah, I have an RK3588, which is an A76/A55 ARMv8.2-A, and just wondered about trying it. I have been playing with https://github.com/ggerganov/whisper.cpp, and doing the same there got me a 30% improvement, so I was curious.

Rikorose commented 1 year ago

Try it out and report back :rocket:

StuartIanNaylor commented 1 year ago

Hendrik, that was why I was asking, as I know zero about Rust apart from a quick Google:

export RUSTFLAGS="-C target-cpu=native"
cargo build --release

or

rustc -C target-cpu=native test.rs

But I am such a noob with Rust I cannot even figure out where `cargo build` puts the binary :) I will try, as I am getting good results (Enhanced audio file test2.wav in 20.67 (RTF: 0.25521383)), but I was wondering if it might improve, and maybe whether it could be threaded?
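(Editor's note, for other readers: the RTF printed by the tool is just processing time divided by audio duration, so the 20.67 s above corresponds to roughly 81 s of audio; below 1.0 means faster than real time. A trivial sketch:)

```rust
// Real-time factor: processing time / audio duration.
// RTF < 1.0 means faster than real time.
fn rtf(processing_s: f64, audio_s: f64) -> f64 {
    processing_s / audio_s
}

fn main() {
    // ~81 s of audio processed in 20.67 s, as in the log above:
    println!("RTF = {:.4}", rtf(20.67, 81.0));
}
```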

It's probably very easy, but from the repo root `cargo build --release` seems to build only the LADSPA lib; probably because it is so easy, I am just not seeing how to build the binary?

rock@rock-5b:~/nvme/DeepFilterNet-0.3.0/target/release$ ls
build  examples     libdeep_filter_ladspa.d   libdf.d     liblibdf.d   liblibdfdata.d
deps   incremental  libdeep_filter_ladspa.so  libdf.rlib  liblibdf.so  liblibdfdata.so

I give in: going into the deepfilter folder and running from there, the version I get seems to be a debug version, as it behaves differently and I get this:

[2022-11-24T00:30:34Z INFO  df::tract] Processed frame in 2.63ms (analysis: 0.02ms, encoder: 1.10ms, erb_decoder: 0.79ms, df_decoder: 0.70ms, synthesis: 0.02ms)
[2022-11-24T00:30:34Z INFO  df::tract] Processed frame in 2.65ms (analysis: 0.02ms, encoder: 1.10ms, erb_decoder: 0.78ms, df_decoder: 0.73ms, synthesis: 0.02ms)
[2022-11-24T00:30:34Z INFO  df::tract] Processed frame in 2.63ms (analysis: 0.02ms, encoder: 1.11ms, erb_decoder: 0.77ms, df_decoder: 0.71ms, synthesis: 0.02ms)
[2022-11-24T00:30:34Z INFO  df::tract] Processed frame in 2.64ms (analysis: 0.03ms, encoder: 1.11ms, erb_decoder: 0.77ms, df_decoder: 0.71ms, synthesis: 0.02ms)
[2022-11-24T00:30:34Z INFO  df::tract] Processed frame in 2.64ms (analysis: 0.02ms, encoder: 1.12ms, erb_decoder: 0.77ms, df_decoder: 0.70ms, synthesis: 0.02ms)
Enhanced audio file test2.wav in 21.51 (RTF: 0.2655859)

whilst the prebuilt binary gives:

[2022-11-24T00:31:49Z WARN  df::tract] Possible clipping detected (1.000).
[2022-11-24T00:31:49Z WARN  df::tract] Possible clipping detected (1.000).
[2022-11-24T00:31:49Z WARN  df::tract] Possible clipping detected (1.000).
[2022-11-24T00:31:49Z WARN  df::tract] Possible clipping detected (0.997).
[2022-11-24T00:31:49Z WARN  df::tract] Possible clipping detected (1.000).
Enhanced audio file test2.wav in 20.99 (RTF: 0.25912935)
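(Editor's note: a warning like "Possible clipping detected" typically just means the output peak reached full scale. A minimal sketch of such a check, assuming a simple peak threshold rather than DeepFilterNet's exact criterion:)

```rust
// Sketch of a clipping check like the one behind the warnings above
// (assumed threshold; not necessarily DeepFilterNet's exact logic).
fn peak(frame: &[f32]) -> f32 {
    frame.iter().fold(0.0f32, |m, &x| m.max(x.abs()))
}

fn possible_clipping(frame: &[f32]) -> bool {
    peak(frame) >= 0.99
}

fn main() {
    println!("{}", possible_clipping(&[0.2, -1.0, 0.5])); // peak at full scale
    println!("{}", possible_clipping(&[0.2, -0.5]));      // well below full scale
}
```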

Rikorose commented 1 year ago

Threading is not supported by the DNN inference framework.

If you want to build the binary use:

cargo build -p deep_filter --release --bin deep-filter --features "bin,tract,wav-utils,transforms"

StuartIanNaylor commented 1 year ago

native build: Enhanced audio file test2.wav in 20.58 (RTF: 0.25401807)

original binary: Enhanced audio file test2.wav in 20.66 (RTF: 0.25508824)

So very little difference. But looking at Rust and liking it, it seems to trade some of that low-level granularity for simplicity and interoperability with things like https://doc.rust-lang.org/core/arch/aarch64/index.html#. So pointless in terms of results, but very interesting and informative for a Rust noob. I guess if you're not such a Rust noob, there is https://rustc-dev-guide.rust-lang.org/mir/optimizations.html.

Shame about the threading, as looking at the model it doesn't split very easily?

Rikorose commented 1 year ago

Well you could run the ERB decoder and DF decoder in parallel I guess.

StuartIanNaylor commented 1 year ago

Yeah, I have just been sitting here feeding https://github.com/ggerganov/whisper.cpp via DeepFilterNet, where it doesn't matter so much as I am getting 4x real time, but likely even on a small core it would be faster than real time. I was thinking about a Pi0W, but likely a Pi2 (which it really is) would still run it in a single thread? I haven't tested, hence the question mark, but I will, as I have one and they may eventually be for sale again.

Running the ERB decoder and DF decoder in parallel is probably superior: last time I ran TFLite with threading, the scaling was an improvement but was nowhere near 2x, and more than 2 threads seemed to have little or no effect beyond 2 threads. So I presume running the ERB decoder and DF decoder in parallel is going to be much nearer 2x than a threading framework anyway!

I think DeepFilterNet would be brilliant for a KWS, where it is basically far simpler image matching. With Whisper, unless you drop ATTEN_LIM_DB to about 10 dB (still playing), it actually makes Whisper worse, as the bigger models still work with large additional noise. Still, if you do drop ATTEN_LIM_DB to about 10 dB you get fewer DeepFilterNet artefacts, as 100 dB is a very high target, and then you get an improvement with all Whisper model sizes. I am not exactly sure yet which settings are optimum with Whisper, as it is pretty awesome but only really gets to awesome level with the medium model or above.
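(Editor's note, for readers unfamiliar with ATTEN_LIM_DB: one common way such a limit is implemented is to mix part of the noisy input back into the enhanced output, so noise is never attenuated by more than the limit. A sketch under that assumption, not necessarily DeepFilterNet's exact code:)

```rust
// Attenuation-limit sketch (assumed implementation): mix the noisy
// signal back in with gain k = 10^(-lim_db / 20), capping suppression.
fn apply_atten_lim(noisy: &[f32], enhanced: &[f32], lim_db: f32) -> Vec<f32> {
    let k = 10f32.powf(-lim_db / 20.0);
    noisy
        .iter()
        .zip(enhanced)
        .map(|(&n, &e)| k * n + (1.0 - k) * e)
        .collect()
}

fn main() {
    let noisy = [0.5f32, -0.5];
    let enhanced = [0.1f32, -0.1];
    // lim_db = 0 dB => output is just the noisy input (no suppression):
    println!("{:?}", apply_atten_lim(&noisy, &enhanced, 0.0));
    println!("{:?}", apply_atten_lim(&noisy, &enhanced, 10.0));
}
```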

Likely you could get a much lighter ASR that you can train, and preprocess much (or all) of the dataset with various levels of noise passed through DeepFilterNet; that is what I am thinking of doing with KWS, as I expect the results to be pretty awesome. I have used DTLN before, but DeepFilterNet far surpasses it. OpenAI released the model but kept the training details to themselves.

Which got me thinking: Briezhn, with some small modifications and training, used it for AEC to cancel a reference signal. I wonder if the same is possible here, as with the attenuation levels you get, that would be an awesome non-linear AEC able to run on fairly modest non-DSP hardware.

PS Hendrik, have you ever tried applying a model to source separation, where it extracts voice rather than cancels noise? Seriously impressed, and if you ever have an interest in the last two, please do pursue them.

Thnx

StuartIanNaylor commented 1 year ago

@Rikorose PS this might be of passing interest: with C++ I was just lucky, as per https://github.com/ggerganov/whisper.cpp/issues/89#issuecomment-1324415070 the dev is specifically optimising for Macs, which are also ARMv8.2, so just in this case march=native gave me 30%, as the code is that specific. Likely it also causes penalties for non-ARMv8.2, so it's all swings and roundabouts, but I thought I would post for interest only; it is also just why I was asking out of curiosity.

I am really liking what I see with Rust, as it is sort of C++ with a Python-like package manager, and it is only the C++ purists optimising for a specific instruction set who have any argument that it's faster. For someone at my knowledge level this has made things much clearer, and yeah, I am thinking about taking the dive with Rust, so thanks.

Rikorose commented 1 year ago

> Running the ERB decoder and DF decoder in parallel is probably superior: last time I ran TFLite with threading, the scaling was an improvement but was nowhere near 2x, and more than 2 threads seemed to have little or no effect beyond 2 threads. So I presume running the ERB decoder and DF decoder in parallel is going to be much nearer 2x than a threading framework anyway!

You still have some sequential parts, like the STFT and the encoder DNN.
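(Editor's note: a toy sketch of that split, assuming a per-frame pipeline where the encoder runs first and the two decoders consume its output; the arithmetic is a stand-in for the real DNNs. The sequential STFT/encoder bounds the achievable speedup, per Amdahl's law.)

```rust
use std::thread;

// Stand-ins for the real stages (illustrative only):
fn encoder(x: f32) -> f32 { x * 2.0 }     // sequential encoder DNN
fn erb_decoder(e: f32) -> f32 { e + 1.0 } // ERB decoder
fn df_decoder(e: f32) -> f32 { e - 1.0 }  // DF decoder

// Run the encoder sequentially, then both decoders in parallel
// using scoped threads (std::thread::scope, Rust 1.63+).
fn process_frame(x: f32) -> (f32, f32) {
    let e = encoder(x); // sequential part
    thread::scope(|s| {
        let erb = s.spawn(move || erb_decoder(e));
        let df = s.spawn(move || df_decoder(e));
        (erb.join().unwrap(), df.join().unwrap())
    })
}

fn main() {
    println!("{:?}", process_frame(3.0));
}
```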

> I think DeepFilterNet would be brilliant for a KWS, where it is basically far simpler image matching. With Whisper, unless you drop ATTEN_LIM_DB to about 10 dB (still playing), it actually makes Whisper worse, as the bigger models still work with large additional noise.

Yes, I guess without an attenuation limit the ASR performance gets worse due to the remaining artifacts. What attenuation limit did you settle on for best ASR performance?

> Which got me thinking: Briezhn, with some small modifications and training, used it for AEC to cancel a reference signal. I wonder if the same is possible here, as with the attenuation levels you get, that would be an awesome non-linear AEC able to run on fairly modest non-DSP hardware.

I did not experiment with AEC yet.

> @Rikorose PS this might be of passing interest: with C++ I was just lucky, as per https://github.com/ggerganov/whisper.cpp/issues/89#issuecomment-1324415070 the dev is specifically optimising for Macs, which are also ARMv8.2, so just in this case march=native gave me 30%, as the code is that specific.

The DNN inference framework (tract) has optimized assembly kernels for several ARM architectures. Thus, march=native won't make a huge difference for DeepFilterNet.

StuartIanNaylor commented 1 year ago

> Yes, I guess without an attenuation limit the ASR performance gets worse due to the remaining artifacts. What attenuation limit did you settle on for best ASR performance?

It was actually really low, 10 dB, and only on the tiny and base models, as Whisper seems super sensitive. Somewhere in the model there is a voice filter, or the model acts like a natural voice filter, and likely there are frequencies and lower-magnitude elements being filtered out that Whisper relies on. PS it just occurred to me to try a double DeepFilterNet, feeding DeepFilterNet audio into DeepFilterNet, purely out of curiosity about the results.

If we had access to OpenAI's training routines, then likely creating a dataset with varying noise and clean audio processed by DeepFilterNet would make this a non-issue; but likely it is a matter of finding alternatives to Whisper that can be trained.

> I did not experiment with AEC yet

Non-linear AEC of DeepFilterNet standard would be amazing, as again it can go central, and satellites can record a reference channel with capture.

I don't think it really matters so much, as C++ only gives optimisation advantages for very specific instruction sets, whilst a distributed binary is never going to be that specific anyway.

StuartIanNaylor commented 1 year ago

@Rikorose I am having a go at creating a KWS model where the dataset has been created with noise varying from 30 to 0 dB, with deepfilter running at 28 dB to start with. Will let you know how things go, as the dataset takes some time to create, also with different levels to test vs standard training with and without noise.
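(Editor's note: for anyone reproducing this kind of augmentation, a minimal sketch of mixing noise into clean audio at a target SNR; illustrative only, not the author's actual pipeline.)

```rust
// Mix noise into clean speech at a target SNR in dB (illustrative).
fn mix_at_snr(clean: &[f32], noise: &[f32], snr_db: f32) -> Vec<f32> {
    let power = |x: &[f32]| x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    // Scale the noise so that power(clean) / power(gain * noise) == 10^(snr/10):
    let gain = (power(clean) / (power(noise) * 10f32.powf(snr_db / 10.0))).sqrt();
    clean.iter().zip(noise).map(|(&c, &n)| c + gain * n).collect()
}

fn main() {
    let clean = [1.0f32, -1.0, 1.0, -1.0];
    let noise = [1.0f32, 1.0, 1.0, 1.0];
    // At 0 dB SNR with equal powers, the noise gain is exactly 1.0:
    println!("{:?}", mix_at_snr(&clean, &noise, 0.0));
}
```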

StuartIanNaylor commented 1 year ago

@Rikorose Hi Hendrik, I just got round to testing on a Pi3, and the LADSPA plugin unfortunately is too much load. If you ever have the time or inclination (it is probably too much for me), a version running the ERB decoder and DF decoder in parallel would likely be perfect and would instantly solve the 'barge in' problem of low-end Pi KWS.

StuartIanNaylor commented 1 year ago

@Rikorose Hendrik, I created a KWS and it works great with DeepFilterNet. Just out of interest, you should try OpenAI's Whisper, as DeepFilterNet kills its recognition and I am trying to figure out why?!

The medium model is the one to go for, and maybe it has some sort of filtering built in already, as compared to any ASR I know, its ability with noise is amazing. It is a curious one, as a straight MFCC KWS works great and Whisper not so much, and it might be interesting for you to try, as I wonder what information is lost: frequency, timing... Dunno, but interesting. With the KWS I thought I might have to add noise to the dataset and run DeepFilterNet over it to create the training dataset, to get a fingerprint of DeepFilterNet, but it doesn't seem to need it. Only mentioning it as the level of effect it has is quite curious and surprising...
The Medium model is the one to go for and maybe it has some sort of filtering in already as compared to any ASR I know its ability with noise is amazing. Its a curious one as straight MFC KWS works great and Whisper not so and it might be interesting for you to try as I wonder what information, frequency, timing... Dunno but interesting. With the KWS I thought I may have to add noise to the dataset run DeepFilterNet to create the training dataset to get a finger print of DeepfilterNet but doesn't seem to need it. Only mentioning as its quite curious to the level effect it has and surprising...