Implement ailia.audio to CLAP

AcculusSasao commented 9 months ago

issue #1318

When the command-line argument --ailia_audio is specified, the mel spectrogram is computed with ailia.audio instead of librosa.
amplitude_to_db (power_to_db) is computed with numpy.
Numerical accuracy is pending a fix to ailia.audio.mel_spectrogram().
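
For reference, a numpy-only power_to_db can be sketched as below. This is not the code from this PR; it is a minimal illustration that assumes the same defaults as librosa.power_to_db (ref=1.0, amin=1e-10, top_db=80.0). The function name and parameters here are chosen only to mirror librosa.

```python
import numpy as np


def power_to_db(S, ref=1.0, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to decibels using numpy only.

    Assumption: defaults mirror librosa.power_to_db; this is an
    illustrative sketch, not the implementation used in clap.py.
    """
    S = np.asarray(S, dtype=np.float64)
    # Clamp to amin so log10 never sees zero or negative values.
    log_spec = 10.0 * np.log10(np.maximum(amin, S))
    # Express the result relative to the reference power.
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))
    if top_db is not None:
        # Limit the dynamic range to top_db below the peak.
        log_spec = np.maximum(log_spec, log_spec.max() - top_db)
    return log_spec


if __name__ == "__main__":
    print(power_to_db([1.0, 10.0, 100.0]))  # 0 dB, 10 dB, 20 dB
```

Keeping this step in numpy means only the mel spectrogram itself has to come from ailia.audio when --ailia_audio is set.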

kyakuno commented 9 months ago

ailia.audio.mel_spectrogram has an accuracy problem when fmin is set, so ailia SDK 1.2.17 is required. To be merged after ailia SDK 1.2.17 is released.

AcculusSasao commented 8 months ago

With ailia_audio 1.3.0, the mel spectrogram fmin/fmax problem has been fixed, and I confirmed that there is almost no error.

Running the sample

Original

$ python clap.py -e 0
 INFO arg_utils.py (13) : Start!
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0

 INFO arg_utils.py (163) : env_id: 0
 INFO arg_utils.py (166) : CPU
 INFO model_utils.py (86) : ONNX file and Prototxt file are prepared!
 INFO model_utils.py (86) : ONNX file and Prototxt file are prepared!
 INFO model_utils.py (86) : ONNX file and Prototxt file are prepared!
===== cosine similality between text and audio =====
audio: input.wav
cossim=0.1508, word=applause applaud clap
cossim=0.2941, word=The crowd is clapping.
cossim=0.0392, word=I love the contrastive learning
cossim=0.0754, word=bell
cossim=-0.0924, word=soccer
cossim=0.0310, word=open the door.
cossim=0.0850, word=applause
cossim=0.4185, word=dog
cossim=0.3813, word=dog barking
 INFO clap.py (179) : Script finished successfully.

ailia_audio, before the fix

$ python clap.py -e 0 --ailia_audio
 INFO arg_utils.py (13) : Start!
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0

 INFO arg_utils.py (163) : env_id: 0
 INFO arg_utils.py (166) : CPU
 INFO model_utils.py (86) : ONNX file and Prototxt file are prepared!
 INFO model_utils.py (86) : ONNX file and Prototxt file are prepared!
 INFO model_utils.py (86) : ONNX file and Prototxt file are prepared!
===== cosine similality between text and audio =====
audio: input.wav
cossim=0.1517, word=applause applaud clap
cossim=0.3049, word=The crowd is clapping.
cossim=0.0366, word=I love the contrastive learning
cossim=0.0805, word=bell
cossim=-0.0872, word=soccer
cossim=0.0468, word=open the door.
cossim=0.0886, word=applause
cossim=0.4182, word=dog
cossim=0.3826, word=dog barking
 INFO clap.py (179) : Script finished successfully.

ailia_audio, after the fix

$ python clap.py -e 0 --ailia_audio
 INFO arg_utils.py (13) : Start!
MESA-INTEL: warning: Performance support disabled, consider sysctl dev.i915.perf_stream_paranoid=0

 INFO arg_utils.py (163) : env_id: 0
 INFO arg_utils.py (166) : CPU
 INFO model_utils.py (86) : ONNX file and Prototxt file are prepared!
 INFO model_utils.py (86) : ONNX file and Prototxt file are prepared!
 INFO model_utils.py (86) : ONNX file and Prototxt file are prepared!
===== cosine similality between text and audio =====
audio: input.wav
cossim=0.1508, word=applause applaud clap
cossim=0.2941, word=The crowd is clapping.
cossim=0.0392, word=I love the contrastive learning
cossim=0.0754, word=bell
cossim=-0.0924, word=soccer
cossim=0.0310, word=open the door.
cossim=0.0850, word=applause
cossim=0.4185, word=dog
cossim=0.3813, word=dog barking
 INFO clap.py (179) : Script finished successfully.
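
The gap between the runs can be quantified with a short script. The values below are copied from the three logs above, in the order printed; the helper max_abs_diff is a hypothetical name introduced only for this illustration.

```python
# Cosine similarities copied from the three runs above, in printed order.
original = [0.1508, 0.2941, 0.0392, 0.0754, -0.0924, 0.0310, 0.0850, 0.4185, 0.3813]
before_fix = [0.1517, 0.3049, 0.0366, 0.0805, -0.0872, 0.0468, 0.0886, 0.4182, 0.3826]
after_fix = [0.1508, 0.2941, 0.0392, 0.0754, -0.0924, 0.0310, 0.0850, 0.4185, 0.3813]


def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two score lists."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    # librosa baseline vs. ailia.audio before the fmin/fmax fix
    print(f"before fix: {max_abs_diff(original, before_fix):.4f}")  # 0.0158
    # librosa baseline vs. ailia.audio after the fix
    print(f"after fix:  {max_abs_diff(original, after_fix):.4f}")   # 0.0000
```

At the four decimal places shown in the logs, the scores after the fix match the librosa baseline exactly, while the largest deviation before the fix is 0.0158 (for "open the door.").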

axinc-ai / ailia-models

Implement ailia.audio to CLAP #1385