On Bitrate in paper Is synthetic voice detection research going into the right direction?

Dear authors,

I quickly read the paper and found it very interesting. I do have many questions to ask, but the first thing I would like to discuss with you is the bitrate.

Based on what I know about the data artifacts in the database, I am not surprised to see that the bitrate distributions differ. In fact, I think this is due to the silence issue [1]. Bitrate is just another way of exposing the silence artifact. Of course, another way is to plot the distribution of silence (see [1], Figure 1).

Here are my arguments:

1. The bit rate of PCM WAV is constant: it equals sampling rate * bits per sample (for single-channel audio). If we convert FLAC to WAV, we will see 256 kbps for all the data in the database;
2. The "bit rate" of FLAC is simply computed as file size (bytes) * 8 bits per byte / duration (s). Because FLAC compresses the audio (losslessly), the file size differs from file to file, and so does the "bit rate". Given the same duration, the more an audio file can be compressed, the smaller its size and the lower its bit rate;
3. Thus, the difference in "bit rate" reflects how well the audio can be compressed (losslessly) by FLAC. I am not familiar with the detailed algorithm, but I think FLAC may use linear prediction (https://xiph.org/flac/format.html). The general idea is that a random signal needs more bits; in contrast, a signal needs fewer bits if it can be fitted well by a linear prediction model;
4. Bona fide utterances in ASVspoof2019 have long leading / trailing silence segments, and this is where FLAC needs fewer bits per second. Thus, they have smaller file sizes and lower bit rates. Many spoofed trials do not have long silence, so they need more bits per second, which gives larger file sizes and higher bit rates;
5. A17, A18, and A19 are voice conversion systems. They are sourced from bona fide speech, and they keep the leading / trailing silence. Therefore, their bit rates are not affected by the silence issue.
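
To illustrate the general idea in the arguments above, here is a rough Python sketch. It is not FLAC's actual encoder: it only applies the order-2 fixed predictor (one of the simple predictors described in the FLAC format documentation) and uses a crude, hypothetical cost proxy (approx_bits_per_sample) in place of real entropy coding. The three synthetic signals and their amplitudes are assumptions chosen purely for illustration.

```python
import math
import random


def fixed_order2_residual(samples):
    """Prediction error of the order-2 fixed predictor: e[n] = x[n] - 2*x[n-1] + x[n-2]."""
    return [samples[n] - 2 * samples[n - 1] + samples[n - 2]
            for n in range(2, len(samples))]


def approx_bits_per_sample(samples):
    """Hypothetical cost proxy: log2(mean |residual| + 1) plus one sign bit.

    Real FLAC entropy-codes the residual; this is only a rough stand-in.
    """
    residual = fixed_order2_residual(samples)
    mean_abs = sum(abs(e) for e in residual) / len(residual)
    return math.log2(mean_abs + 1.0) + 1.0


SAMPLE_RATE = 16000  # Hz, as in the database
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# One second each of three assumed integer-valued signals.
near_silence = [rng.randint(-2, 2) for _ in range(SAMPLE_RATE)]
tone = [round(8000 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE))
        for n in range(SAMPLE_RATE)]
noise = [rng.randint(-8000, 8000) for _ in range(SAMPLE_RATE)]

for name, signal in [("near-silence", near_silence),
                     ("200 Hz tone", tone),
                     ("white noise", noise)]:
    print(f"{name:>12}: ~{approx_bits_per_sample(signal):.1f} bits/sample")
```

With these assumed signals, the near-silent segment should come out cheapest and the white noise most expensive, which mirrors the argument that long silent regions pull the average FLAC bit rate down.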

In all, I think the difference in bit rates correlates with the silence issue.

My humble feeling is that the bitrate can be better described (what it is & what it means), and the message can be better delivered to the community. Simply saying bit rate is discriminative adds another piece of fact but leaves the reason obscure.

Finally, I am one of the authors of that database, so my comments may be biased.

[1] N. Müller, F. Dieckmann, P. Czempin, R. Canals, K. Böttinger, and J. Williams, “Speech is Silver, Silence is Golden: What do ASVspoof-trained Models Really Learn?,” in Proc. ASVspoof Challenge workshop, 2021, pp. 55–60. doi: 10.21437/ASVSPOOF.2021-9.

By the way, assuming the same training and development sets as above, the community now has several directions for addressing the data artifact and testing the generalizability of a detector. For example:

- ASVspoof2021 has hidden tracks in the test set that exclude the leading / trailing silence. Testing against such a test set can reflect the robustness of the detector (see the resources at https://www.asvspoof.org/index2021.html).
- Many papers go beyond the standard database protocol and test the trained model on databases from other sources. See https://arxiv.org/pdf/2203.16263.pdf. There are other databases that can be used for testing too, for example the WaveFake database: https://openreview.net/forum?id=74TZg9gsO8W

To expand the space of training data, self-supervised speech models and data augmentation also help to improve the detector's generalizability: [1] J. M. Martín-Doñas and A. Álvarez, “The Vicomtech Audio Deepfake Detection System Based on Wav2vec2 for the 2022 ADD Challenge,” in Proc. ICASSP, 2022, pp. 9241–9245. [2] https://www.isca-speech.org/archive/odyssey_2022/wang22_odyssey.html [3] https://www.isca-speech.org/archive/odyssey_2022/tak22_odyssey.html

There are more that I cannot list here. I may be too optimistic, but I think the failure of the database does not mean all directions in this research field are doomed : )

First of all, thanks for creating this issue. You are clarifying several things that we are trying to study and understand.

> The bit rate of WAV is constant: it equals sampling rate * bits per sample. If we convert FLAC to WAV, we will see 256 kbps for all the data in the database;

I am not sure about the FLAC/WAV descriptions. Sometimes, for practical reasons, I convert audio files to WAV before analyzing them. In any case, I am not very familiar with the algorithms behind the audio formats either.

> Simply saying bit rate is discriminative adds another piece of fact but leaves the reason obscure.

Totally agree. I am indeed investigating the reasons behind it in order to give a better explanation; we stated it more as a "warning".

> A17, A18, and A19 are voice conversion systems. They are sourced from bona fide speech, and they keep the leading / trailing silence. Therefore, their bit rates are not affected by the silence issue.

This explains the results of our experiments.

> In all, I think the difference in bit rates correlates with the silence issue.

To be honest, I have already been analyzing the silence in the audio of the Fake-or-Real dataset and of ASVspoof2019 LA these days.

For Fake-or-Real, I have found that features similar to bitrate are highly significant. Moreover, I have already extracted the silence, and in Fake-or-Real the silence makes detection easier with the features that we presented in the paper. So everything you have written here makes the results of my experiments much clearer to me.

I still need to analyze ASVspoof2021 and the silence in ASVspoof2019, but if what I have found in ASVspoof2019 LA is also present in other datasets such as Fake-or-Real, it means that work on ASVspoof has a better chance of generalizing.

> I may be too optimistic, but I think the failure of the database does not mean all directions in this research field are doomed : )

Of course. We would only like to underline that there are some points that can be improved, or that are at least highly significant for fake/real detection.

> I am not sure about the FLAC/WAV descriptions. Sometimes, for practical reasons, I convert audio files to WAV before analyzing them. In any case, I am not very familiar with the algorithms behind the audio formats either.

Sorry, I misused the terminology: I meant WAV with the PCM coding scheme (https://en.wikipedia.org/wiki/WAV#Comparison_of_coding_schemes). I am wondering what command you used to convert FLAC to WAV.
If you use another coding scheme for WAV, the bitrate depends on that coding scheme, but all files encoded with the same scheme still have the same bitrate.

In my case, sox NN.flac NN.wav converts the file to 16-bit PCM WAV (http://sox.sourceforge.net/sox.html). This command converts FLAC to WAV with the PCM coding scheme, keeping the 16-bit width. By definition, the bit rate is then 256 kbps = 16 kHz * 16 bit.

Then I can use soxi NN.wav to show the bitrate of the file. Here is what I get:

$: sox ./LA_T_1029929.flac ./LA_T_1029929.wav
$: soxi LA_T_1029929.wav 

Input File     : 'LA_T_1029929.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:02.85 = 45620 samples ~ 213.844 CDDA sectors
File Size      : 91.3k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

For FLAC,

$: soxi LA_T_1029929.flac 

Input File     : 'LA_T_1029929.flac'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:02.85 = 45620 samples ~ 213.844 CDDA sectors
File Size      : 48.0k
Bit Rate       : 135k
Sample Encoding: 16-bit FLAC

This bit rate is computed (roughly) as 48k bytes * 8 bits/byte / (45620 samples / 16000 Hz) = 134677 bps ≈ 135 kbps.
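
The same arithmetic can be written as a small Python sketch. The function names are hypothetical, and the FLAC file size is taken as the rounded 48.0k reported by soxi, so the result is only approximate.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Constant bit rate of uncompressed PCM audio, in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def average_bit_rate(file_size_bytes, num_samples, sample_rate_hz):
    """Average bit rate of a file: total bits divided by duration in seconds."""
    duration_s = num_samples / sample_rate_hz
    return file_size_bytes * 8 / duration_s


# Values read from the soxi output for LA_T_1029929 (file size is rounded).
wav_bps = pcm_bit_rate(16000, 16)
flac_bps = average_bit_rate(48_000, 45620, 16000)

print(f"PCM WAV: {wav_bps / 1000:.0f} kbps")
print(f"FLAC:    {flac_bps / 1000:.0f} kbps")
```

The PCM value depends only on the format parameters, while the FLAC value depends on the file size and therefore on how well the content compresses.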

A more accurate computation would subtract the size of the file header, but the difference is small. Note that pydub calls ffprobe -show_format -show_streams FILENAME to get the bitrate (https://ffmpeg.org/ffprobe.html), so the value returned by pydub is affected by the file header. You can also use ffprobe directly to read the information:

ffprobe -hide_banner -show_format -show_streams LA_T_1029929.wav
Input #0, wav, from 'LA_T_1029929.wav':
  Duration: 00:00:02.85, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s
[STREAM]
index=0
codec_name=pcm_s16le
codec_long_name=PCM signed 16-bit little-endian
profile=unknown
codec_type=audio
codec_tag_string=[1][0][0][0]
codec_tag=0x0001
sample_fmt=s16
sample_rate=16000
channels=1
bits_per_sample=16
...
duration_ts=45620
duration=2.851250
bit_rate=256000
...

Anyway, I think all of the above comes down to the definition of bitrate. The original 16-bit PCM WAV audio files should all have the same bitrate. The difference in "bitrate" for FLAC is caused by compression, and the degree of compression is in turn determined by the data itself.

Here is a trial analysis of one file, LA_T_1029929.flac.

I took its first 1s and saved it as LA_T_1029929_s1.wav and LA_T_1029929_s1.flac. This segment is mainly the silent region at the beginning.

Then I took the second segment (2s) and saved it as LA_T_1029929_s2.wav and LA_T_1029929_s2.flac. This segment is mainly speech.

Here are the bitrates from soxi:

$ soxi LA_T_1029929_s1.flac 
...
Bit Rate       : 78.7k
...

$ soxi LA_T_1029929_s2.flac 
...
Bit Rate       : 177k
...

$: soxi LA_T_1029929_s1.wav
...
Bit Rate       : 256k
...

$ soxi LA_T_1029929_s2.wav
...
Bit Rate       : 256k
...

See how s1.flac has a lower bitrate. The files are attached: LA_T_1029929_files.zip

For your analysis, you can trim the silence from all the audio files and plot the distributions of the FLAC "bit rate". I would guess that the distributions for bona fide and spoofed audio then more or less overlap.
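
As a starting point for such a check, here is a minimal Python sketch of trimming leading / trailing silence from a list of integer samples. The frame-peak criterion, the threshold, and the 10 ms frame length are all assumptions for illustration; they are not the silence detection used in [1] or in any particular toolkit.

```python
def trim_silence(samples, threshold, frame_len=160):
    """Drop leading and trailing frames whose peak |amplitude| is <= threshold.

    frame_len=160 corresponds to 10 ms at 16 kHz (an assumed choice).
    Silence between speech frames is kept.
    """
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    active = [i for i, frame in enumerate(frames)
              if max(abs(s) for s in frame) > threshold]
    if not active:
        return []  # the whole signal is below the threshold
    start = active[0] * frame_len
    end = min((active[-1] + 1) * frame_len, len(samples))
    return samples[start:end]


# Toy signal: 0.1 s of silence, 0.1 s of "speech", 0.1 s of silence at 16 kHz.
toy = [0] * 1600 + [1000, -1000] * 800 + [0] * 1600
trimmed = trim_silence(toy, threshold=10)
print(len(toy), "->", len(trimmed), "samples")
```

The trimmed audio would then need to be re-encoded as FLAC before the "bit rate" is recomputed from the new file size and duration.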

However, if there is still a difference, I am ready to offer a complementary explanation. I would be honored to hear what you find in the near future!

Thanks a lot! I think that I will publish all my analysis on this webpage https://unict-fake-audio.github.io/ASVspoof2019-feature-webview/dataset-webview/

To convert FLAC to WAV, I use pydub.AudioSegment and its from_file() and export() functions.

UNICT-Fake-Audio / fake-audio-detector
