IPS-LMU / EMU-webApp

The EMU-webApp is an online and offline web application for labeling, visualizing and correcting speech and derived speech data.
http://ips-lmu.github.io/EMU-webApp/
MIT License

FLAC sound files should be supported #328

Open FredrikKarlssonSpeech opened 1 year ago

FredrikKarlssonSpeech commented 1 year ago

I am aware that you are out of funding, but allowing FLAC sound files would be a great addition. It seems that anything but wav files is stopped at the schema level at the moment, while the Emu config file has an explicit mediaExtension field that would allow FLAC files to be processed on equal terms with wav files.

FredrikKarlssonSpeech commented 1 year ago

It may be that you only need to allow it in the schema, as most browsers already support (playing) FLAC now.

https://www.lambdatest.com/web-technologies/flac

(except IE, but maybe one should not let that fact stand in the way)?

I realise that you need to be able to do more with the signal file, but maybe FLAC support in the media processing library you use is worth looking into? It would halve the transfer time of the sound content. I realise that it does not matter much when you are serving a database on / from your local machine, but when it is served to a remote user, this delay matters a lot.

MJochim commented 1 year ago

I agree that FLAC support would be a nice addition.

I know FLAC achieves a compression rate of about 50% on audio (on music, anyway; not sure about speech, it may be even more there). I don't know what kind of compression gzip can achieve on audio/speech. I'm bringing it up because some web servers do or can be configured to compress payload data with gzip on the fly. Have you looked into that? What server software are you using?

MJochim commented 1 year ago

And I would definitely ignore IE. It has been many years now since it was replaced by Edge.

FredrikKarlssonSpeech commented 1 year ago

We will look into what can already be done using server settings, but of course I had to do some testing, using a sample recording from one of our ongoing projects.

Gzip compression time (with increasing compression level):

1:        0,45 real         0,44 user         0,00 sys
2:        0,46 real         0,46 user         0,00 sys
3:        0,55 real         0,53 user         0,00 sys
4:        0,62 real         0,61 user         0,00 sys
5:        0,85 real         0,84 user         0,00 sys
6:        1,07 real         1,05 user         0,00 sys
7:        1,09 real         1,08 user         0,00 sys
8:        1,10 real         1,09 user         0,01 sys
9:        1,11 real         1,09 user         0,01 sys

and file sizes:

52760 test.wav
40448 9_test.gzip
39808 8_test.gzip
39680 7_test.gzip
39808 6_test.gzip
39552 5_test.gzip
39480 4_test.gzip
39728 3_test.gzip
39816 2_test.gzip
39864 1_test.gzip

So the file sizes are about 75-77% of the original after compression. (And the compression level does not matter much for size, but higher levels add considerable processing time.)
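
For anyone who wants to repeat this kind of level sweep without the command line, here is a minimal Python sketch. It is not how the numbers above were produced: it uses Python's standard `gzip` module rather than the `gzip` program, and, because `test.wav` is not available, a synthetic tone with noise stands in for speech (the helper `synthetic_pcm` and its parameters are made up for illustration). Absolute times and ratios will therefore differ from the listing above.

```python
import gzip
import math
import random
import struct
import time


def synthetic_pcm(seconds=1.0, rate=16000, seed=0):
    """Return 16-bit mono PCM bytes: a 110 Hz tone plus a little noise.

    This is only a stand-in for a speech recording (an assumption).
    """
    rng = random.Random(seed)
    n = int(seconds * rate)
    samples = [
        int(3000 * math.sin(2 * math.pi * 110 * i / rate)) + rng.randint(-50, 50)
        for i in range(n)
    ]
    return struct.pack(f"<{n}h", *samples)


def gzip_sweep(data):
    """Compress data at gzip levels 1-9.

    Returns a list of (level, elapsed seconds, compressed size in bytes).
    """
    results = []
    for level in range(1, 10):
        start = time.perf_counter()
        compressed = gzip.compress(data, compresslevel=level)
        elapsed = time.perf_counter() - start
        results.append((level, elapsed, len(compressed)))
    return results


if __name__ == "__main__":
    pcm = synthetic_pcm()
    for level, elapsed, size in gzip_sweep(pcm):
        share = size / len(pcm)
        print(f"{level}: {elapsed:.3f} s  {size} bytes  ({share:.0%} of original)")
```

To test a real recording, read the file's bytes and pass them to `gzip_sweep` instead of the synthetic signal.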

FLAC:

Processing time

0:        0,23 real         0,17 user         0,03 sys
1:        0,21 real         0,18 user         0,02 sys
2:        0,24 real         0,21 user         0,02 sys
3:        0,22 real         0,19 user         0,02 sys
4:        0,24 real         0,21 user         0,02 sys
5:        0,29 real         0,27 user         0,02 sys
6:        0,38 real         0,35 user         0,02 sys
7:        0,39 real         0,36 user         0,02 sys
8:        0,59 real         0,56 user         0,02 sys

and then you get these file sizes (depending on compression level):

52760 test.wav
21608 8_test.flac
21640 7_test.flac
21840 6_test.flac
21888 5_test.flac
21896 4_test.flac
22080 3_test.flac
25544 2_test.flac
25624 1_test.flac
25616 0_test.flac

about 41-49% of the original file size.

So FLAC is of course quite a bit faster and more efficient in encoding speech files, but that is maybe not so surprising.
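
As a cross-check of the percentages quoted for gzip and FLAC, the ratios can be recomputed directly from the listed sizes. The sketch below only uses numbers copied from the listings; the unit of the sizes is not stated in the thread, but it cancels out in the ratio.

```python
ORIGINAL = 52760  # test.wav, in the same unstated unit as the listings

# Sizes copied from the listings above, keyed by compression level.
GZIP = {1: 39864, 2: 39816, 3: 39728, 4: 39480, 5: 39552,
        6: 39808, 7: 39680, 8: 39808, 9: 40448}
FLAC = {0: 25616, 1: 25624, 2: 25544, 3: 22080, 4: 21896,
        5: 21888, 6: 21840, 7: 21640, 8: 21608}


def ratio_range(sizes, original=ORIGINAL):
    """Return (smallest, largest) compressed size as a fraction of the original."""
    ratios = [size / original for size in sizes.values()]
    return min(ratios), max(ratios)


if __name__ == "__main__":
    for name, sizes in (("gzip", GZIP), ("FLAC", FLAC)):
        low, high = ratio_range(sizes)
        print(f"{name}: {low:.1%} to {high:.1%} of the original size")
```

With these inputs, gzip comes out at roughly 75-77% of the original size and FLAC at roughly 41-49%.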

FredrikKarlssonSpeech commented 1 year ago

As a point of reference I of course had to have a look at bzip2 too. Compression time:

1:        2,05 real         2,00 user         0,03 sys
2:        1,93 real         1,89 user         0,02 sys
3:        1,93 real         1,89 user         0,02 sys
4:        1,93 real         1,90 user         0,02 sys
5:        1,95 real         1,92 user         0,02 sys
6:        1,96 real         1,92 user         0,02 sys
7:        1,99 real         1,95 user         0,03 sys
8:        1,98 real         1,95 user         0,02 sys
9:        2,00 real         1,97 user         0,02 sys

and file sizes

52760 test.wav
36504 9_test.bzip
35832 8_test.bzip
37504 7_test.bzip
36208 5_test.bzip
36032 6_test.bzip
35824 4_test.bzip
36000 3_test.bzip
36232 2_test.bzip
37160 1_test.bzip

I also tested LZMA2 (which it seems is discussed re browser support too)

1:        3,07 real         2,72 user         0,03 sys
2:        3,90 real         3,85 user         0,02 sys
3:        7,16 real         7,11 user         0,05 sys
4:        9,03 real         8,97 user         0,05 sys
5:       11,34 real        11,25 user         0,07 sys
6:       11,89 real        11,73 user         0,12 sys
7:       13,75 real        13,49 user         0,18 sys
8:       13,43 real        13,19 user         0,21 sys
9:       13,16 real        12,97 user         0,15 sys

and file sizes:

52760 test.wav
32896 9_test.lz
32896 8_test.lz
32896 7_test.lz
32896 6_test.lz
32896 5_test.lz
32896 4_test.lz
36992 3_test.lz
36992 2_test.lz
36992 1_test.lz

Obviously not a strong improvement over gzip for speech recordings.

klausj commented 1 year ago

Another good lossless audio compression format is WavPack. However, I could not find any information about WavPack support in browsers. According to Caniuse, FLAC is supported by all modern browsers.

MJochim commented 1 year ago

Thank you for the benchmark. What is the unit for size here? Byte? Kilobyte? Block?

I am thinking, maybe turning off on-the-fly gzip compression might even be faster in the end. With the server’s on-the-fly encoding, we’re looking at a trade-off in computation time vs. data transfer time. With minuscule compression and fast-ish data transfer, the computation time might not be worth it (depending of course on a lot of factors).

The same may hold true, however, for FLAC: especially if the client has a weak CPU, decompressing the FLAC file might cancel out the gain in transfer time. Hence why I ask if we are looking at ~50 KB (~one word or utterance per recording) or ~50 MB (long recording) in your benchmark. And what kind of delay times are you talking about with your remote users? The half seconds or couple of seconds that make a tool seem less responsive? Or dozens of seconds, or minutes?

I can give this some more thought tomorrow.

FredrikKarlssonSpeech commented 1 year ago

> Hence why I ask if we are looking at ~50 KB (~one word or utterance per recording) or ~50 MB (long recording) in your benchmark. And what kind of delay times are you talking about with your remote users? The half seconds or couple of seconds that make a tool seem less responsive? Or dozens of seconds, or minutes?

The file is a 27 MB stereo recording of about 3 minutes in length. Not sure how that translates to the figures above, but the most important thing is that the files are the same.
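
One way to reconcile the 27 MB figure with the 52760 shown in the listings is to assume that the sizes are counts of 512-byte blocks, which is what BSD-style `ls -s` and `du` report by default. The thread does not say which command produced the listings, so this is an assumption. The sketch below simply checks which candidate unit lands closest to the stated file size.

```python
LISTED_SIZE = 52760          # test.wav as listed in the benchmark
REPORTED_BYTES = 27_000_000  # "27 MB", taken as roughly 27e6 bytes

# Candidate meanings of one listed unit, in bytes (the 512-byte block is an assumption).
CANDIDATE_UNITS = {"byte": 1, "512-byte block": 512, "KiB": 1024}


def best_unit(listed=LISTED_SIZE, reported=REPORTED_BYTES):
    """Return the candidate unit whose implied size is closest to the reported size."""
    return min(
        CANDIDATE_UNITS,
        key=lambda unit: abs(listed * CANDIDATE_UNITS[unit] - reported),
    )


if __name__ == "__main__":
    for unit, factor in CANDIDATE_UNITS.items():
        print(f"{unit}: {LISTED_SIZE * factor / 1e6:.2f} MB")
    print("closest match:", best_unit())
```

Under that assumption, 52760 blocks is 27,013,120 bytes, which matches the stated 27 MB, and the smallest FLAC file (21608 blocks) would be about 11 MB.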

The times are fractions of a second.

Of course, I had to look at decoding time too:

GZIP

1:        0,13 real         0,12 user         0,00 sys
2:        0,11 real         0,10 user         0,00 sys
3:        0,11 real         0,10 user         0,00 sys
4:        0,11 real         0,10 user         0,00 sys
5:        0,11 real         0,10 user         0,00 sys
6:        0,11 real         0,10 user         0,00 sys
7:        0,11 real         0,10 user         0,00 sys
8:        0,11 real         0,10 user         0,00 sys
9:        0,11 real         0,10 user         0,00 sys

FLAC

        0,16 real         0,13 user         0,02 sys
        0,14 real         0,12 user         0,01 sys
        0,14 real         0,12 user         0,01 sys
        0,18 real         0,17 user         0,01 sys
        0,18 real         0,17 user         0,01 sys
        0,18 real         0,17 user         0,01 sys
        0,16 real         0,14 user         0,01 sys
        0,15 real         0,14 user         0,01 sys

So, there is a 24-63% increase in processing time when decoding FLAC compared to gzip. I am not sure that the two are mutually exclusive; with gzip compression handled by the browser, you might of course also get faster transfer of SSFF signal files. (But there is not much of a speedup to be had from combining them: the file sizes below are for the default and the maximum gzip compression levels applied to the most compressed FLAC file.)

21584 5_8_test.gz
21584 9_8_test.gz

Of course, if one could have a 3GB speech database stored on disc rather than a 6GB one, and still work with it seamlessly in the web app, then that would be a good thing.
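
The trade-off discussed in this thread (encoding and decoding time versus transfer time) can be put into a rough back-of-the-envelope model. The sketch below is illustrative only and rests on several assumptions: the listed sizes are 512-byte blocks; on-the-fly gzip pays the measured level 6 compression time on every request, while FLAC files are encoded once in advance; 0,16 s is taken as a representative FLAC decoding time; and the command-line timings carry over to a server and a browser, which they may not. The bandwidth values are hypothetical.

```python
BLOCK_BYTES = 512  # assumed size of one listed unit


def delivery_time(size_blocks, bandwidth_mbit_s, encode_s=0.0, decode_s=0.0):
    """Seconds to encode (if done per request), transfer, and decode one file."""
    transfer_s = size_blocks * BLOCK_BYTES * 8 / (bandwidth_mbit_s * 1_000_000)
    return encode_s + transfer_s + decode_s


# Sizes and timings taken from the benchmark in this thread.
OPTIONS = {
    "wav": dict(size_blocks=52760),
    "gzip -6 on the fly": dict(size_blocks=39808, encode_s=1.07, decode_s=0.11),
    "flac -8 pre-encoded": dict(size_blocks=21608, decode_s=0.16),
}


def compare(bandwidth_mbit_s):
    """Return {option name: total seconds} for the given bandwidth."""
    return {
        name: delivery_time(bandwidth_mbit_s=bandwidth_mbit_s, **params)
        for name, params in OPTIONS.items()
    }


if __name__ == "__main__":
    for bandwidth in (10, 100, 1000):  # hypothetical link speeds in Mbit/s
        totals = compare(bandwidth)
        summary = ", ".join(f"{name}: {secs:.2f} s" for name, secs in totals.items())
        print(f"{bandwidth} Mbit/s -> {summary}")
```

Under these assumptions, pre-encoded FLAC is the quickest option on a slow link, while on a very fast link the uncompressed wav file wins and on-the-fly gzip is the slowest, which is in line with the concern that compression time may outweigh the transfer gain.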