Closed: emoon closed this issue 11 months ago
Hi, thanks for the great bug report! I could easily see what went wrong and reproduce the issue.
The problem was a bit tricky to investigate and fix, as it boiled down to libvorbis using the dangerous `alloca` function to allocate a buffer at least `sample_count * 4` bytes big on the stack. In your example code, `sample_count` is 48000 * 100, which translates to libvorbis allocating a whopping 18.31 MiB on the stack! For context, the default stack size Rust sets for threads it spawns is usually 2 MiB, and stacks larger than 8 MiB are rarely seen, so this allocation fails and the stack overflows, eventually triggering a SIGSEGV.
The possible solutions are to modify libvorbis to call `malloc`/`realloc` instead, or to make sure a big enough stack is allocated on the heap before calling the pesky `alloca` (cf. the `stacker` crate). For reasons explained in the description of the linked commit and the related libvorbis submodule commit, I chose the first approach.
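For completeness, a minimal sketch of what the second, rejected approach could look like with the `stacker` crate; the thresholds are made up for illustration, and this is not what vorbis_rs does:

```rust
// Grow the stack (on the heap) before entering code that may `alloca` a
// large buffer. Sketch only; vorbis_rs patches libvorbis instead.
fn encode_with_big_stack(samples: &[f32]) {
    // Ensure at least 1 MiB of stack headroom; otherwise run the closure on
    // a fresh 32 MiB heap-allocated stack segment.
    stacker::maybe_grow(1024 * 1024, 32 * 1024 * 1024, || {
        // ... call into the C encoder here, where the big alloca happens ...
        let _ = samples;
    });
}
```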
Please let me know if the fix works for you or if you have anything else to add :smile:
Edit: minor wording tweaks based on a (now deleted) comment by @emoon, thanks!
Related upstream PR: https://github.com/xiph/vorbis/pull/104
If the issue isn't fixed in vorbis, it's also possible to apply a workaround in the Rust code similar to what I did in my own code (i.e., sending smaller chunks of the buffer over to the C code instead).
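For anyone hitting the same thing, a minimal sketch of that kind of chunking on the Rust side; the helper and its names are made up for illustration, and the closure is where a call to `encode_audio_block` would go:

```rust
/// Split per-channel sample buffers into fixed-size blocks and hand each
/// block to `encode_block` (e.g. a closure wrapping `encode_audio_block`).
fn encode_in_chunks<E>(
    channels: &[Vec<f32>],
    block_size: usize,
    mut encode_block: impl FnMut(&[&[f32]]) -> Result<(), E>,
) -> Result<(), E> {
    let total_samples = channels.first().map_or(0, Vec::len);
    let mut offset = 0;
    while offset < total_samples {
        let end = (offset + block_size).min(total_samples);
        // Borrow the same sample range from every channel.
        let block: Vec<&[f32]> = channels.iter().map(|ch| &ch[offset..end]).collect();
        encode_block(&block)?;
        offset = end;
    }
    Ok(())
}
```

With a block size of, say, 48000 samples per channel, this mirrors the chunked setup described in the comments below.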
Yes, that's also a valid option, although the exact maximum buffer size is platform-dependent and thus difficult to calculate correctly. Given that vorbis_rs uses a custom libvorbis fork anyway, it's not a maintenance burden to patch it for the time being.
Wanted to make an update about this here. I have tested with the latest code and it does indeed fix the crash, but there is another problem with it. When I use my old code, where I send chunks of samples (in my case 48k per channel) to `encode_audio_block`, encoding the song takes about 3 sec (which includes some other stuff as well, but that isn't important here).
When I use the latest code and send the whole buffer, it takes about 150 sec (!) to do the encoding, so almost 50x longer. I'm sure it's not the fault of the Rust crate here, but this is something users may run into if they also send in a large buffer and have no idea what to expect of the encoding time.
Great catch! I attached a profiler to two runs of the encoder over a 5-minute, mono, 44.1 kHz song, the only difference being that in one run I chunked the input into blocks of 65536 samples, while in the other I handed off the entire song to the encoder in a single block. The results were as follows:
As the profiler results show, the slowdown in the non-chunked case is entirely due to a very slow implementation of `vorbis_analysis_blockout`, which spends a non-negligible amount of time copying memory buffers around. The encoding itself, done by the `vorbis_analysis` function, is responsible for only about 2.7% of the runtime.
In stark contrast, when chunking the input into smaller blocks, the encoding is responsible for most CPU cycles, even though `vorbis_analysis_blockout` takes a non-negligible 14% of the total execution time.
In any case, there isn't anything this Rust crate can do about the slowdown, other than trying to come up with a libvorbis patch that improves the performance of `vorbis_analysis_blockout`.
For now, I've added the following note to the `encode_audio_block` method documentation to make users aware of the tradeoffs between different block sizes, and hopefully guide them to choose the block size that works best for their application (I'm a bit reluctant to do any block size manipulation in this crate, as the crate doesn't have information about the entire use case to choose the optimal block size, and I don't want to introduce "hidden" chunking or buffering costs):
Although the Vorbis encoder is meant to support blocks of any size, there are some performance considerations that users should bear in mind when choosing an optimal block size for their application:
- Blocks that are too small (as a rough guideline, between one and 16 samples in size) result in more CPU overhead, which translates into slower execution times due to more repetitions of the per-block encoding setup logic. However, Vorbis is quite well optimized to handle small block sizes: in practice, no slowdowns greater than 2x have been observed even when using a single sample per channel and block, but your mileage may vary.
- Blocks that are too large (e.g., from 2^18 = 262144 samples onwards) have apparently not been tested much, and lead to a sharp degradation of the encoding runtime and much higher peak memory usage as the block size increases. Slowdowns of several orders of magnitude have been observed when encoding minutes of audio as a single block.
As a rule of thumb, a pretty good block size is at most a few seconds of audio, or no orders of magnitude larger than the maximum Vorbis encoding window size, 2^13 = 8192, and does not require your application to do any chunking (i.e., splitting of larger blocks into smaller ones) or buffering (i.e., combining smaller blocks into larger ones). When in doubt, use smaller block sizes. The libvorbis documentation states that "1024 is a reasonable choice" for a block size.
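For instance, a helper that follows that guidance might look like the sketch below; the function and its thresholds are purely illustrative and not part of the crate's API:

```rust
/// Pick a per-channel block size following the guidance above: around a
/// second of audio, never below the libvorbis-recommended 1024 samples and
/// never orders of magnitude above the 8192-sample encoding window.
fn suggested_block_size(sample_rate: u32) -> usize {
    let one_second = sample_rate as usize;
    one_second.clamp(1024, 8 * 8192)
}
```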
Hi,
First of all, thanks for a great crate.
I'm running into a crash when I'm trying to encode a "large" buffer. Here is a small repro case for it.
Callstack
I can work around this issue by manually sending smaller chunks of the data to `encode_audio_block`, but I don't see any limitation in the API/docs that would indicate that I have to do that.