OSU-Bee-Lab / buzzdetect

a machine learning tool to detect and classify bee buzzes in audio
GNU General Public License v3.0

Popen not appropriately multithreading ffmpeg processes #8

Closed · LukeHearon closed this 11 months ago

LukeHearon commented 11 months ago

While running analyze_multithread, the chunking steps appear to be completely sequential: all of the empty chunk files are created simultaneously, but they only build one at a time, and each thread sits at full load until its respective file is finished. Is this an issue with Popen? With multiple ffmpeg processes reading from the same file?
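For context, the chunking launch presumably looks something like this (a minimal sketch; `launch_chunk_processes`, the control fields, and the exact ffmpeg flags are my assumptions, not the actual buzzdetect code):

```python
import subprocess

def launch_chunk_processes(control):
    """Launch one ffmpeg process per chunk described in `control`.

    `control` is assumed to be a list of dicts carrying the source path,
    chunk start and duration (seconds), and the output chunk path.
    """
    processes = []
    for chunk in control:
        cmd = [
            "ffmpeg", "-y",
            "-i", chunk["source"],        # every chunk reads the same source file
            "-ss", str(chunk["start"]),
            "-t", str(chunk["duration"]),
            chunk["output"],
        ]
        # Popen returns immediately, so in principle all processes run at once
        processes.append(subprocess.Popen(cmd))
    return processes
```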

LukeHearon commented 11 months ago

Hmmm... actually, running make_chunk_from_control directly*, it seems like there might just be some delay between when each file starts building. E.g.: [screenshot]

This image makes it look like some threads aren't processing, but:

[screenshot]

They are actually all building at the same time. Maybe my test files were just too small and the delay between start times made it look like a serial operation?

Shoot, nope. Running from analyze_multithread is truly a serial operation (final WAV size is 187.5; each file doesn't start building until the previous one is done): [screenshot]

*Edit: I think I wasn't running make_chunk_from_control in its entirety; when I run the code using the function itself (e.g. make_chunk_from_control(control)), I get the same serial operation. Maybe the problem is in the p.wait() line? I thought this was intended to make Python wait until the processes were done before moving to its next line, not to make each process wait on the one prior.
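For reference, p.wait() only blocks the calling Python code until that one process exits; it doesn't chain processes together. Whether launches serialize depends on where the wait happens. A minimal sketch of the two patterns (hypothetical commands, not the buzzdetect code):

```python
import subprocess

cmds = [["ffmpeg", "-y", "-i", f"in_{i}.mp3", f"out_{i}.wav"] for i in range(8)]

# Serializing pattern: waiting inside the launch loop means the next
# process isn't even started until the previous one has exited.
for cmd in cmds:
    p = subprocess.Popen(cmd)
    p.wait()

# Concurrent pattern: launch everything first, then wait on each handle.
procs = [subprocess.Popen(cmd) for cmd in cmds]
for p in procs:
    p.wait()
```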

LukeHearon commented 11 months ago

Nah, leaving out p.wait() doesn't help.

LukeHearon commented 11 months ago

Aha. I think the issue is that ffmpeg won't read multiple parts of the same file at the same time. So each thread is waiting for the file read to reach its start point. That delay in previous testing was because each chunk started a few seconds after the one prior. Here's the test: if I run the same chunking operation 8 times, will they all process at the same time?
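One known ffmpeg behavior that would produce exactly this symptom is output seeking: when -ss is given after -i, ffmpeg decodes the input from the beginning and discards everything before the seek point, so a chunk deep into the file waits while the whole prefix is decoded. Putting -ss before -i seeks in the input instead. Whether that applies here depends on how the chunking command is built, which I'm assuming below:

```python
start, dur = 3000, 30  # example chunk start/duration in seconds

# Output seeking: with -ss after -i, ffmpeg decodes from time 0 and
# discards everything before `start`, so a late chunk pays for the
# whole prefix of the file.
slow = ["ffmpeg", "-y", "-i", "field.mp3",
        "-ss", str(start), "-t", str(dur), "chunk.wav"]

# Input seeking: with -ss before -i, ffmpeg seeks in the input before
# decoding begins.
fast = ["ffmpeg", "-y", "-ss", str(start), "-i", "field.mp3",
        "-t", str(dur), "chunk.wav"]
```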

LukeHearon commented 11 months ago

> if I run the same chunking operation 8 times, will they all process at the same time?

Yep! They do. Crap. Best solution: find a way to make ffmpeg read the same file at multiple places. Failing that: have a different source on each thread.

LukeHearon commented 11 months ago

This brings up another question: am I CPU-limited or I/O-limited? If I'm I/O-limited, reading from n files at once will slow each ffmpeg process down to 1/n of its original speed at best. I should test this first, and if I find that working one file at a time is faster, then I'll just have a single-threaded worker chunking files and all other threads analyzing.
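A quick way to tell the two apart (just a sketch, assuming psutil is available): watch CPU utilization while a single chunking run is in flight. Saturated cores point to CPU-limited; mostly-idle cores with the process still slow point to I/O-limited.

```python
import subprocess
import psutil

# Sample system-wide CPU utilization while a single chunking run is in flight.
p = subprocess.Popen(["ffmpeg", "-y", "-i", "single.mp3",
                      "-t", "3600", "chunk.wav"])
samples = []
while p.poll() is None:
    samples.append(psutil.cpu_percent(interval=1))  # percent over the last second
if samples:
    print(f"mean CPU while chunking: {sum(samples) / len(samples):.0f}%")
```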

LukeHearon commented 11 months ago

Made a directory with 9 audio files, each a unique, full-length field recording: one "single.mp3" and eight "multi_#.mp3" where # runs 1–8. The single-file process takes 8 sequential chunks from the "single" audio file, one per thread across 8 threads; the multi-file process takes the starting chunk (same length) from each of the 8 "multi" audio files, each on its own thread. Using mclapply in R showed the same serial behavior as Popen in Python, so this does appear to be an ffmpeg thing (or an inherent file-reading thing?) and not something to do with my Python script. microbenchmark() was run with times = 5 for each approach.
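A rough Python equivalent of the two conditions (the harness below is my sketch, not the R code actually used; file names follow the description above):

```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK_LEN = 30  # seconds; repeat with 360 and 3600

def chunk(source, start, out):
    # One ffmpeg process per chunk; input seeking so the start time is cheap.
    subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-i", source,
                    "-t", str(CHUNK_LEN), out],
                   check=True, capture_output=True)

def time_jobs(jobs):
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(chunk, *args) for args in jobs]
        for f in futures:
            f.result()  # propagate any ffmpeg failure
    return time.perf_counter() - t0

# Single-file: 8 sequential chunks from one recording, one per thread.
single = [("single.mp3", i * CHUNK_LEN, f"single_{i}.wav") for i in range(8)]
# Multi-file: the starting chunk of each of 8 recordings, one per thread.
multi = [(f"multi_{i}.mp3", 0, f"multi_{i}.wav") for i in range(1, 9)]

print("single-file:", time_jobs(single))
print("multi-file: ", time_jobs(multi))
```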

Results are as follows (raw timings were attached as screenshots):

| Chunk length | Multi-file speedup |
| --- | --- |
| 30 s | 2.5× |
| 360 s | 3.8× |
| 3600 s (1 hour) | 3.8× |

Solution?

Assign one thread per file. Unfortunately, if # threads > # files, some threads will just have to sit idle; there's no performance to be gained by splitting one file into multiple chunks across multiple threads. On the other hand, this makes chunking very easy! And if we launch CPU-based analysis once chunking has begun, those threads won't be idle anyway. A minimal sketch is below.
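Something like this (names and the fixed-length assumption are placeholders; the real version will be driven by the chunklist):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def chunk_file(source, chunk_len, total_len):
    # Each worker owns exactly one source file, so no two ffmpeg
    # processes ever read the same file; chunks within a file are
    # cut sequentially by this one worker.
    for start in range(0, total_len, chunk_len):
        out = f"{source}_chunk{start}.wav"
        subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-i", source,
                        "-t", str(chunk_len), out],
                       check=True, capture_output=True)

def chunk_all(files, chunk_len, total_len, threads=8):
    # One thread per file; extra threads beyond len(files) simply idle.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for f in files:
            pool.submit(chunk_file, f, chunk_len, total_len)
```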

I'll mark this solved once I integrate thread-per-file chunking into the chunklist.

LukeHearon commented 11 months ago

and it only took an hour ;)