OpenShot / libopenshot

OpenShot Video Library (libopenshot) is a free, open-source project dedicated to delivering high quality video editing, animation, and playback solutions to the world. The API currently supports C++, Python, and Ruby.
http://www.openshot.org
GNU Lesser General Public License v3.0

Audio Refactoring Discussion #554

Closed. jonoomph closed this issue 3 years ago.

jonoomph commented 3 years ago

I wanted to open a discussion on the way OpenShot currently handles a few somewhat inter-related things, including audio data, image data, and the sequence in which they are merged.

Frame: Design Goal

It was always my vision/design that a video could be represented by openshot::Frame objects, and the Frame would be a dependable and self-contained data structure, and would contain all the data needed to represent a Frame of video (or really a slice of time). It could be passed around in any order, and would be consumed by many different Classes and Methods.

Audio Alignment

One of the most noticeable (and bothersome) issues with libopenshot is pops/crackles and audio data loss. This happens because the Timeline class expects all Frame objects to contain a specific number of audio samples. Sometimes, however, due to many varying factors, some Frames have either too many or too few audio samples. The Timeline does not handle this well, and when merging multiple overlapping Frame objects together, it finds the smallest number of samples they agree on and truncates any additional audio samples.

For example:
Track 2: Clip 1, Frame 100, 1440 samples
Track 1: Clip 2, Frame 200, 1444 samples (4 samples would be lost/truncated, so it matches the above Frame)

This could result in an audio pop, depending on the audio data, or it could be silent.
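
To make the failure mode concrete, here is a rough sketch (illustrative only, plain std::vector and a single channel instead of libopenshot's actual Frame/JUCE buffers) of what merge-by-truncation effectively does:

// Minimal sketch of merge-by-truncation (illustrative only, single channel).
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<float> merge_truncating(const std::vector<float>& a,
                                    const std::vector<float>& b)
{
    // Keep only the sample count both frames agree on; any surplus is dropped.
    const std::size_t count = std::min(a.size(), b.size());
    std::vector<float> mixed(count);
    for (std::size_t i = 0; i < count; ++i)
        mixed[i] = a[i] + b[i];   // the dropped tail is the audible gap/pop
    return mixed;
}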

Audio Division

It turns out not all sample rates are evenly divisible by all frame rates. This is just the reality of video editing, and not much can be done about it. Some Frame objects will always have a different number of samples. This is exacerbated on the Timeline class, which can further position a Clip (i.e. add an offset to the frame number), potentially shuffling the Frame objects further and creating more frequent mismatches in the number of audio samples when merging overlapping Frame objects.
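
As a quick illustration (a hedged sketch, not the actual libopenshot computation), a non-integer samples-per-frame value means consecutive frames must carry slightly different counts, e.g. alternating 1601 and 1602 samples at 48000 Hz and 30000/1001 fps:

// Sketch: how many samples belong to a given frame when the sample rate does
// not divide evenly by the frame rate. Consecutive frames then carry slightly
// different counts (e.g. 1601 vs. 1602 at 48000 Hz and 30000/1001 fps).
#include <cmath>
#include <cstdint>

int samples_in_frame(std::int64_t frame_number, int fps_num, int fps_den, int sample_rate)
{
    const double fps = static_cast<double>(fps_num) / fps_den;
    // Total samples expected by the end of this frame, minus the total expected
    // by the end of the previous frame.
    const double total    = std::round(static_cast<double>(frame_number) * sample_rate / fps);
    const double previous = std::round(static_cast<double>(frame_number - 1) * sample_rate / fps);
    return static_cast<int>(total - previous);
}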

Smarter Merging

One potential solution is for the Timeline::add_layer method to be much more thoughtful in how it merges audio, adding special handling for the case where Frame instances have a mismatched number of samples, or where Frame instances don't have the expected number of samples. The Timeline might need to keep track of the "last sample" position on different Clip objects, and might need to request one or more additional Frame instances to fill in missing samples.

Smarter Clips

Another potential solution is to add a Clip::AlignAudio(int frame, int expected_samples) method, and have the Timeline class call this new method for each clip instance (prior to calling GetFrame). This could allow the Clip object to fix the number of samples to the desired amount prior to returning the Frame object. And since we already loop through and call Clip::GetFrame for the next X frames (to cache the images), we could easily incorporate this additional call. It's a similar solution to the previous one, but refactors the logic into the Clip instance. This would allow the Timeline to be even dumber, and just blindly merge audio samples for overlapping frames, because the Clip object has already fixed/aligned the audio samples correctly prior to the Clip::GetFrame call.
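
Here is a rough sketch of what such an alignment step could look like (a hypothetical helper, single channel, plain std::vector instead of the real Frame/JUCE types): surplus samples get carried over to the next frame instead of being truncated, and short frames borrow from that carry-over before falling back to silence.

// Hypothetical per-clip aligner (illustrative only, not existing libopenshot API).
#include <cstddef>
#include <deque>
#include <vector>

struct AudioAligner
{
    std::deque<float> holdover;   // surplus samples waiting for the next frame

    std::vector<float> align(std::vector<float> samples, std::size_t expected)
    {
        // Prepend whatever the previous frame left behind.
        samples.insert(samples.begin(), holdover.begin(), holdover.end());
        holdover.clear();

        if (samples.size() > expected) {
            // Carry the surplus forward instead of truncating it.
            holdover.assign(samples.begin() + static_cast<std::ptrdiff_t>(expected), samples.end());
            samples.resize(expected);
        } else if (samples.size() < expected) {
            // Still short: a real implementation would pull more samples from the
            // reader; padding with silence is only a last resort here.
            samples.resize(expected, 0.0f);
        }
        return samples;
    }
};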

While on the topic of smarter Clip instances, I also really love the idea of refactoring all QImage and QTransform code into the Clip::GetFrame method, so that all Keyframes on a Clip are "resolved" into pixels prior to returning a Frame object. This would allow the Timeline to be even dumber, so it can just combine the images blindly for overlapping Frame instances.

I would love to get more thoughts on these topics, before we make a decision on the best way to improve things. The ultimate goal is to remove all audio data loss, and never lose any audio information while merging Frame instances on the Timeline. Audio pops are one of the most frustrating parts of OpenShot's current source code, and I would love to find a solid design to remove them for good!

jonoomph commented 3 years ago

@ferdnyc I'm also remembering a discussion we had about some potential audio improvements we could make, including not casting audio sample data types, and keeping the JUCE audio buffer in the type we decode? I don't recall the specifics, but feel free to jump in with any of those specifics too, if you think it would help simplify the audio pipeline. :+1:

eisneinechse commented 3 years ago

This might be a good time to handle the audio problem one gets when using an FFmpeg build compiled from the current master. For quite some time now, the binaries produced cannot export audio; the error is an invalid parameter when sending the audio to be encoded. If this isn't handled, the next version of FFmpeg won't work with OpenShot, which would be very unfortunate, as some interesting new things are included.

jonoomph commented 3 years ago

As an example of the problem, here are all the common sample rates divided by all the common frame rates. If the samples per frame do not come out to a whole number, we ultimately have a problem in the Timeline::add_layer() method.

Also, it's worth noting that if a FrameMapper is also used (which it often is in normal use), the samples per frame can be further randomized, due to resampling and some samples remaining in the resampler, which exacerbates the issue when merging overlapping Frame instances.

---------------------------------
Common Sample Rates / Common FPS
---------------------------------
FPS: [15.0, 23.976023976023978, 24.0, 25.0, 29.97002997002997, 30.0, 50.0, 59.94005994005994, 60.0]
Sample Rates: [8000.0, 11025.0, 16000.0, 22050.0, 44100.0, 48000.0, 88200.0, 96000.0, 176400.0, 192000.0, 384000.0, 3528000.0]
---------------------------------

(Sample Rate / FPS) = Samples Per Frame
---------------------------------
BAD:    8000.0 / 15.00 =    533.33
GOOD:   11025.0 / 15.00 =   735
BAD:    16000.0 / 15.00 =   1066.66
GOOD:   22050.0 / 15.00 =   1470
GOOD:   44100.0 / 15.00 =   2940
GOOD:   48000.0 / 15.00 =   3200
GOOD:   88200.0 / 15.00 =   5880
GOOD:   96000.0 / 15.00 =   6400
GOOD:   176400.0 / 15.00 =  11760
GOOD:   192000.0 / 15.00 =  12800
GOOD:   384000.0 / 15.00 =  25600
GOOD:   3528000.0 / 15.00 = 235200
BAD:    8000.0 / 23.98 =    333.66
BAD:    11025.0 / 23.98 =   459.83
BAD:    16000.0 / 23.98 =   667.33
BAD:    22050.0 / 23.98 =   919.66
BAD:    44100.0 / 23.98 =   1839.33
BAD:    48000.0 / 23.98 =   2001.99
BAD:    88200.0 / 23.98 =   3678.67
BAD:    96000.0 / 23.98 =   4003.99
BAD:    176400.0 / 23.98 =  7357.34
BAD:    192000.0 / 23.98 =  8007.99
BAD:    384000.0 / 23.98 =  16015.99
BAD:    3528000.0 / 23.98 = 147147.0
BAD:    8000.0 / 24.00 =    333.33
BAD:    11025.0 / 24.00 =   459.37
BAD:    16000.0 / 24.00 =   666.66
BAD:    22050.0 / 24.00 =   918.75
BAD:    44100.0 / 24.00 =   1837.5
GOOD:   48000.0 / 24.00 =   2000
GOOD:   88200.0 / 24.00 =   3675
GOOD:   96000.0 / 24.00 =   4000
GOOD:   176400.0 / 24.00 =  7350
GOOD:   192000.0 / 24.00 =  8000
GOOD:   384000.0 / 24.00 =  16000
GOOD:   3528000.0 / 24.00 = 147000
GOOD:   8000.0 / 25.00 =    320
GOOD:   11025.0 / 25.00 =   441
GOOD:   16000.0 / 25.00 =   640
GOOD:   22050.0 / 25.00 =   882
GOOD:   44100.0 / 25.00 =   1764
GOOD:   48000.0 / 25.00 =   1920
GOOD:   88200.0 / 25.00 =   3528
GOOD:   96000.0 / 25.00 =   3840
GOOD:   176400.0 / 25.00 =  7056
GOOD:   192000.0 / 25.00 =  7680
GOOD:   384000.0 / 25.00 =  15360
GOOD:   3528000.0 / 25.00 = 141120
BAD:    8000.0 / 29.97 =    266.93
BAD:    11025.0 / 29.97 =   367.86
BAD:    16000.0 / 29.97 =   533.86
BAD:    22050.0 / 29.97 =   735.73
BAD:    44100.0 / 29.97 =   1471.47
BAD:    48000.0 / 29.97 =   1601.60
BAD:    88200.0 / 29.97 =   2942.94
BAD:    96000.0 / 29.97 =   3203.20
BAD:    176400.0 / 29.97 =  5885.88
BAD:    192000.0 / 29.97 =  6406.40
BAD:    384000.0 / 29.97 =  12812.80
BAD:    3528000.0 / 29.97 = 117717.6
BAD:    8000.0 / 30.00 =    266.66
BAD:    11025.0 / 30.00 =   367.5
BAD:    16000.0 / 30.00 =   533.33
GOOD:   22050.0 / 30.00 =   735
GOOD:   44100.0 / 30.00 =   1470
GOOD:   48000.0 / 30.00 =   1600
GOOD:   88200.0 / 30.00 =   2940
GOOD:   96000.0 / 30.00 =   3200
GOOD:   176400.0 / 30.00 =  5880
GOOD:   192000.0 / 30.00 =  6400
GOOD:   384000.0 / 30.00 =  12800
GOOD:   3528000.0 / 30.00 = 117600
GOOD:   8000.0 / 50.00 =    160
BAD:    11025.0 / 50.00 =   220.5
GOOD:   16000.0 / 50.00 =   320
GOOD:   22050.0 / 50.00 =   441
GOOD:   44100.0 / 50.00 =   882
GOOD:   48000.0 / 50.00 =   960
GOOD:   88200.0 / 50.00 =   1764
GOOD:   96000.0 / 50.00 =   1920
GOOD:   176400.0 / 50.00 =  3528
GOOD:   192000.0 / 50.00 =  3840
GOOD:   384000.0 / 50.00 =  7680
GOOD:   3528000.0 / 50.00 = 70560
BAD:    8000.0 / 59.94 =    133.46
BAD:    11025.0 / 59.94 =   183.93
BAD:    16000.0 / 59.94 =   266.93
BAD:    22050.0 / 59.94 =   367.86
BAD:    44100.0 / 59.94 =   735.73
BAD:    48000.0 / 59.94 =   800.80
BAD:    88200.0 / 59.94 =   1471.47
BAD:    96000.0 / 59.94 =   1601.60
BAD:    176400.0 / 59.94 =  2942.94
BAD:    192000.0 / 59.94 =  3203.20
BAD:    384000.0 / 59.94 =  6406.40
BAD:    3528000.0 / 59.94 = 58858.8
BAD:    8000.0 / 60.00 =    133.33
BAD:    11025.0 / 60.00 =   183.75
BAD:    16000.0 / 60.00 =   266.66
BAD:    22050.0 / 60.00 =   367.5
GOOD:   44100.0 / 60.00 =   735
GOOD:   48000.0 / 60.00 =   800
GOOD:   88200.0 / 60.00 =   1470
GOOD:   96000.0 / 60.00 =   1600
GOOD:   176400.0 / 60.00 =  2940
GOOD:   192000.0 / 60.00 =  3200
GOOD:   384000.0 / 60.00 =  6400
GOOD:   3528000.0 / 60.00 = 58800
eisneinechse commented 3 years ago

Just in case someone is wondering where these strange numbers come from: the NTSC frame rates are 24000/1001, 30000/1001, and 60000/1001, and this 1001 is the problem. Since 1001 shares no factors with 24000, 30000, or 60000, the sample rate has to be a multiple of all three denominators, so after a little math one can see that the only sample rates that work with all of them are multiples of 120000 (their least common multiple):

120000 / (24000/1001) = 5005
120000 / (30000/1001) = 4004
120000 / (60000/1001) = 2002

and this works quite well with the other fps values too. I wonder why nobody is using 120000 samples per second.

jonoomph commented 3 years ago

This is the point where the audio gets messed up (in Timeline.cpp). As you can see, it resizes the audio samples of Timeline::new_frame to match the source clip frame. However, this code is called for each overlapping Frame on the Timeline. So, as you can imagine, if there are multiple overlapping Frame instances with slightly different sample counts... the behavior is hard to predict.

// TODO: Improve FrameMapper (or Timeline) to always get the correct number of samples per frame.
// Currently, the ResampleContext sometimes leaves behind a few samples for the next call, and the
// number of samples returned is variable... and does not match the number expected.
// This is a crude solution at best. =)
if (new_frame->GetAudioSamplesCount() != source_frame->GetAudioSamplesCount())
    // Force timeline frame to match the source frame
    #pragma omp critical (T_addLayer)
    new_frame->ResizeAudio(info.channels, source_frame->GetAudioSamplesCount(), info.sample_rate, info.channel_layout);

For example, let's say several overlapping Frame instances are merged into the same Timeline frame, each with a slightly different number of samples, and the last one processed contains 999 samples.

After the Timeline is done, new_frame would have the # of samples based on the last processed Frame (999 in this example). So, just depending on the "last frame processed", it might remove audio samples or add silence/padding, both of which can be audible as a gap in the audio waveform. How audible depends on the underlying waveform, but often you can hear a small pop/click/crackle.

ferdnyc commented 3 years ago

Frame: Design Goal It was always my vision/design that a video could be represented by openshot::Frame objects, and the Frame would be a dependable and self-contained data structure, and would contain all the data needed to represent a Frame of video (or really a slice of time).

I think we'd be remiss, though, if we didn't at least examine that decision (in the context of this issue, specifically), at least to avoid being artificially limited by it in how we attempt to address this. There are very good arguments for that design. I know some of them, but there are probably others I'm missing.

But there are also some reasonable arguments for why it can be problematic, including everything we're discussing here. The simple fact of the matter is, video frames aren't a span of time -- they're an instant of time, a snapshot during which nothing changes for however long it's displayed. That makes them inherently at odds with audio data, which is constantly in flux throughout the life of the frame.

Frame rate mapping is where this conflict is most apparent. To remap the video frames, all you do is slide them around and decide which ones go where, possibly duplicating or removing some at intervals. But for the audio, none of that is correct. The only really correct thing to do with the audio is to stitch all of the frames' samples back together into a continuous stream, then chop them back up again into new-frame-rate sized chunks.

But then the obvious question has to be asked: why chop them up in the first place, if there's any chance they might just have to be put back together and chopped up again? I feel like there's room to interpret that very thing as a red flag about how we're managing the audio stream (which is what it is -- or is supposed to be anyway).

So, let's at least put it on the table: What if we separated the audio from the frame objects, and stored it in its own buffer as one continuous stream?

Clearly this creates the possibility of a/v synchronization issues -- that's danger number one. If we were to do this, there would be problems in that regard that aren't there now, simply because when every block of audio samples is stored with its frame image, they can't get very far from each other.

But I think that downside needs to be viewed in the context of the issues we're experiencing now. It's not a new problem, it's a tradeoff against other problems.

And, the basic Frame structure can still retain aspects of the current design. For instance, instead of storing all of its audio samples, what if each and every Frame object held a pointer to a specific sample position in the audio buffer? Frame rate remapping would just become a question of repositioning all of those pointers, the same way the frames themselves are repositioned on the Timeline.
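
As a rough sketch of that idea (hypothetical types, single channel, not current libopenshot code): the reader owns one continuous decoded stream, and each Frame only records where its slice starts and how long it is. Remapping frame rates then just means repositioning those offsets.

// Hypothetical layout for the shared-audio-stream idea (illustrative only).
#include <cstdint>
#include <memory>
#include <vector>

struct AudioStream
{
    std::vector<float> samples;   // one continuous decoded stream, owned by the reader
};

struct FrameAudioRef
{
    std::shared_ptr<AudioStream> stream;  // shared with the reader that decoded it
    std::int64_t start = 0;               // first sample belonging to this frame
    int count = 0;                        // number of samples in this frame's slice

    const float* data() const { return stream->samples.data() + start; }
};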

There are other potential upsides to this, too. Like, for instance, the video effects have absolutely no need to receive the audio data for the frames they're processing. It's a fairly lightweight operation, sure, since the frames are passed around by shared pointer -- but it's still unnecessary baggage. While the effect is processing the video image, another thread could be busy working on the audio buffer, perhaps resampling it (which, by the same token, doesn't require any of the video frames).

I think it would make encoding cleaner, too. Right now, there's all this code in FFmpegWriter for, "Oh, we have extra samples from this packet, so we have to stick them in a buffer to hold for the next one, and make sure to flush it at the end..." It mostly works, but there are known bugs we've so far failed to find with that EXACT process, and more to the point that code is so unreadably spaghettified at this point that we may never find them.

Encoding audio from a stream? You read, you advance your read pointer, and then you're done until the next read. Done. There's a lot to be said for that.

*shrug* I wanted to at least throw it out there. More later, I'm out of the house for the afternoon.

ferdnyc commented 3 years ago

@ferdnyc I'm also remembering a discussion we had about some potential audio improvements we could make, including not casting audio sample data types, and using the Juce audio buffer in the type we decode? I don't recall the specifics, but feel free to jump in with any of those specifics too, if you think it would help simplify the audio pipeline.

Yeah, oh that... the thing I'd commented on, at the time, was the fact that we're always:

  1. Storing audio in Frame objects as floating-point samples
  2. Initially converting it (I believe always, or at least nearly-always) from 16-bit integer samples, by hand, on the way in from FFmpegReader
  3. Then on output, first converting it (I believe always) and again by hand, from floating-point back to 16-bit integer samples
  4. And possibly converting it one more time into the output format, if the output format is something other than 16-bit integer.

I believe at the time I described that as "nuckin' futz." :rofl: I stand by that assessment. Fight me!

Even ignoring what I wrote in my previous comment regarding audio streams vs. frame samples, if nothing else there's room to adjust the sample-processing pipeline to be more efficient.

The samples in the openshot::Frame objects are stored in a buffer of type juce::AudioSampleBuffer. For quite some time now, juce::AudioSampleBuffer has been nothing but a typedef for juce::AudioBuffer<float>. As the templated type definition implies there, we're not in any way limited to just storing floating-point samples in juce::AudioBuffer. It supports lots of other sample formats besides float; every other available format, in fact.

There's no reason we couldn't store the Frame samples as juce::AudioBuffer<Int16LE> instead. (Which, for the majority of cases where we don't really need to manipulate the samples, is fiiine since it's exactly how they came in and it's exactly how they'll be going out again — why perform two conversions along the way?)

When we do have to convert or manipulate the samples, JUCE has lots of tools I think we're underutilizing, like the AudioBuffer class's broad selection of buffer-wrangling methods.

Here's what we use of those, currently:

  1. applyGain()
  2. addFrom()
  3. applyGainRamp(), in one place.

Here's what we're missing out on:

  1. Making good use of applyGainRamp()
  2. The equivalent addFromWithRamp() for smoother audio mixing (Heck, I wouldn't be surprised if that one applyGainRamp() in our code is followed by passing the result to an addFrom(), when that could all be done in one step; see the sketch below this list. ...Hah! I just checked, it totally is 100% doing exactly that. See Note 1 below.)
  3. A whole host of copyFrom() equivalents (including ones that apply linear gain and a set of WithRamp() versions)
  4. Two reverse() methods that can operate on either a portion of a channel or a portion of the buffer, which... if that exists, then why does this code also (still) exist?

https://github.com/OpenShot/libopenshot/blob/f71051e8f1add0b893ffaa9a799625017978e7f8/src/Clip.cpp#L378-L403

(Worse, our reverse_buffer does a double copy, reversing into a new buffer and then copying it over the original. I was going to say that I'd bet money the JUCE version is an efficient in-place reverse, but then I looked and it's soooo much worse than that. Wanna see the JUCE version? Check Note 2.)
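
To illustrate item 2 in the list above, here's a hedged sketch (not the actual Timeline code; the include path and buffer setup are assumptions) of collapsing the two-step applyGainRamp() + addFrom() mix into a single addFromWithRamp() call:

// Sketch: mixing `source` into `dest` with a gain ramp in one call, instead of
// applying the ramp to a scratch copy and then adding it.
#include <juce_audio_basics/juce_audio_basics.h>   // adjust the include to match the build setup

void mix_with_ramp(juce::AudioBuffer<float>& dest,
                   const juce::AudioBuffer<float>& source,
                   float start_gain, float end_gain)
{
    for (int ch = 0; ch < source.getNumChannels(); ++ch)
    {
        // Two-step version (roughly what the current code does):
        //   scratch.applyGainRamp(ch, 0, num, start_gain, end_gain);
        //   dest.addFrom(ch, 0, scratch, ch, 0, num);
        // One-step version:
        dest.addFromWithRamp(ch, 0,
                             source.getReadPointer(ch),
                             source.getNumSamples(),
                             start_gain, end_gain);
    }
}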

And then there's the AudioData container class's AudioData::ConverterInstance<SourceSampleFormat, DestSampleFormat>, a templated conversion interface between unlike buffers which has replaced their entire raft of conversion utility functions.

When that exists, why are the majority of our sample conversions performed by hand using tight for() loops? Why are any of them performed that way?

Notes

  1. This makes me cry a little bit:

https://github.com/OpenShot/libopenshot/blob/f71051e8f1add0b893ffaa9a799625017978e7f8/src/Timeline.cpp#L479-L495

  2. Here's both methods. IN. THEIR. ENTIRETY.

    /** Reverses a part of a channel. */
    void reverse (int channel, int startSample, int numSamples) const noexcept
    {
        jassert (isPositiveAndBelow (channel, numChannels));
        jassert (startSample >= 0 && numSamples >= 0 && startSample + numSamples <= size);
    
        if (! isClear)
            std::reverse (channels[channel] + startSample,
                          channels[channel] + startSample + numSamples);
    }
    
    /** Reverses a part of the buffer. */
    void reverse (int startSample, int numSamples) const noexcept
    {
        for (int i = 0; i < numChannels; ++i)
            reverse (i, startSample, numSamples);
    }

Let's review our less-flexible version again, for the sake of comparison:

// Reverse an audio buffer
void Clip::reverse_buffer(juce::AudioSampleBuffer* buffer)
{
    int number_of_samples = buffer->getNumSamples();
    int channels = buffer->getNumChannels();

    // Reverse array (create new buffer to hold the reversed version)
    juce::AudioSampleBuffer *reversed = new juce::AudioSampleBuffer(channels, number_of_samples);
    reversed->clear();

    for (int channel = 0; channel < channels; channel++)
    {
        int n=0;
        for (int s = number_of_samples - 1; s >= 0; s--, n++)
            reversed->getWritePointer(channel)[n] = buffer->getWritePointer(channel)[s];
    }

    // Copy the samples back to the original array
    buffer->clear();
    // Loop through channels, and get audio samples
    for (int channel = 0; channel < channels; channel++)
        // Get the audio samples for this channel
        buffer->addFrom(channel, 0, reversed->getReadPointer(channel), number_of_samples, 1.0f);

    delete reversed;
    reversed = NULL;
}
ferdnyc commented 3 years ago

(This entire line of thought should maybe be a completely separate conversation, since it's totally unrelated to audio handling, and I'm about to take it even farther off topic. Just say the word and I'll split any/all of our relevant comments out to a new Issue of their own. But, since it's come up...)

@jonoomph

While on the topic of smarter Clip instances, I also really love the idea of refactoring all QImage and QTransform code into the Clip::GetFrame method, so that all Keyframes on a Clip are "resolved" into pixels prior to returning a Frame object. This would allow the Timeline to be even dumber, so it can just combine the images blindly for overlapping Frame instances.

Resolution-dependence

I have one primary reservation about that, which is that doing it correctly requires Clips to know the output resolution, and to generate frames in that resolution. If a 1080p Clip were to always blindly generate 1080p Frames, even when exporting at (say) 4K, then that would be a step backward, since the current compositing that happens in the Timeline is done at output resolution. That makes a difference in certain scenarios, for instance when it's compositing media at unlike resolutions.

Say you're creating a project to export as a 4K video. You have a 1080p clip that you import, scale down to 50% size, and overlay onto the 4K-sized background as an insert.

Because scaling is done by the Timeline currently, that 1080p video at 50% scale will be overlaid onto a 4K background at its full resolution with no loss of detail. When the Timeline's QPainter composites it into the output frame, it'll be drawing a 1920×1080 image into a quarter-size region (both dimensions halved) of a 4K frame, which would measure exactly 1920×1080.

So, to preserve at least the level of detail we currently have, Clips generating Frames for the Timeline need to know not only their own image resolution, but also what resolution to generate the Frames at.

OpenShot::Settings

My first version of this comment was an extremely long reply that I ultimately decided not to post. (Seriously... 2714 words, my text editor says; so, yes, believe it or not this is the significantly "short" version.) The majority of it was devoted to an extended rant about OpenShot::Settings and what a disaster I feel it's been for the libopenshot codebase.

As I said, I'm not posting that version, but suffice it to say I absolutely do feel that way, I have a pretty extensive argument to back it up, and solving this by using MAX_WIDTH and MAX_HEIGHT to control the Frame output resolution would be the wrong approach. MAX_WIDTH and MAX_HEIGHT (along with HIGH_QUALITY_SCALING and all of OpenShot::Settings' other side effects masquerading as an API) are utterly broken and need to go away.

If, OTOH, both the Timeline::GetFrame() and Clip::GetFrame() APIs were enhanced to take (optional, of course) arguments for output width and height, so that Frames could be requested from them in a specific resolution, then that would improve not only this case, but a number of others where code is currently forced to fight with MAX_WIDTH and MAX_HEIGHT just to accomplish simple things.

Timeline API issues (caused by both OpenShot::Settings and the timeline cache)

Primary example of code (our own f---ing code!) having to fight against the Timeline API to accomplish basic things: The nonsense code in OpenShot's actionSaveFrame_Trigger. Capturing a single frame from the Timeline at full resolution currently requires it to:

  1. Move aside the Timeline cache
  2. Create a new cache to use for the snapshot frame
  3. Change the values of things like MAX_WIDTH and MAX_HEIGHT in OpenShot::Settings which control how the frame is generated
  4. Request the needed Frame from the Timeline (which triggers it to render and cache an entire range of frames, despite the rest of them being completely unnecessary)
  5. Have the Frame's image written to the filesystem as a PNG
  6. Restore the OpenShot::Settings values back to their previous state
  7. Swap the Timeline's previous cache back into place (after which the Timeline still won't display the cache state properly, due to a bug in the code that draws the cache line. Oddly enough it wasn't designed to handle crazy shenanigans of this sort.)

That is absolutely insane and should never be necessary under any circumstances, but due to the current Timeline API there's no better way to do it.

The export code is forced to resort to similar shenanigans anytime it needs the Timeline to render an output video without ruining the quality — which honestly makes this even worse, considering video export is supposed to be one of libopenshot's primary functions.

Now, if the Timeline had a Timeline::GetFrame(frame_num, width, height, scaling_quality) method, which automatically ignored its cache and in turn made whatever Clip::GetFrame(frame_num, width, height, scaling_quality) calls are necessary for it to assemble the requested Frame, then that would actually solve multiple problems, and the API would at the very least get less broken.
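
Here's a hedged sketch of the API shape I'm suggesting (hypothetical signatures and enum, not existing libopenshot methods): the caller states the resolution and scaling quality it wants per call, instead of mutating global OpenShot::Settings.

// Hypothetical resolution-aware overload (declaration only, illustrative).
#include <cstdint>
#include <memory>

namespace openshot {

class Frame;  // existing libopenshot class

enum class ScaleQuality { Fast, High };  // hypothetical stand-in for HIGH_QUALITY_SCALING

class Timeline
{
public:
    // Assemble one frame at an explicit output resolution, bypassing the cache and
    // forwarding width/height/quality to every Clip::GetFrame() call it makes.
    std::shared_ptr<Frame> GetFrame(std::int64_t frame_number,
                                    int width, int height,
                                    ScaleQuality quality = ScaleQuality::High);
};

} // namespace openshot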

ferdnyc commented 3 years ago

I did want to post one excerpt from my anti-OpenShot::Settings diatribe, two concrete examples of it causing problems for users (only one of them me):

  1. @jeffski was completely taken by surprise by OpenShot::Settings::Instance()->HIGH_QUALITY_SCALING. Hard to blame him, since it's completely undocumented in the contexts where it's relevant. There is no hint about that variable in any of the API docs for Timeline, Clip, Frame, ReaderBase, WriterBase, or any of their derived classes. And that's especially unfortunate, because it defaults to false, meaning that unless you're aware of the setting and know to change its value, the default may negatively impact the quality of video processed by libopenshot. (#419)

  2. I had to fix the OpenShot::Settings unit tests recently to get libopenshot to build for the upcoming Fedora release. One reason singletons leave a bad taste in many developers' mouths is that they're notoriously hard to test. (How do you write unit tests for code when it's impossible to isolate it from the rest of the system?) Sure enough, that bit us on the ass. (Or, bit us again. I'd actually hit this once before doing Windows builds.) The settings changes that tests/TestSettings.cpp made to test the class were then affecting the rest of the unit tests. In the worst cases those kinds of side effects can (and did!) lead to crashes or unexpected changes in the behavior of unrelated code. (#557)

ferdnyc commented 3 years ago

The settings changes that tests/TestSettings.cpp made to test the class were then affecting the rest of the unit tests.

(BTW, because of this I now see the wisdom in the way tests are managed using CTest. (That's CMake's unit-testing framework.) In CTest each individual unit test is meant to be built and run as a separate program. A test directory (or, usually, directory tree) that's littered with potentially hundreds of separate source files, all being compiled into hundreds of separate executables, seems like a cluttered and confusing way to run tests (because it is). But doing that would protect against unexpected interactions that can occur if the tests are compiled into the same binary. Executing the tests in separate execution environments isolates them from each other, even in cases where some of the code has systemic side effects.)

jonoomph commented 3 years ago

There's no reason we couldn't store the Frame samples as juce::AudioBuffer<Int16LE> instead. (Which, for the majority of cases where we don't really need to manipulate the samples, is fiiine since it's exactly how they came in and it's exactly how they'll be going out again — why perform two conversions along the way?)

@ferdnyc Thanks for the detailed feedback, as always! I agree with most (maybe all) of your suggestions, so no arguments here. I'll respond to each point in a separate message, but regarding the Audio pointer... Having a Frame instance with a pointer to the Reader->audio_buffer, and some variables to track the start and length of audio samples related to the Frame, is very interesting. It is mostly hidden behind the scenes, and our Frame::GetAudioSamples() and Frame::GetAudioSamplesCount methods would still hide the implementation details (for most users). This seems like a no-brainer.

The benefits we receive include:

Potential drawbacks:

Overall though, especially if we combine this with using the JUCE audio buffer typedef correctly, this could really simplify (and purge) a bunch of libopenshot code. And that would be an improvement.

jonoomph commented 3 years ago

After discussing with @BrennoCaldato extensively, here is my current thinking.

Example 1: Imagine a user randomly clicking/seeking throughout a Clip; each time it calls Seek() on the reader, and each seek is not guaranteed to match the sample positions of a previous seek. So each seek would need its own audio buffer, and Frames would be linked to the different buffers, creating all sorts of strange problems. This is actually similar to the current approach, where Frame objects hold their own sample data, and Seeks make the Frames slightly incompatible with each other (if they are near each other). In other words, it ends up being very, very similar to our current approach, except the variable number of audio buffers makes things more complex in my opinion... see the next example.

Example 2: Imagine a 2 hour Clip (1 source video), with GetFrame being called on the reader continuously (i.e. during an export). The shared audio buffer approach would grow indefinitely, filling with uncompressed audio data, and we would only have the 20 or 30 most recent Frame instances still in memory pointing to the more recent samples. But how would we know to remove the old, no longer needed samples (since it's all one huge sample buffer)? Multiply this by many Clips with long videos, and you could quickly run out of memory. And resizing the array constantly, or copying the recent samples into a new audio buffer, seems similar to the problem we have now (lots of small audio sample buffers). This example is worse than our current approach, in which each Frame holds its own samples; that solves many caching/memory issues automatically, because Frames simply fall out of memory.

Example 3: We have dozens of Frame instances all linked to Clip A. Clip A is no longer needed (the user has clicked to a far away timeline location), and thus Clip A calls Close() and is deleted. However, we still have dozens of cached Frame instances that need their audio data. If closing a Clip instance and deleting it is no longer permitted, we again have some serious memory issues with ever-growing arrays of uncompressed audio data.

Example 4: If we refuse to create multiple audio sample buffers (one for each Seek), and we try to use a single one, we would be adding samples from different locations into the same buffer (due to seeking). This would effectively mix many locations of audio sample data together, requiring more work to keep things sorted and identifiable. And it still has all the same memory issues mentioned before. How do we identify what portions of the array/buffer are no longer needed? Too much complexity here.

As much as I love the concept of a single, shared audio buffer (owned by the Reader), and all Frame instances pointing to a different location on it, I think it is ultimately incompatible with random seeks, memory limits, and would end up being more complex than the current approach. :disappointed:

jonoomph commented 3 years ago

Ultimately, the way JUCE handles audio sources is a good hint at how OpenShot should approach things. When samples are needed, the owner (i.e. the Timeline in our case) requests the # of samples it needs (i.e. getNextAudioBlock(const juce::AudioSourceChannelInfo& info), or in our case Clip::GetFrame(int64 frame_number, int num_samples)). When JUCE is mixing together audio buffers, it requests a specific # of samples from all the sources it needs.

I like the approach of the Timeline requesting exactly what it needs from a Clip:

For example: Clip::GetFrame(int64 frame_number, int num_samples, int width, int height), and the Frame instance that is returned matches these specifications perfectly. It solves all our Timeline audio merging issues. The only requirement would be to request Frame instances in sequence, which we can do with a #pragma omp ordered. And the Clip GetFrame() overload might need to move a few extra samples to the next frame, or steal a few from the next frame (if needed). But regardless of how it does it, it needs to return the exact # of audio samples.
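
A hedged sketch of that signature (hypothetical, declaration only, not the final implementation): the Timeline states exactly how many samples the returned Frame must carry, and sequential calls let the Clip shift surplus samples to, or borrow missing ones from, the following frame.

// Hypothetical exact-sample-count overload (declaration only, illustrative).
#include <cstdint>
#include <memory>

namespace openshot {

class Frame;  // existing libopenshot class

class Clip
{
public:
    // Returns a Frame holding exactly `num_samples` audio samples and an image
    // rendered at `width` x `height`. Callers must request frames in sequence so
    // surplus or missing samples can be carried between adjacent frames.
    std::shared_ptr<Frame> GetFrame(std::int64_t frame_number, int num_samples,
                                    int width, int height);
};

} // namespace openshot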

jonoomph commented 3 years ago

Resolved in https://github.com/OpenShot/libopenshot/pull/571