ArdenButterfield / stammer

Recreate any audio track by rearranging the frames of another video
MIT License

composition of audio match frames rather than selection of single match frames #23

Open · egoughnour opened this issue 1 year ago

egoughnour commented 1 year ago

If the audio match frames are a linear combination of the best matches, then the audio output could be tuned to be more or less like the source audio--truncation at one element is effectively the current implementation. This also leaves open the possibility of interesting combinations of the visual frames (i.e., the visual frames corresponding to the audio matches [basis vectors]). One straightforward visualization would be vertical slices, with each slice's width the fraction of the frame corresponding to its coefficient in the linear combination.
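A minimal sketch of the combination step, assuming `target` is one spectral frame of the source audio and `candidates` holds the carrier's spectral frames as rows (names hypothetical, not from the repo):

```python
import numpy as np

def combine_matches(target, candidates, k=4):
    """Fit `target` as a linear combination of its k nearest candidate
    frames; truncating to k=1 recovers the current single-match behavior."""
    dists = np.sum((candidates - target) ** 2, axis=1)
    best = np.argsort(dists)[:k]          # indices of the k best matches
    # least-squares coefficients of the combination
    coeffs, *_ = np.linalg.lstsq(candidates[best].T, target, rcond=None)
    return best, coeffs
```

The returned coefficients are exactly the weights the visualizations would use: slice widths here, bar heights in the equalizer idea below.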

Another possibility is a graphic equalizer-like effect, though it would only work well if there are time periods with relatively moderate changes to the basis vectors/frames. Each bar in the equalizer then represents a frame, with its coefficient as the bar's height. A bar could drop out as the influence of the particular frame fell to zero.

[image: geq]
egoughnour commented 1 year ago

The linear combination matching is theoretically sound, if that matters. Every finite-dimensional inner product space (such as a vanilla Euclidean vector space in N dimensions) is a Hilbert space, so the Hilbert projection theorem applies. I am reasonably certain, then, that

np.argmin(np.sum((x - m) ** 2, axis=1))

is functionally the same as

np.argmax(np.sum(x * m, axis=1))

The other Hilbert space properties are also nice. One consequence is that most of the checking can be removed from create_output_audio() if we construct linear combinations in this way.
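A quick numerical check of the equivalence; note it holds exactly when the rows of m are unit-norm, since ||x - m_i||^2 = ||x||^2 - 2<x, m_i> + ||m_i||^2:

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(size=(100, 32))
m /= np.linalg.norm(m, axis=1, keepdims=True)   # unit-norm candidates
x = rng.normal(size=32)

by_distance = np.argmin(np.sum((x - m) ** 2, axis=1))
by_inner_product = np.argmax(np.sum(x * m, axis=1))
assert by_distance == by_inner_product
```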

egoughnour commented 1 year ago

#22: tiling is complete in image_tiling.py (though not incorporated into stammer.py yet).

The idea is similar to the use of vertical slices, but it takes a different approach to dividing the frame.

TL;DR: walk through the bits of each frame's fractional contribution to the output. The hot bits across all frames sum to one, since the coefficients form a partition of unity (and each expansion is basically a geometric series with common ratio 1/2, because it's represented in binary).

[image: composite_image]

This means that, to within the limits of the precision available (when displayed on the integer grid of the frame), we can exactly represent the contribution of each frame to the output. For instance, if we have

frame_a * 2/7 + frame_b * 4/7 + frame_c * (1 - (2 + 4)/7) == composite_frame

then 2/7 and 4/7 of the output video frame are filled with frame_a and frame_b thumbnails, respectively. The remainder is filled with frame_c thumbnails.
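A sketch of the bit-walking allocation, assuming coefficients that sum to one; this illustrates the idea only, not necessarily how image_tiling.py divides the frame:

```python
import numpy as np

def dyadic_slices(coeffs, width, max_bits=16):
    """Partition `width` columns among frames by walking the binary
    expansion of each coefficient: every hot bit at place value 2**-b
    claims a slice of width // 2**b columns."""
    owner = np.full(width, -1, dtype=int)    # which frame owns each column
    cursor = 0
    for b in range(1, max_bits + 1):
        slice_w = width >> b                 # width * 2**-b, floored
        if slice_w == 0:
            break
        for i, c in enumerate(coeffs):
            if (int(c * (1 << max_bits)) >> (max_bits - b)) & 1:  # hot bit?
                owner[cursor:cursor + slice_w] = i
                cursor += slice_w
    owner[cursor:] = int(np.argmax(coeffs))  # rounding remainder
    return owner

print(dyadic_slices([2/7, 4/7, 1/7], width=64))  # the example above
```

With width 64 this assigns 18, 37, and 9 columns, matching 2/7, 4/7, and 1/7 to within the pixel grid.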

ArdenButterfield commented 1 year ago

Hi Erik, sorry I've been MIA the past couple of days. This is a super cool idea; I tested your branch and I'm digging the sound. I may not be following the plot 100%, so forgive me if this is a stupid question, but I'm not sure that linearly combining spectra works quite as simply as one could hope here, because of phase issues: imagine two frames, both with a sine wave at 1 kHz, just one shifted in phase by 180°. They'll have the same spectrum (since we are only concerned with magnitude and not phase, and since we average nearby frequency lines on top of that), but adding the two frames together won't make a 1 kHz tone at twice the amplitude; instead they would cancel out.
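A quick numerical check of that cancellation:

```python
import numpy as np

sr = 44100
t = np.arange(1024) / sr
a = np.sin(2 * np.pi * 1000 * t)            # 1 kHz sine
b = np.sin(2 * np.pi * 1000 * t + np.pi)    # same tone, 180 degrees out

# identical magnitude spectra...
assert np.allclose(np.abs(np.fft.rfft(a)), np.abs(np.fft.rfft(b)))

# ...but summing the frames cancels instead of doubling
print(np.max(np.abs(a + b)))  # on the order of 1e-13: effectively silence
```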

I really like the direction you're going with this! (And maybe your code does address phase issues and I just missed that part)

As for combining the video frames, I like both of the options you've given so far... I suppose this is more of an art question than a science question. Part of me wonders what it would be like to overlay the frames on top of each other, with opacity determined by how much they contribute to the overall sound. It would probably look like a jumbled mess, but I suppose that's the name of the game.
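That overlay would just be an alpha-weighted sum; a minimal sketch, assuming HxWx3 uint8 frames and coefficients that sum to one:

```python
import numpy as np

def overlay(frames, coeffs):
    """Blend candidate frames, each frame's opacity proportional to its
    contribution to the matched audio."""
    stack = np.stack(frames).astype(np.float32)            # (n, H, W, 3)
    w = np.asarray(coeffs, np.float32)[:, None, None, None]
    return (stack * w).sum(axis=0).clip(0, 255).astype(np.uint8)
```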

I suppose we could also give the user options.

Anyway, I really like what you've got going here, and I'm excited to dive back into the code. I should have more time to work on it this week, I was just busy over the weekend.

egoughnour commented 1 year ago

OK, I thought about this a little more, and among other things I noticed that the frames are being dealt with in terms of the power spectrum immediately after the Fourier transform. (That is, we have real vectors, not complex ones.) Another aspect of the problem is how linearly dependent the phase content of a contiguous sequence of frames can be. So assumptions about either set of frames forming a pre-Hilbert space are wrong. The good thing is this gives an obvious direction to improve in.

Start from the pool of candidate frames and take as many frames (-> FT -> spectral coefficient vectors) as needed to make a square matrix (whatever length the spectral vectors are, that determines the number of vectors needed). These have no guarantee of linear independence. Because there is no orthonormal basis anywhere, carry out complex Principal Component Analysis to get the most useful basis in k basis vectors, where k is the selected final, smaller rank/dimension to deal with. This basis is relevant to the current batch of candidate audio frames. Projection onto the basis vectors (by the complex inner product--conjugating the second argument before carrying out the normal dot product) gives a change of basis. This is the same for frames from both signals. Once there is a common orthonormal basis, the assumptions about projection become valid again.
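A sketch of the batch basis and change of basis, assuming `batch` holds the complex spectral vectors of one batch as rows (SVD standing in for the complex PCA; names hypothetical):

```python
import numpy as np

def complex_pca_basis(batch, k):
    """Orthonormal basis (k rows) for a batch of complex spectral
    vectors, via SVD of the centered batch."""
    X = np.asarray(batch, dtype=np.complex128)
    X = X - X.mean(axis=0)                 # center the batch
    _, _, Vh = np.linalg.svd(X, full_matrices=False)
    return Vh[:k]                          # rows are orthonormal

def change_basis(vectors, basis):
    """Coordinates under the complex inner product <v, b_j>, i.e.
    conjugate the basis vectors before the usual dot product."""
    return vectors @ basis.conj().T
```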

For each batch of vectors, a new basis would be computed; the signal to be approximated would then be projected onto it, as would the batch itself.

The total output of this operation would be, for each basis: a projection of the signal to be approximated, the indices of the frames in the originating batch, and their projections onto that basis. Then find the basis in which the target signal is best approximated. I don't think per-frame is the way to go, but on the other extreme it clearly defies the basic intent of the process (and user expectations) to use only a fraction of the possible frames to stand in for the original signal. So select some fraction of the target signal to be approximated, and measure the complex inner products of its projected spectral vectors against each set of projected candidate frames. The basis chosen will ideally be one that has large coefficients in the first few basis vectors.
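One way to score a basis along those lines, as a sketch: the fraction of the target's energy captured by its leading coefficients.

```python
import numpy as np

def basis_score(target_coords, lead=4):
    """Fraction of the projected target's energy in the first `lead`
    coefficients; prefer the basis that maximizes this."""
    energy = np.abs(target_coords) ** 2
    return energy[:lead].sum() / energy.sum()
```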

This could be constrained so that the selected basis is not allowed to move too quickly through the time domain. I think that would be enough to prevent phase discontinuities, and it still leaves a lot of granularity in the control over the process.