ern2150 / FVCR


Automated Content Matching Endeavors #13

Open ern2150 opened 3 years ago

ern2150 commented 3 years ago

https://github.com/ern2150/FVCR/blob/example-stats/script-examples.md#video-match

General idea is to be able to find common video clips across one or more video files.

Slightly more specific/relevant would be the ability to identify all the Vortex clips without manually scrubbing around for them. There are lots of side-benefits and implications that will be described later. The next most relevant ability would be identifying the timecode of those common clips across one or more broadcasts, or even more specifically, one or more mixtapes.

The idea is that when you record a new broadcast or obtain a new mixtape, you can analyze its collection of video fingerprints and report on which clips it shares with other broadcasts/mixtapes, and the times they begin and end in your new video file. This might also reveal new common clips, e.g. a clip that until now had only appeared in one other already-analyzed video file.

The actual tools and code necessary to automate this might be detailed enough to require a separate write-up, and some early tests with those are documented in the link above. Discussions below will focus first on the workflow and organization of the information, and assume we have a way to extract it.

ern2150 commented 3 years ago

Video fingerprints could be stored in at least one collection, with at least six fields:

| Nickname | fingerprint id | file count | video file | start time | end time |
| --- | --- | --- | --- | --- | --- |
| BERL | a1b2c3d4... | 200 | mix14.mp4 | 01:30:00 | 01:32:30 |
| | | | mix64.mp4 | 00:56:30 | 00:59:00 |
| ... | ... | ... | ... | ... | ... |
| FUN | f9e8d7c6... | 10 | mix32.mp4 | 00:05:45 | 00:05:52 |
| | | | | 00:49:15 | 00:49:22 |
| | | | mix45.mp4 | 00:10:30 | 00:10:37 |
| ... | ... | ... | ... | ... | ... |
| Unique66.1 | a5f0b4e9... | 1 | mix66.mp4 | 00:03:15 | 00:04:20 |

This collection could then be sorted to provide meaning. In the example above, it’s sorted so the most common clips are at the top, and the most unique at the bottom.

It could also be sorted by the “video file” and then by “start time” to show a progression through a mixtape or broadcast, with the changes in the “Nickname” and “file count” fields providing additional meaning, e.g. “Oh cool, this mixtape starts with Mary instead of Jack, and doesn’t have any Hanuman in it.”
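
To make the shape concrete, here is a minimal sketch of that collection as plain Python records, with both of the sorts described above (the field names and values are illustrative only, not a committed schema or storage choice):

```python
# Minimal sketch of the collection above as plain Python records; field names
# and values are illustrative only, not a committed schema or storage choice.
from operator import itemgetter

collection = [
    {"nickname": "BERL", "fingerprint_id": "a1b2c3d4", "file_count": 200,
     "video_file": "mix14.mp4", "start": "01:30:00", "end": "01:32:30"},
    {"nickname": "BERL", "fingerprint_id": "a1b2c3d4", "file_count": 200,
     "video_file": "mix64.mp4", "start": "00:56:30", "end": "00:59:00"},
    {"nickname": "Unique66.1", "fingerprint_id": "a5f0b4e9", "file_count": 1,
     "video_file": "mix66.mp4", "start": "00:03:15", "end": "00:04:20"},
]

# Most common clips first (the sort used in the table above).
by_commonness = sorted(collection, key=itemgetter("file_count"), reverse=True)

# Progression through each mixtape/broadcast: by video file, then start time.
by_progression = sorted(collection, key=itemgetter("video_file", "start"))
```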

ern2150 commented 3 years ago

So if you get a new video file, you need to fingerprint it first. Depending on the tool, the length of the video, and the configuration, this can produce quite a lot of data. For example, with the proof of concept linked above, a 3-hour video generated more than 7200 clips, each with at least 5 swirls of data more than 200 bits long. This can be condensed to a somewhat human-readable 32-character hex string per clip (using a hashing/checksum algorithm like md5), but that's still 2400 rows of data per hour of video. More can be discussed about how to organize a separate collection, so that the main collection focuses only on common fingerprints and which videos contain them, while the detailed collections cover each individual video.
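
As a rough sketch of that condensing step (assuming the tool hands you some raw bytes per clip), md5 from Python's standard library gives the 32-character hex string:

```python
# Sketch of the condensing step: hash whatever raw data the fingerprinting
# tool emits for one clip down to a 32-character hex string with md5.
import hashlib

def condense(raw_fingerprint: bytes) -> str:
    """Return a 32-character hex id for one clip's raw fingerprint data."""
    return hashlib.md5(raw_fingerprint).hexdigest()

print(condense(b"\x01\x02\x03..."))  # always 32 hex characters
```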

As part of the fingerprinting, you'll also need the start and end times to fit into the collection information above. The tool may provide this for you, or you may need to do some manipulation to store it in a human-readable format. For example, the proof-of-concept tool generates frame numbers. As long as the frame rate is constant and known for the video file, this is a simple equation: frame number divided by frames-per-second, then translated from raw seconds into hours/minutes/seconds for readability.
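
A minimal sketch of that frame-to-timecode conversion, assuming a known, constant frame rate:

```python
# Sketch: frame number -> readable timecode, assuming a constant, known fps.
def frame_to_timecode(frame_number: int, fps: float) -> str:
    total_seconds = int(frame_number / fps)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"

print(frame_to_timecode(162000, 30.0))  # 01:30:00
```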

Once you have that long list of timed fingerprints for that file (in memory or something), you can start to merge them with the main collection.

What do you do if there are matches with already-Nicknamed fingerprints? You'll need to update the existing data (for file count), as well as your new data (for Nickname) across all rows.

Most storage/organization systems will have built-in ways to query them and generate aggregates, so it may not even be necessary to deliberately update the file count as part of that merge process, or even store the file count in the collection at all. It may be as simple as generating a report on the data itself that can count the instances where the particular fingerprint is present.
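
For example, here is a sketch of deriving the file count from the rows themselves rather than storing it (field names follow the earlier illustrative examples):

```python
# Sketch: derive "file count" from the rows themselves by counting how many
# distinct video files contain each fingerprint id.
from collections import defaultdict

def file_counts(rows):
    files_per_fingerprint = defaultdict(set)
    for row in rows:
        files_per_fingerprint[row["fingerprint_id"]].add(row["video_file"])
    return {fp: len(files) for fp, files in files_per_fingerprint.items()}
```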

In any case, you'll want to keep the Nickname you've assigned to an already-analyzed fingerprint with the matching fingerprints from the new file as part of the merge. This could be as simple as looking ahead to see if the fingerprint already has a Nickname before adding the new data to the collection, or you might be able to use a tool that does this easily on your behalf.

What do you do if there are matches with not-yet-Nicknamed fingerprints? Congrats, you've found a new common clip! You should probably inspect it using your favorite media player and the time codes provided to decide on a Nickname. You could also cross-check it with the video files from the existing data and their time codes.

Once you've decided, you can then update all the rows for that fingerprint with the new Nickname, as well as use that Nickname for the fingerprint data you're adding to the collection from the new file. There are probably tools out there that would prompt you for this, and writing a small interactive script to do it wouldn't be terribly difficult.

What do you do if there aren't any matches with fingerprints for some/most/all of the new file? This is probably going to be the case the majority of the time. The lazy thing to do would be to just add the data without giving each clip a Nickname, and letting future matches be the motivator. You could also give each of these fingerprinted clips an automated Nickname that gives just enough information at-a-glance to indicate it came from a certain file or date.
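
Pulling those three cases together, here's a rough sketch of the merge, assuming rows shaped like the earlier examples; `auto_prefix` is a hypothetical parameter (a filename or date, say) used to build automated Nicknames:

```python
# Rough sketch of the merge, covering the three cases above. Rows are dicts
# shaped like the earlier examples; auto_prefix is a hypothetical parameter
# (a filename or date, say) used to build automated Nicknames.
def merge_new_file(collection, new_rows, auto_prefix):
    nicknamed = {r["fingerprint_id"]: r["nickname"]
                 for r in collection if r.get("nickname")}
    known_ids = {r["fingerprint_id"] for r in collection}

    for new in new_rows:
        fp = new["fingerprint_id"]
        if fp in nicknamed:
            # Already-Nicknamed fingerprint: carry the Nickname over.
            new["nickname"] = nicknamed[fp]
        elif fp in known_ids:
            # Not-yet-Nicknamed match: a new common clip. A real script might
            # prompt the user here; this just assigns an automated Nickname
            # and back-fills the existing rows.
            new["nickname"] = f"{auto_prefix}-{fp[:8]}"
            for r in collection:
                if r["fingerprint_id"] == fp:
                    r["nickname"] = new["nickname"]
        else:
            # No match anywhere: leave it with an automated Nickname for now.
            new["nickname"] = f"{auto_prefix}-{fp[:8]}"
        collection.append(new)
    return collection
```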

Now that the collection is updated, you can then refresh your view of it and re-sort/filter (again your tool might do this for you).

ern2150 commented 3 years ago

What would be the path of least interaction something automated could take?

Let’s say you have a directory full of video files, and you just want to run a single script in that directory, which would then give you a report of the collection it made, based on the recommendations above. The report would ideally be short and easy to read.

What if that script could also generate a playable video from each of the newly-matched common clips, and provide links to them in that report? It would probably be helpful to name those videos automatically, perhaps using the names of the fingerprint hashes and the source videos, and maybe even their time codes.
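
One possible sketch of that clip generation, shelling out to ffmpeg with a stream copy and building the filename from the hash, source, and timecodes (the exact options may need tuning for frame-accurate cuts on keyframe boundaries):

```python
# Sketch: cut a playable clip for a newly-matched fingerprint with ffmpeg
# (stream copy, no re-encode) and name it from the hash, source, and times.
# Output-side seeking keeps the command simple; exact options may need tuning.
import subprocess
from pathlib import Path

def export_clip(source, start, end, fingerprint_id, out_dir="clips"):
    Path(out_dir).mkdir(exist_ok=True)
    stem = Path(source).stem
    safe_times = f"{start.replace(':', '')}-{end.replace(':', '')}"
    out_path = f"{out_dir}/{fingerprint_id[:8]}_{stem}_{safe_times}.mp4"
    subprocess.run(
        ["ffmpeg", "-i", source, "-ss", start, "-to", end, "-c", "copy", out_path],
        check=True,
    )
    return out_path
```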

That script could prompt the user to enter Nicknames for those clips.

It could also decide to include those clips as videos in the collection itself. This would be helpful in several ways, but what if, with an extra bit of intelligence, the next time the script ran, it could still find those files, even if they’d been renamed? This way, between the two runs, the user could watch and rename those clips. The script could then use the new name of the clip file as the Nickname for the fingerprint in the collection!

This workflow might change the whole idea of the Nickname as a separate, editable field in the collection. Instead, the Nickname for any given fingerprint would be a reference to the filename of the video clip representing that fingerprint. Identifying that file as “special” might be mildly complex, but maybe that filename only appearing once in the collection would be the clue?
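
A sketch of that “find the clip even after a rename” idea: re-scan the clips directory each run and map fingerprint id to whatever the file is currently called. Hashing the file bytes here is only a stand-in; the real version would re-run the same fingerprinting used everywhere else:

```python
# Sketch: treat the clip file's current name as the Nickname, even if the user
# renamed it between runs. Hashing the file bytes is only a stand-in here; the
# real version would re-run the same fingerprinting used everywhere else.
import hashlib
from pathlib import Path

def nicknames_from_clip_files(clip_dir="clips"):
    nicknames = {}
    for clip in Path(clip_dir).glob("*.mp4"):
        fingerprint_id = hashlib.md5(clip.read_bytes()).hexdigest()
        nicknames[fingerprint_id] = clip.stem  # current, possibly renamed, name
    return nicknames
```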

ern2150 commented 3 years ago

The automated Nickname, or idea of a Clip of Origin, gets a little more complicated with this next idea.

What if you also had access to the actual Origin of a common clip? In other words, what if you also had fingerprinted a whole movie, and a clip from it showed up across multiple mixtapes?

Would you want those movies treated the same way as mixtapes, e.g. as collections of repeated and unique clips? Maybe not, but there’s more to the story.

One of the common sources for these mixtapes is movies from IFD and Filmark. Those “studios” are infamous for, you guessed it, stealing clips from other movies, usually redubbed from Hong Kong or the Philippines, to use as part of their movies.

So what if you also had video files of those original Hong Kong / Filipino movies? Wouldn’t these workflows find common clips between those originals and IFD? Those clips could then be renamed to reflect their common origin.

ern2150 commented 3 years ago

All of this so far has focused on the visual similarities between videos and completely ignored audio/dialog/music without being explicit about that.

Should the two be considered the same? In the last section, about movies being redubbed and clipped to form parts of IFD movies, you wouldn’t get a 100% match if you strictly required both the video and the audio to be the same. The art of the stream Intros that open a broadcast is that, much of the time, they are deliberate “mismatches” of a visual clip from a mixtape with a song that wasn’t there originally. Sometimes those songs are actually from a different common clip!

Let’s then assume we want to treat them separately, but with an equal amount of detail and focus. What would change about the information we’re collecting?

ern2150 commented 3 years ago

Here are the (updated from prior ramblings) video fields:

`Nickname`, `Origin file`, `fingerprint id`, `file count`, `video file`, `start time`, `end time`

We’d need audio fields:

`Origin file`, `fingerprint id`, `audio file`, `start time`, `end time`

Doesn’t seem that different. While you might be able to tell, at a glance, the difference between a video file and an audio file based on the file extension, the Origin, or the format of the fingerprint id, it would probably be helpful to have a separate field.

`Origin file`, `fingerprint id`, `media`, `file`, `start time`, `end time`

This new “media” field leaves room for future expansion, such as transcripts of dialog, optical character recognition of on-screen text or subtitles, actual separate subtitle tracks, or even chat text / commentary. That’s getting quite a bit out of scope of this discussion, however.
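
As a sketch, the updated record might look something like this (the names are illustrative, and the media field could later grow values like “transcript” or “subtitle”):

```python
# Sketch of the updated record with the new "media" field. Names are
# illustrative; future media values might include "transcript" or "subtitle".
from dataclasses import dataclass

@dataclass
class FingerprintRow:
    origin_file: str     # e.g. "BERL.mp4" or a song file for audio
    fingerprint_id: str  # 32-character hex string
    media: str           # "video" or "audio" for now
    file: str            # the mixtape/broadcast the match was found in
    start_time: str      # "HH:MM:SS"
    end_time: str        # "HH:MM:SS"
```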

ern2150 commented 3 years ago

So what would you expect to see in the new collection format, typically?

| Origin file | fingerprint id | media | file | start time | end time |
| --- | --- | --- | --- | --- | --- |
| BERL.mp4 | a1b2c3d4... | video | mix14.mp4 | 01:30:00 | 01:32:30 |
| VicSepanski-Starglide.mp3 | 1a2b3c4d... | audio | mix14.mp4 | 01:30:00 | 01:32:30 |
| BERL.mp4 | a1b2c3d4... | video | mix64.mp4 | 00:56:30 | 00:59:00 |
| ... | ... | ... | ... | ... | ... |
| FUN.mp4 | f9e8d7c6... | video | mix32.mp4 | 00:05:45 | 00:05:52 |
| FUN.mp3 | 9f8e7d6c... | audio | | | |
| ... | ... | ... | ... | ... | ... |
| ? | a5f0b4e9... | video | mix66.mp4 | 00:03:15 | 00:04:20 |
| ? | 5a0f4b9e... | audio | | | |

So if prior matches have produced an audio clip, and you recognize the song and rename the clip to reflect the song it represents, then that name becomes the origin for that fingerprint, the same way it does for the video clips. This means the same segment of a mixtape would show up twice in this collection, once for the video and once for the audio, and the two origins would look different. That would still be true even if you didn't recognize the song, but then the name of the origin file would probably look more like the video one.

The workflows described above would be the same (though likely with different fingerprinting mechanisms underneath), and you would just see “doubles” of what you may have expected otherwise.

Once you’ve identified both video and audio for a given segment, you might sort or filter for that particular song and be surprised how many other times it shows up, mismatched from the video you’d expect! Or it might be the same Tangerine Dream song you’d totally expect to hear everywhere.

ern2150 commented 3 years ago

Quick bit of statistics on the Origin front. Right now the Codex has identified 370 movies that have been featured in at least one mixtape. This list is not exhaustive. Across various playlists, there are at least that many songs identified, and probably slightly more. These lists are also not exhaustive.

Both, however, represent links to media that can (mostly) be downloaded, and therefore fingerprinted, and therefore matched. It would take a fair bit of storage, of course, and it represents close to a month of total running time.

This is mentioned as a reason to consider yet another division in the simplicity of the collection, or at least in the views/reports you present to the consumer/user.

It might be worth using that media field, or even a separate field, to denote “type”, “kind”, or maybe even “presentation”. For example, nothing beyond the filename currently distinguishes between a mixtape, a broadcast, a clip, a movie, or a song.

If we go back to that early “oh, cool” statement above, yes, you could filter by a specific mixtape filename, and then sort by start time. But maybe you’re wanting to see how common a video clip is across all mixtapes and Intros, and don’t really care that it came from an Indonesian film from 1971 (and wouldn’t want that in your aggregate count). This field (or a mutation of the media field) would help that filter.
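
A sketch of that filter, using a hypothetical `presentation` field so movie/song origins stay out of the “how common across mixtapes?” count:

```python
# Sketch: a hypothetical "presentation" field keeps movie/song origins out of
# a "how common is this clip across mixtapes and broadcasts?" count.
from collections import defaultdict

def mixtape_counts(rows, wanted=("mixtape", "broadcast")):
    matches = defaultdict(set)
    for row in rows:
        if row.get("presentation") in wanted:
            matches[row["fingerprint_id"]].add(row["file"])
    return {fp: len(files) for fp, files in matches.items()}
```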

That information could still be in the same large collection, just filtered out of most views. If instead you chose to store it separately, you could make a reference to it using the Origin field. You could also then store the Origin Clips separately, and each kind of information could have its own metadata. A mixtape is more like an episode of a TV show than it is a movie, even though all three have debuts, lengths, and broadcast dates. A mixtape doesn’t really have actors the same way those other two do, but it definitely has an editor. This kind of metadata is quite a bit out of scope for now, but separating the presentation types’ collections is worth considering.

ern2150 commented 3 years ago

So a script would need to, each time, generally:

- fingerprint every new video (and audio) file in the directory
- convert the tool's frame numbers into human-readable timecodes
- compare the new fingerprints against the main collection and group adjacent matches into clip-length sequences
- merge the results, carrying over existing Nicknames and flagging new common clips
- generate playable clips (and automated Nicknames) for anything newly matched
- produce a short, easy-to-read report of the updated collection

If, for example, the reporting interface is something interactive like a web page, it may be possible to skip the clip generation altogether and simply embed the source video, starting at the specified start time.
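
A sketch of that kind of report, writing plain HTML with media-fragment URLs (`#t=start,end`) so the source video starts at the matched time; the match fields here are assumptions, and player support for fragments varies:

```python
# Sketch: an interactive report that skips clip generation by embedding the
# source video with a media-fragment URL (#t=start,end). The match fields are
# assumptions, and browser support for fragments varies.
def write_report(matches, out_path="report.html"):
    sections = []
    for m in matches:  # each m: nickname, file, start_s, end_s (in seconds)
        sections.append(
            f"<h3>{m['nickname']} in {m['file']}</h3>\n"
            f"<video controls preload=\"none\" "
            f"src=\"{m['file']}#t={m['start_s']},{m['end_s']}\"></video>"
        )
    with open(out_path, "w") as f:
        f.write("<html><body>\n" + "\n".join(sections) + "\n</body></html>")
```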

ern2150 commented 3 years ago

Above, when referring to "clips", an assumption has been made about the "wholeness" or "completeness" of a match between two video files.

The assumption is that a clip can be of nearly any length. If for some reason you assumed, for example, ALL SEVEN MINUTES of Adelic Penguins was a single "clip", and it showed up in more than one mixtape at that length, you'd be right, crazy as it sounds.

It is of course more normal for a clip to be less than a minute (BERL tends to be), but there are segments from movies that are much longer and still feel like a "clip" instead of ... whatever a longer definition would be :)

You could try to define a clip as only being one "scene". Movies are made of scenes, such as a character who is sitting indoors expressing a desire to go to the zoo, and then in the next scene we see them at the zoo. Scenes are sometimes confused for sequences, or a series of scenes that make up a broader idea, like going to the zoo. So in the case where, in several mixtapes, Mickey and Co serve as a palate cleanser, would saying "let's go to the zoo" and then right after being at the zoo (and mocking the hippos) be two separate "clips"? You wouldn't think so.

So could you define a clip as a sequence, and can sequences contain only one scene? Sure! Are two video logos that play one right after the other on the VHS copy of a particular movie a "sequence" of two "scenes"? Sure! What if it's more common to just see one of those two, isolated, and only one mixtape has them both... do you then abandon the sequence idea and just call them both individual "clips"? Sure!

Can one "clip" be a subset of another? For example, in the commercial about pool tables where the lady spins around and says "Oh!", it's fairly common to just show her reaction (as a reaction to the clip that just happened from a completely unrelated thing). It has, however, been played as (most of) the commercial itself in several mixtapes. Would you call those two separate "clips", but understand that one is inside the other? Sure!

ern2150 commented 3 years ago

So how would "fingerprints" be different from "clips"?

Fingerprints are the things that uniquely identify a defined length of video, and they work best when that length is consistent.

Therefore a "clip" could, if the defined length is exactly the same, be the same as a "fingerprint", but this has proven to give somewhat confusing results (see https://github.com/ern2150/FVCR/issues/14#issuecomment-860060914).

It's more likely that matching clips are a sequence of very similar, if not identical, fingerprint matches; this depends on the tool that does the matching, of course. Matching individual fingerprints within certain thresholds seems to be the main goal for most of the tools researched thus far.

So to group a sequence of matching fingerprints as a "clip" might take a little bit of brains. I'll not go too deep into implementation details here, but basically if you find a fingerprint match in two video files, and then you find another one in the same two video files, and the end time of one is right next to the start time of the other in both files, you've found a sequence. You might want to define how many of those matching fingerprints you consider the minimum to be considered a sequence, and you may even want to define (or provide as a parameter to the user) a maximum -- eg if there are more than an hour's worth of matches, did you just find an identical file?
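
A rough sketch of that grouping, assuming each match is a pair of start times (in seconds) for the same fingerprint found in files A and B, and that "adjacent" means exact fingerprint-length steps (a real tool would want some tolerance):

```python
# Rough sketch of grouping matches into a sequence. Each match is a pair of
# start times (in seconds) for the same-length fingerprint found in files A
# and B; "adjacent" here means exact fingerprint-length steps, though a real
# tool would want some tolerance.
def chain_matches(matches, step=2, min_count=2, max_seconds=3600):
    matches = sorted(matches)
    if not matches:
        return []
    sequences, current = [], [matches[0]]
    for prev, cur in zip(matches, matches[1:]):
        if cur[0] - prev[0] == step and cur[1] - prev[1] == step:
            current.append(cur)
        else:
            sequences.append(current)
            current = [cur]
    sequences.append(current)
    # Apply the minimum count and the "did I just find an identical file?" cap.
    return [s for s in sequences
            if len(s) >= min_count and len(s) * step <= max_seconds]
```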

ern2150 commented 3 years ago

So to further belabor the point, let’s assume a “fingerprint” is only two seconds long at most. Some “clips” are only two seconds long, like a title card saying TWO YEARS LATER HONG KONG, and therefore that clip only contains one “fingerprint”.

Most clips, however, are longer, and are therefore going to contain a chain or ordered sequence of multiple fingerprints.

Sometimes a chain of fingerprints is made up of other chains of fingerprints, like in the billiard ad / OH lady example. This could be handled in a different way, where the larger chain is broken up into three chains: one before the smaller one, the smaller one, and one after.

You may have noticed this exposes a lack of detail in the metadata. If the Origin is the whole billiard ad, how do I tell the difference between the beginning, middle, and end?

ern2150 commented 3 years ago

You could simplify those new wrinkles by having oddly detailed Origin filenames and more rows:

| Origin file | fingerprint id | media | file | start time | end time |
| --- | --- | --- | --- | --- | --- |
| Billiard-PreOH-2s.mp4 | a2b2c3d4… | video | mixtape 11 | 0:00:33 | 0:00:35 |
| Billiard-PreOH-4s.mp4 | a2b2c3d5… | video | mixtape 11 | 0:00:35 | 0:00:37 |
| Billiard-PreOH-6s.mp4 | a2b2c3d6… | video | mixtape 11 | 0:00:37 | 0:00:39 |
| Billiard-OH-2s.mp4 | b2c3d2e2… | video | mixtape 11 | 0:00:43 | 0:00:45 |
| Billiard-PostOH-2s.mp4 | b2c3d2e3… | video | mixtape 11 | 0:00:45 | 0:00:47 |
| Billiard-PostOH-4s.mp4 | b2c3d2e4… | video | mixtape 11 | 0:00:47 | 0:00:49 |

…but that kinda looks silly. We’re also trying to solve two problems at once while making everything harder to read.

ern2150 commented 3 years ago

What are we looking for, high-level? Might be helpful to remind ourselves.

We want to find common clips across multiple mixtapes. So no matter how much detail could be in the data, that’s a fairly simple two-column, one-to-many, highly structured data set. Even though we may ultimately care where that clip is “from”, that’s not important in the moment of just knowing what is common.

| Clip | Match |
| --- | --- |
| OH | Mixtape 8 |
| OH | Mixtape 11 |
| OH | Mixtape MM6 |
| Billiard Hall before OH | Mixtape 8 |
| Billiard Hall before OH | Mixtape MM6 |
| Billiard Hall after OH | Mixtape 8 |
| Billiard Hall after OH | Mixtape MM6 |
| BERL | Mixtape 6 |
| ... | ... |
| BERL | Mixtape 66 |
| BERL, interrupted | Mixtape 9 |
| BERL, interrupted | Mixtape 23 |
| BERL, resumed | Mixtape 9 |
| BERL, resumed | Mixtape 23 |

Simple enough, right? There are some repeated bits, and what this of course leaves out is how you came to title those bits, even though the same idea as above would apply - you would need to do so manually at some point.

ern2150 commented 3 years ago

So what's the next most relevant data? Timecodes. Not just one set, as defined thus far, but two sets for every match. It's silly this hasn't come up so far. Let's see what this would look like when bolted on to this new simple table:

| Clip | Clip start | Match | Match start | Match duration |
| --- | --- | --- | --- | --- |
| OH | 0:00:00 | Mixtape 8 | 0:00:43 | 2s |
| OH | 0:00:00 | Mixtape 11 | 0:00:43 | 2s |
| OH | 0:00:00 | Mixtape MM6 | 0:28:02 | 2s |
| Billiard Hall before OH | 0:00:00 | Mixtape 8 | 0:00:33 | 10s |
| Billiard Hall before OH | 0:00:05 | Mixtape MM6 | 0:27:57 | 5s |
| Billiard Hall after OH | 0:00:00 | Mixtape 8 | 0:00:45 | 5s |
| Billiard Hall after OH | 0:00:00 | Mixtape MM6 | 0:28:04 | 10s |
| BERL | 0:00:00 | Mixtape 6 | 1:31:05 | 45s |
| ... | ... | ... | ... | ... |
| BERL | 0:00:00 | Mixtape 66 | 55:15 | 45s |
| BERL, interrupted | 0:00:00 | Mixtape 9 | 1:02:56 | 10s |
| BERL, interrupted | 0:00:00 | Mixtape 23 | 1:12:13 | 5s |
| BERL, resumed | 0:00:00 | Mixtape 9 | 1:15:00 | 35s |
| BERL, resumed | 0:00:00 | Mixtape 23 | 1:15:28 | 40s |

Hm. That seems inefficient, let's try it this way:

| Clip | Clip start | Match | Match start | Match duration |
| --- | --- | --- | --- | --- |
| Billiard Hall with OH | 0:00:10 | Mixtape 8 | 0:00:43 | 2s |
| Billiard Hall with OH | 0:00:10 | Mixtape 11 | 0:00:43 | 2s |
| Billiard Hall with OH | 0:00:10 | Mixtape MM6 | 0:28:02 | 2s |
| Billiard Hall with OH | 0:00:00 | Mixtape 8 | 0:00:33 | 10s |
| Billiard Hall with OH | 0:00:05 | Mixtape MM6 | 0:27:57 | 5s |
| Billiard Hall with OH | 0:00:12 | Mixtape 8 | 0:00:45 | 5s |
| Billiard Hall with OH | 0:00:12 | Mixtape MM6 | 0:28:04 | 10s |
| BERL | 0:00:00 | Mixtape 6 | 1:31:05 | 45s |
| ... | ... | ... | ... | ... |
| BERL | 0:00:00 | Mixtape 66 | 55:15 | 45s |
| BERL | 0:00:00 | Mixtape 9 | 1:02:56 | 10s |
| BERL | 0:00:05 | Mixtape 23 | 1:12:13 | 5s |
| BERL | 0:00:10 | Mixtape 9 | 1:15:00 | 35s |
| BERL | 0:00:05 | Mixtape 23 | 1:15:28 | 40s |

Nearly there. You could argue sorting that would make it clearer what's going on:

| Clip | Clip start | Match | Match start | Match duration |
| --- | --- | --- | --- | --- |
| Billiard Hall with OH | 0:00:00 | Mixtape 8 | 0:00:33 | 10s |
| Billiard Hall with OH | 0:00:05 | Mixtape MM6 | 0:27:57 | 5s |
| Billiard Hall with OH | 0:00:10 | Mixtape 8 | 0:00:43 | 2s |
| Billiard Hall with OH | 0:00:10 | Mixtape 11 | 0:00:43 | 2s |
| Billiard Hall with OH | 0:00:10 | Mixtape MM6 | 0:28:02 | 2s |
| Billiard Hall with OH | 0:00:12 | Mixtape 8 | 0:00:45 | 5s |
| Billiard Hall with OH | 0:00:12 | Mixtape MM6 | 0:28:04 | 10s |
| BERL | 0:00:00 | Mixtape 6 | 1:31:05 | 45s |
| ... | ... | ... | ... | ... |
| BERL | 0:00:00 | Mixtape 66 | 55:15 | 45s |
| BERL | 0:00:00 | Mixtape 9 | 1:02:56 | 10s |
| BERL | 0:00:05 | Mixtape 23 | 1:12:13 | 5s |
| BERL | 0:00:05 | Mixtape 23 | 1:15:28 | 40s |
| BERL | 0:00:10 | Mixtape 9 | 1:15:00 | 35s |

Either way you sort it, this way the fields present are much more likely to be different row by row, with one exception. Can we make it even more efficient now that we don't have as many unnecessarily separately identified "clips"? Yep:

| Clip | Clip start | Match | Match start | Match duration |
| --- | --- | --- | --- | --- |
| Billiard Hall with OH | 0:00:00 | Mixtape 8 | 0:00:33 | 17s |
| Billiard Hall with OH | 0:00:05 | Mixtape MM6 | 0:27:57 | 17s |
| Billiard Hall with OH | 0:00:10 | Mixtape 11 | 0:00:43 | 2s |
| BERL | 0:00:00 | Mixtape 6 | 1:31:05 | 45s |
| ... | ... | ... | ... | ... |
| BERL | 0:00:00 | Mixtape 66 | 55:15 | 45s |
| BERL | 0:00:00 | Mixtape 9 | 1:02:56 | 10s |
| BERL | 0:00:10 | Mixtape 9 | 1:15:00 | 35s |
| BERL | 0:00:05 | Mixtape 23 | 1:12:13 | 5s |
| BERL | 0:00:05 | Mixtape 23 | 1:15:28 | 40s |

Since the Billiard clips form a coherent sequence in Mixtapes 8 and MM6, we don't need separate rows, just a start time in the clip and a duration time in the match.

Since the BERL clips are, by nature, interrupted in Mixtapes 9 and 23, they'd keep separate rows, as there's stuff (maybe even literally) in between each separate BERL clip start time in each mixtape.

ern2150 commented 3 years ago

Simplifying like this defers the idea of an Origin and Fingerprint Sequence for each Clip to its own data structures.

The Origin concept, as discussed earlier, might be as simple as adding more rows, and another column/field with some descriptive tagging.

The Clip / Fingerprint Sequence details, though, are probably best left to a separate data structure, and not necessarily one that you would need to see that often. It would still be necessary to aid in the automation of identifying those sequences. It could also serve as a jumping-off point to yet another structure about which files contained those fingerprints, and what tools did the fingerprinting.

In any case, when a tool scans a new file, as stated above, it should work with this Fingerprint Sequence data to see if it's identified a known single fingerprint, an extension to a known sequence of fingerprints, or a new fingerprint and sequence altogether. This new information then needs to update the main Clip / Match data, perhaps (as stated earlier) with a placeholder Clip name if completely unknown, or an automated modification (adding text to the start or end of the name).
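
To close, one way to picture those separate structures and the known / extension / new check, with purely illustrative names:

```python
# One way to picture the separate structures described above; the names are
# purely illustrative, not a committed schema.
from dataclasses import dataclass, field

@dataclass
class ClipMatch:                 # the main, human-facing table
    clip: str                    # Nickname / clip filename
    clip_start: str              # offset within the clip, "H:MM:SS"
    match: str                   # mixtape or broadcast name
    match_start: str             # "H:MM:SS"
    match_duration_s: int        # seconds

@dataclass
class FingerprintSequence:       # detail structure the tooling works with
    clip: str                    # which Clip this sequence belongs to
    fingerprint_ids: list[str] = field(default_factory=list)

def classify(new_ids, sequences):
    """Known sequence, extension of a known one, or brand new?"""
    for seq in sequences:
        if new_ids == seq.fingerprint_ids:
            return "known", seq.clip
        if set(seq.fingerprint_ids) & set(new_ids):
            return "extension", seq.clip
    return "new", None
```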