AddictedCS / soundfingerprinting

Open source audio fingerprinting in .NET. An efficient algorithm for acoustic fingerprinting written purely in C#.
https://emysound.com
MIT License
937 stars 188 forks

[Information Request] Best way to approach the inaccuracy of the matched result time? #209

Closed jack4455667788 closed 4 months ago

jack4455667788 commented 1 year ago

Related : #196 Sound fingerprint match always a few seconds/milliseconds too early majority of the time

In the above issue, you responded

@sezonis The algorithm has a certain level of matched accuracy which cannot be improved due to various non-trivial reasons. A worst-case scenario (under ideal conditions of having the same source/target track as the query) can yield about 500ms of misalignment.

Is the algorithm created/tuned to prioritize efficient/accurate matching at the expense of time range accuracy?

I guess I'm hoping you might be able to help me understand the non-trivial reasons so that I might be able to decrease (or ideally remove) that misalignment - and/or any other possible approaches to the problem.

I don't mind if it is hideously inefficient/slow, but I need/want better precision than this. I have tried most everything I can think of to tune/configure this problem away, but the somewhat random inaccuracy persists.

In my case the sounds may be so similar that I might be able to do raw bitstream compares on them (I'm thinking to shift them around the given match timerange until they match up more or less exactly)... Can I somehow "shift" the fingerprints/fingerprinting to do something analogous using soundfingerprinting?

Thanks in any case!
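The "shift them around until they match" idea above can be sketched outside the library. The following is a minimal brute-force refinement, assuming both recordings are already decoded to mono float samples at the same sample rate and that the fingerprint match supplies an approximate offset. `AlignmentRefiner` and `FindBestOffset` are hypothetical names, not part of SoundFingerprinting, and normalized cross-correlation is a swapped-in technique rather than anything the library is known to do.

```csharp
using System;

public static class AlignmentRefiner
{
    // Returns the offset (in samples) inside [approxOffset - searchRadius, approxOffset + searchRadius]
    // at which `query` best lines up with `reference`, scored by normalized cross-correlation.
    // Returns -1 when no candidate offset fits (e.g. query is longer than reference).
    // Brute force: cost is O(searchRadius * query.Length).
    public static int FindBestOffset(float[] reference, float[] query, int approxOffset, int searchRadius)
    {
        int bestOffset = -1;
        double bestScore = double.NegativeInfinity;

        // Clamp the search window so the query never runs past either end of the reference.
        int first = Math.Max(0, approxOffset - searchRadius);
        int last = Math.Min(reference.Length - query.Length, approxOffset + searchRadius);

        for (int offset = first; offset <= last; offset++)
        {
            double dot = 0, referenceEnergy = 0, queryEnergy = 0;
            for (int i = 0; i < query.Length; i++)
            {
                double r = reference[offset + i];
                double q = query[i];
                dot += r * q;
                referenceEnergy += r * r;
                queryEnergy += q * q;
            }

            // Normalizing by the energies makes the score independent of volume differences.
            double denominator = Math.Sqrt(referenceEnergy * queryEnergy);
            double score = denominator > 0 ? dot / denominator : 0;

            if (score > bestScore)
            {
                bestScore = score;
                bestOffset = offset;
            }
        }

        return bestOffset;
    }
}
```

With a search radius of about 500 ms worth of samples around the reported match time, this would return the sample offset with the highest correlation. It only helps when the two signals are near-identical copies; heavy re-encoding or mixing can flatten the correlation peak.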

AddictedCS commented 1 year ago

What use case are you trying to solve that requires a time-location accuracy of more than 500 ms?

AddictedCS commented 1 year ago

The referenced commit 762a9e2 fixes issue #207; adding a comment to remove confusion.

jack4455667788 commented 1 year ago

What use case are you trying to solve that requires a time-location accuracy of more than 500 ms?

I am chasing the (admittedly somewhat unreasonable) dream of removing ads / intros / outros / recaps / undesirable repetitive content completely, while simultaneously retaining all desired content, automatically/programmatically with no user input beyond a folder full of similar content and 2 files to fingerprint.

It would be wonderful if this could be done, and I don't really understand why it can't.

Assuming this was a "brick wall" I have already implemented a manual user interface to scan fwd/back through the video/audio frames (around the query result match suggested time) so the user can specify the precise points where the undesirable content begins and ends - but it would be much preferred if that were not necessary.

sezonis commented 1 year ago

What use case are you trying to solve that requires a time-location accuracy of more than 500 ms?

I am chasing the (admittedly somewhat unreasonable) dream of removing ads / intros / outros / recaps / undesirable repetitive content completely, while simultaneously retaining all desired content, automatically/programmatically with no user input beyond a folder full of similar content and 2 files to fingerprint.

This is for youtube right? Yea, that was pretty much what I was using it for. I ended up just using image recognition along with this library to get it "more accurate". Sure, it uses more resources but it gets the job done.

As for accuracy, I think it's very complicated to get it "pixel perfect" (like near 0ms) because I'd have to assume that the sound waves are always "similar" in some sense, so the algorithm has to make sure it's a match. This library isn't the only library with this issue as well, there are a few others and I don't think it's a simple solution to get it always accurate. Like, it was able to get the audio for me at 0ms with my tricks, by giving it a slightly earlier sound and it matched it at least with a 50ms delay, instead of 500ish. But of course, 1 small change of sound (even if it's so tiny only a program can see it) the delay goes back up. If you want a real foolproof solution, then I suggest you use this library in conjunction with something else, to ensure it is accurate.

jack4455667788 commented 1 year ago

This is for youtube right?

My hope is that it might be for most everything that has repeating content to be removed.

I ended up just using image recognition along with this library to get it "more accurate". Sure, it uses more resources but it gets the job done.

I was attempting to do that with this library (as it does video fingerprinting as well) but I didn't get very far (it didn't recognize the common content across the two sub-clips AND it appeared to be fingerprinting the entire video instead of just the segment specified by startsAtSecond and secondsToProcess). The real trouble is that without first knowing, precisely, where the first frame of the content to remove is, image recognition doesn't really help. My hope is for all this analysis and comparison to be done programmatically and require no user input.

Out of curiosity, what did you end up using for the image recognition? And was it able to compare two different video streams and find the common frames between them (ideally with frame perfect accuracy)?

This library isn't the only library with this issue as well

I've noticed! I've only tried a few others so far, but they have the same inaccuracy issue.

Like, it was able to get the audio for me at 0ms with my tricks, by giving it a slightly earlier sound and it matched it at least with a 50ms delay, instead of 500ish.

I'm doing something similar, and often the delay isn't so bad - but I want it to be 0. If it isn't the hashing algorithm itself, I think the random stride may be involved in the inconsistent results - but I am hoping to understand the problem better in any case.

If you want a real foolproof solution, then I suggest you use this library in conjunction with something else, to ensure it is accurate.

Thanks for the tips! I'm open to any suggestions you might have regarding the "something else"!

nicko88 commented 1 year ago

FWIW I try to use this software for timing accuracy in order to sync external systems with media playback (using real-time matching). The more accurate the timing, the better in my case.

AddictedCS commented 1 year ago

startsAtSecond and secondsToProcess

@jack4455667788 the issue #207 had a broader effect, and in case you were using these parameters in conjunction with MediaType.Audio | MediaType.Video during fingerprinting or query I suggest you upgrade to v8.24.0 where it was fixed, and see if you get better results.

It would be wonderful if this could be done, and I don't really understand why it can't.

Intuitively, you can think of it as a discretization problem: the challenge of transforming a signal (audio in this case) into a set of discrete fingerprints that approximate it. Each fingerprint has a fixed resolution (i.e., 128x32) and approximates about 1.48 seconds of audio signal. These fingerprints are generated using a certain stride, a step between consecutive fingerprints. By default, the stride is 512 samples during fingerprinting (about 92 ms) and a random value in the range [256, 512] samples during query (about 46 to 92 ms); these values are defined in the Configs class.
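As a rough check on those figures, the sketch below converts lengths in samples to milliseconds. The 5512 Hz sample rate is an assumption inferred from the numbers quoted here (512 samples in roughly 92 ms), and `StrideMath` is a hypothetical helper, not a library class.

```csharp
using System;

public static class StrideMath
{
    // Assumption: audio is downsampled to 5512 Hz before fingerprinting.
    public const int SampleRate = 5512;

    // Converts a length in samples to milliseconds at the assumed sample rate.
    public static double SamplesToMilliseconds(int samples)
    {
        return samples * 1000.0 / SampleRate;
    }
}
```

Under that assumption, `StrideMath.SamplesToMilliseconds(512)` is about 92.9 ms, 256 samples is about 46.4 ms, and a single sample is about 0.18 ms, so the stride alone can account for tens of milliseconds of uncertainty in where a query fingerprint lands relative to a stored one.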

You can decrease the stride between consecutive fingerprints during fingerprinting (say to an extreme case of 1 sample) to increase the chances of having a perfectly aligned fingerprint during query time, but this will substantially increase the footprint of your model service that stores these fingerprints (generating 512x more fingerprints):

var hashes = await FingerprintCommandBuilder.Instance
    .BuildFingerprintCommand()
    .From(pathToFile)
    .WithFingerprintConfig(cfg =>
    {
        // a stride of 1 means a new fingerprint is created at every sample (~0.18 ms step)
        cfg.Audio.Stride = new IncrementalStaticStride(incrementBy: 1);
        return cfg;
    })
    .UsingServices(audioService)
    .Hash();

This is still not a good solution because even with perfectly aligned signals, you can have distortions generated by encoding/aliasing that will prevent perfect matches. The default values have been empirically defined to maximize recall and precision while minimizing the audio signal's footprint.
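To put the footprint trade-off in numbers, here is a small estimate of how many fingerprints a track produces for a given stride. It assumes a 5512 Hz sample rate, 8192 samples per fingerprint (about 1.48 s), and a stride measured between the starts of consecutive fingerprints. All three are assumptions for illustration, and `FingerprintFootprint` is a hypothetical helper.

```csharp
using System;

public static class FingerprintFootprint
{
    // Assumptions for illustration, not values read from the library.
    public const int SampleRate = 5512;
    public const int SamplesPerFingerprint = 8192; // roughly 1.48 s at 5512 Hz

    // Number of full fingerprint windows that fit in a track when each window
    // starts `strideSamples` after the previous one.
    public static long Count(long totalSamples, int strideSamples)
    {
        if (strideSamples <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(strideSamples));
        }

        if (totalSamples < SamplesPerFingerprint)
        {
            return 0; // track is shorter than a single fingerprint
        }

        return (totalSamples - SamplesPerFingerprint) / strideSamples + 1;
    }
}
```

For a 3-minute track (180 × 5512 = 992,160 samples), this model gives 1,922 fingerprints at a stride of 512 and 983,969 at a stride of 1, which is roughly 512 times as many.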

Now to the problem of cutting the ads to the precise frame. SoundFingerprinting.Emy contains a strategy that can help you with your problem. There is an experimental class named EdgeSearchStrategy which looks for edges in a video file.

How it works: once you identify a match, you can run a second analysis over the video looking for edges (i.e., black frames and scene changes) around the area where you expect the content to have started/ended. This implies you need access to the matched content (for example, if you are matching over streaming content, you need to generate a file from the streaming match that covers the area where the match happened).


var StartEndEdgeSearchLocationDelta = 3;

// audio object of type QueryResult
var optimalLength = audio?.BestMatch.Track.Length;

// this file has to cover the area of the audio.BestMatch
// also it is recommended to extend the area of the match by StartEndEdgeSearchLocationDelta
// as an example, if your match happened at 09:30:00 till 09:30:30 (hh:mm:ss), then extend the area of the analyzed content by 3 seconds at start/end location 09:29:57 till 09:30:33 (totally extending the match by 6 seconds)
var extendedMediaFile = "path to streaming content that matched";

var edgeSearchStrategy = new EdgeSearchStrategy(new NLogLoggerFactory());
var edgeSearchConfig = new EdgeSearchConfig(
    new BlackFramesFilterConfiguration { Threshold = 32, Amount = 94 },
    SceneChangeThreshold: 0.4,
    OptimalLength: optimalLength,
    StartsAtHint: StartEndEdgeSearchLocationDelta,
    EndsAtHint: StartEndEdgeSearchLocationDelta + optimalLength);
var mediaSegment = edgeSearchStrategy.FindMediaSegmentClosestToOptimalLength(extendedMediaFile, edgeSearchConfig);

if (mediaSegment != null)
{
    // better edges have been found
}
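The window arithmetic described in the comments above can be sketched as follows. `MatchWindow` is a hypothetical helper that uses only `System.TimeSpan`. It assumes the hints are offsets in seconds from the start of the extended file, which is one reading of the snippet above rather than documented behavior.

```csharp
using System;

public static class MatchWindow
{
    // Widens [matchStart, matchEnd] by deltaSeconds on both sides, clamping the start at zero.
    public static (TimeSpan Start, TimeSpan End) Extend(TimeSpan matchStart, TimeSpan matchEnd, double deltaSeconds)
    {
        var delta = TimeSpan.FromSeconds(deltaSeconds);
        var start = matchStart - delta;
        if (start < TimeSpan.Zero)
        {
            start = TimeSpan.Zero;
        }

        return (start, matchEnd + delta);
    }

    // Expresses the original match boundaries as second offsets from the start of the extended file
    // (assumed to be what StartsAtHint and EndsAtHint expect).
    public static (double StartsAtHint, double EndsAtHint) HintsFor(TimeSpan matchStart, TimeSpan matchEnd, TimeSpan extendedStart)
    {
        return ((matchStart - extendedStart).TotalSeconds, (matchEnd - extendedStart).TotalSeconds);
    }
}
```

For the example in the comments, a match from 09:30:00 to 09:30:30 with a delta of 3 seconds extends to 09:29:57 through 09:30:33, and the hints come out as 3 and 33 seconds. That agrees with the snippet's StartsAtHint and EndsAtHint when the matched track is 30 seconds long.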

Keep in mind this is an experimental API, and you need FFmpeg installed to use it: https://github.com/AddictedCS/soundfingerprinting/wiki/Audio-Services.

Let me know if any of the above helped.

AddictedCS commented 1 year ago

Hey @jack4455667788 did anything from the above message help in solving your issue?

AddictedCS commented 4 months ago

Closing due to inactivity.