dpwe / audfprint

Landmark-based audio fingerprinting
MIT License

Finding any repetitive content in large audio file #63

Closed. bharat-patidar closed this issue 5 years ago.

bharat-patidar commented 5 years ago

Hi, your work is awesome. I just wanted to know whether we can use this code to find repetitive content in a large audio file (say 7 hours). Is there any way I can match the file against itself to get the portions that are repeated? If it is possible, it would be great if you could guide me through the changes that are required.

Thank you!!

dpwe commented 5 years ago

Here's one thing you could do: split the long recording into fixed-length segments with 50% overlap, build a fingerprint database from all of the segments, then match each segment against that database. Every segment will trivially match itself (and its overlapping neighbours), so you'll need to use --max-matches 5 or more to be able to see beyond those degenerate matches.
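A rough sketch of that workflow (the segment length, file names, and ffmpeg-based splitting here are illustrative assumptions; the audfprint invocation follows the usual new/match usage):

```python
# Sketch: cut a long file into 50%-overlapped segments with ffmpeg, build an
# audfprint database over the segments, then query every segment against it.
# Segment length, file names, and paths are illustrative assumptions.
import glob
import os
import subprocess

INPUT = "long_recording.wav"   # the long (e.g. 7-hour) file
SEG_SEC = 60                   # segment length; longer than the repeats you expect
HOP_SEC = SEG_SEC // 2         # 50% overlap
OUTDIR = "segments"
DBASE = "segments.pklz"

os.makedirs(OUTDIR, exist_ok=True)

# Total duration of the input, via ffprobe.
dur = float(subprocess.check_output(
    ["ffprobe", "-v", "error", "-show_entries", "format=duration",
     "-of", "default=noprint_wrappers=1:nokey=1", INPUT], text=True))

# Cut overlapped segments.
for i, start in enumerate(range(0, int(dur), HOP_SEC)):
    seg = os.path.join(OUTDIR, f"seg{i:05d}.wav")
    subprocess.run(["ffmpeg", "-y", "-loglevel", "error", "-i", INPUT,
                    "-ss", str(start), "-t", str(SEG_SEC), seg], check=True)

segs = sorted(glob.glob(os.path.join(OUTDIR, "*.wav")))

# Build a fingerprint database from all the segments (assumes audfprint.py is
# in the current directory), then match each segment against it.  Every
# segment trivially matches itself, so report several hits per query.
subprocess.run(["python", "audfprint.py", "new", "--dbase", DBASE] + segs,
               check=True)
subprocess.run(["python", "audfprint.py", "match", "--dbase", DBASE,
                "--max-matches", "5"] + segs, check=True)
```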

DAn.


bharat-patidar commented 5 years ago

Okay, I will try this method. Thanks for the help.

brendon-wong commented 4 years ago

@dpwe What's the rationale for splitting the recording into overlapping segments?

dpwe commented 4 years ago

Matching a very long recording is inefficient because a long recording will include nearly every hash somewhere, so the first-pass pruning by common matches won't do much, and the list of matching hashes that has to be sorted by time difference will be very large.

Typically, you are interested in knowing roughly where a match occurs; if you break the material up into shorter segments, then you get some of that information just from knowing which segment matched. But if you have no idea where the targets are going to occur, there's a chance that arbitrary chopping will cut through the middle of a matching region, making it less likely to find the match at all (since, in the worst case, each of the two resulting halves contains only half the matching duration).

However, with 50%-overlapped segments, there's always a segment centered over the split point, so if a split falls inside a match region, the overlapped segment will have the match squarely in the middle, giving the best chance of matching.

So, segments should be longer than the excerpts you expect to match.
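A tiny sketch of that argument (the segment length, hop, and excerpt position below are just example numbers):

```python
# With 60 s segments hopped by 30 s (50% overlap), any excerpt up to 30 s long
# lies wholly inside at least one segment, wherever it falls.
SEG, HOP = 60, 30
segments = [(s, s + SEG) for s in range(0, 300, HOP)]

def fully_contained(start, dur):
    """True if the excerpt [start, start+dur) lies inside some segment."""
    return any(a <= start and start + dur <= b for a, b in segments)

# A 25 s excerpt straddling the split at t=60 is still covered, because the
# segment (30, 90) is centred over that split point.
print(fully_contained(50, 25))   # True
```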

DAn.
