critrolesync / critrolesync.github.io

https://critrolesync.github.io
MIT License

Automated timestamp cataloging #7

Closed: jpgill86 closed this issue 3 years ago

jpgill86 commented 4 years ago

A sophisticated alternative to human labor (#1) for syncing new and old episodes would be to use an automated algorithm executed weekly:

  1. Download the podcast audio file.
  2. Download the YouTube video and extract the audio.
  3. Use some audio syncing software that can identify the segments that match between the two sources.
  4. Construct the mapping between the two sources and store it in data.json.
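
In rough pseudocode, the weekly job might look something like this (every function name below is just a placeholder, not an existing library):

    # Placeholder sketch of the weekly sync job; none of these functions exist yet.
    def weekly_sync(new_episodes):
        for episode in new_episodes:
            podcast_audio = download_podcast_audio(episode)    # step 1
            youtube_audio = download_youtube_audio(episode)    # step 2
            # step 3: the missing piece -- some audio-matching library
            matches = find_matching_segments(podcast_audio, youtube_audio)
            # step 4: turn the matched segments into timestamp pairs in data.json
            write_timestamp_mapping(episode, matches, "data.json")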

I could possibly build something like this to run in the cloud, particularly in my programming language of choice (Python), but I haven't found a library yet for performing the audio matching (step 3). If you know of one, please let me know!

This approach would hopefully allow all future episodes to be synced automatically, without a human needing to manually catalog timestamps.

This may also be a viable solution to #6.

bauersimon commented 3 years ago

This is a very cool idea! I'd like to help with that. There is a library for audio processing from my university: CPJKU/madmom. One would still need to put more thought into how to design an algorithm that automatically aligns both versions (though I'm happy to help with that).

Another idea I had was to use speech recognition to find the timestamps (similar to the manual process), with something like Uberi/speech_recognition. Though running this over several hours of audio is probably very costly, so one would need to design it more efficiently.

Hit me up if this is something you'd still like to pursue :wink: (now that Campaign 3 is right around the corner, there is some more syncing to be done). Anyways thanks for your work with this project - awesome stuff!

Edit: Now that I think of it, we also use librosa/librosa for such tasks at our university. Probably also worth a shot.

jpgill86 commented 3 years ago

Hi Simon!

I love these ideas! CritRoleSync has been a bit on the back burner for me for a while, but I hope to get back to it before Campaign 3 starts. A (semi-)automated solution like this would be extremely helpful, since #6 has turned out to be more and more of a problem. For some time now, many of the Campaign 2 timestamps have been incorrect, even after fixing them once or twice before. It's a bit of a mess.

I have not begun exploring audio processing solutions, so you are very welcome to dive into it without fear that it's already a solved problem. The approach I might take would involve creating a timestamped transcript from the podcast audio and trying to align it with the already existing timestamped YouTube transcripts / closed captioning data. For example, see this, this, and this; note that the latter two are incomplete lists, perhaps because of this change in 2019, but the full timestamped caption data must be out there somewhere, or else retrievable from YouTube directly.

bauersimon commented 3 years ago

I'm a bit limited on (coding) time since I'm still recovering from a hand injury but I'd really love to help!

Giving it a bit more thought, I should be able to find similar sections in two audio streams with some help from librosa/librosa. I've had some experience with this, so it's possibly easier for me than the speech recognition solution. Also, if Matt says "last we left off" multiple times, the speech recognition approach would need some more tweaking.

The only thing I cannot really gauge yet is how fast it'll be. But I guess "slow" is always better than "manual" :smile:. And I already have some optimizations in mind.

The CC idea is also nice! Though it relies on "external data" (which is always a bit tricky) so maybe it would be good to keep this as a backup plan in case the audio analysis is too slow?

bauersimon commented 3 years ago

I did a few experiments over the weekend. Loading an audio file of several minutes into librosa already takes a few seconds on my machine. Hence loading ~8h per episode is not going to work (let alone comparing the waveforms).

Though I did some research and found the perfect solution: worldveil/dejavu. This is an open source implementation of the "shazam algorithm" (hope you're familiar with that). It can generate fingerprints of audio. When provided with a snippet of an audio file, it can find the "original" and the timestamp within it that the snippet came from. So the basic approach would be:

Manual:

  • Since the YouTube videos never change, we can fingerprint them in advance (one-time). It is enough to fingerprint ~1h of the first half of an episode (before the break) and ~1h of the second half. This produces ~400MB fingerprints per episode.

  • We need the correct timings from the YouTube videos (so this part stays a manual process) to calculate the differences to the podcast.

Automatic:

  • From the podcast version extract a ~20s sample each 55min (this has to be <1h since we generated the fingerprints for ~1h slices). This gives roughly 4-5 samples per episode.

  • Match the podcast samples against the YouTube fingerprints with dejavu. This will result in two matches - one for the first half and one for the second. Each match contains also the relative timestamp offset to the YouTube video. (Matching roughly takes 30% of the sample length so ~30s per episode)

  • Obtain the absolute offsets to the YouTube video with the correct timings that were extracted manually.

All in all the initial timestamp and fingerprint generation will take some time. Though this only has to happen once. However one needs to store those 400MB of fingerprints per episode (56GB for Campaign 2). The "realignment" should take 30s per episode (1h for Campaign 2). One can tweak this. Less fingerprints (less disk space) requires more dejavu matching (and therefore more time for each "realignment").


All estimates are done with respect to the dejavu docs.

jpgill86 commented 3 years ago

Wow! That's a great idea!

You've looked at the docs more than I have, but my sense is that dejavu is designed to identify a short music sample from a large library of songs. I think the scale of what we want to do is much, much smaller, so it should hopefully be less resource demanding. Instead of searching a library of dozens or thousands of songs, for each episode we would be searching for the right timestamp within one (or maybe broken into a few) audio files pulled from YouTube. In fact, we may be able to just fingerprint a few seconds at the start, at the end, and before and after the break for the YouTube videos. No need to fingerprint the entire YouTube episode, I hope.

This sounds like a lot of fun to hack on, and I hope I can find some time to do so! Of course, you're welcome to start. I wonder if we could run all this in the cloud using Google Colab...

jpgill86 commented 3 years ago

The dejavu algorithm uses landmarks in the spectrogram of the audio to create the fingerprints. Just so that I could get a sense of the potential here, I created spectrograms using Mathematica for episode C2E20 (over 3.5 hours in length) of both the podcast audio (top) and the YouTube audio (bottom):

[image: spectrograms of C2E20 podcast audio (top) and YouTube audio (bottom)]

Because of differences in volume and colormap normalization, the absolute darkness of colors differs between the two plots. However, if you focus on relative darkness, we can spot several shared landmarks, as well as differences that make sense:

Pretty cool!

Spectrograms of musical audio, which dejavu was designed to work with, have much more structure than these. I hope there is still enough for it to latch onto.


EDIT: Documenting how I made this:

  1. Podcast audio was downloaded by looking up the URL in the podcast feed XML file (which changes each time the ads are changed) and simply downloading the M4A audio file.
  2. YouTube audio was extracted using:
    conda create -n youtube-dl youtube-dl ffmpeg
    conda activate youtube-dl
    youtube-dl --extract-audio jyCoCqhsFp4

    which generates another M4A audio file.

  3. Mathematica code for generating the spectrograms:
    podcastAudioFile = "C:\\Users\\jeffr\\Desktop\\spectrograms\\Podcast Audio C2E20.m4a";
    youtubeAudioFile = "C:\\Users\\jeffr\\Desktop\\spectrograms\\YouTube Audio C2E20.m4a";
    podcastSpectrogram = Spectrogram[Audio[podcastAudioFile], PlotRange -> {0, 2000}];
    youtubeSpectrogram = Spectrogram[Audio[youtubeAudioFile], PlotRange -> {0, 2000}];
    GraphicsColumn[{
      Style[Text["C2E20"], FontSize -> Large],
      Show[podcastSpectrogram, FrameLabel -> {"Time (s)", "Frequency (Hz)"}, PlotLabel -> "Podcast Audio"],
      Show[youtubeSpectrogram, FrameLabel -> {"Time (s)", "Frequency (Hz)"}, PlotLabel -> "YouTube Audio"]
    }, ImageSize -> 800]
bauersimon commented 3 years ago

Glad you like the idea!

It should work pretty well since, as you pointed out, we don't have a huge library of songs. But keep in mind that there is a tradeoff between how many fingerprints need to be stored from the YouTube version and how many lookups we need with dejavu on each automatic "realignment" run. If we fingerprint a smaller section from YouTube, we need more queries to make sure one of them is successful.

Not sure if one can run this entirely in Colab, since dejavu needs a database to function (that's a key aspect of its performance). Its default setup works with Docker, and I believe you cannot run containers from within Colab.

Very cool seeing the spectrograms like this and the differences between the two versions! The dejavu docs state that 6s samples were enough to identify music pieces with 100% success rate. So I believe we should be fine with longer samples (i.e. ~20s) and be able to match non-music audio (with "less" features than music).

I'm also a bit limited on time but can't wait to get going or at least try it out manually to verify it works... or see it working in case you get to it sooner :wink:. I'll start with a little "proof of concept" tonight. My general plan would be to code a little tool that uses the python docker api to manage a dejavu instance and perform the analysis.

bauersimon commented 3 years ago

Thinking about it some more: one could probably run the realignment as a GitHub Action. There are ~30h per month included for public repositories, and one can run containers there.

jpgill86 commented 3 years ago

I love these ideas!

jpgill86 commented 3 years ago

I just added a Python module called download.py. It implements two main functions that will make it much easier to fetch audio files for this project: download_youtube_audio and download_podcast_audio.

EDIT: download_podcast_audio relies upon the most recent snapshot of the podcast feeds stored in feed-archive. For the "critical-role" podcast feed, which covers everything since C2E20, this will become out of date as soon as the ads change next. New snapshots can be obtained by running the first two scripts in that directory.

bauersimon commented 3 years ago

Very cool! I wanted to try out dejavu yesterday but I couldn't get it to run. Not entirely sure why but I suspect my conda might be broken. Need to check this again at some point.

Though I started a tiny little database wrapper that manages a postgres docker container for dejavu via the docker SDK.
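
At its core it is only a few lines with the Docker SDK, roughly like this (image tag, credentials, and port are arbitrary placeholder choices, not necessarily what the wrapper uses):

    import docker

    # Rough sketch: spin up a throwaway postgres container for dejavu to talk to.
    client = docker.from_env()
    container = client.containers.run(
        "postgres:13",
        detach=True,
        name="dejavu-db",
        environment={
            "POSTGRES_USER": "dejavu",
            "POSTGRES_PASSWORD": "dejavu",
            "POSTGRES_DB": "dejavu",
        },
        ports={"5432/tcp": 5432},  # expose the default postgres port on the host
    )

    # ... run dejavu fingerprinting / matching against localhost:5432 ...

    container.stop()
    container.remove()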


EDIT: Pushed to bauersimon/critrolesync.github.io.

bauersimon commented 3 years ago

I'll probably try to set up a vagrant environment to circumvent the installation struggles. This should also help with getting this into a github action.

bauersimon commented 3 years ago

Managed to create a vagrant setup (again @ bauersimon/critrolesync.github.io). dejavu is quite picky when it comes to its dependencies. With my current setup I get warnings during the installation even though everything works in the end.

I was finally able to do some testing with dejavu; here are my learnings:

bauersimon commented 3 years ago

Alright here is my first take on the realignment algorithm: https://github.com/bauersimon/critrolesync.github.io/commit/51e05314252e148caede652ebf43d07f4d28473d :sweat_smile:. I haven't tested it yet (also because I'm lacking the respective files - need to check out the download.py) and it probably needs some tweaking but it contains the rough ideas of how it would work.


EDIT: But that's as much as I can do for the week :sweat: because I'll be gone for a few days. Will probably continue next Tuesday.

jpgill86 commented 3 years ago

You are the man! Great progress, that's awesome!

I haven't used Vagrant or Docker before, but it's high time I learn how. Running this via GitHub Actions would also be fantastic.

Regarding the matching strategy you outlined before:

So the basic approach would be:

Manual:

  • Since the YouTube videos never change, we can fingerprint them in advance (one-time). It is enough to fingerprint ~1h of the first half of an episode (before the break) and ~1h of the second half. This produces ~400MB fingerprints per episode.

  • We need the correct timings from the YouTube videos (so this part stays a manual process) to calculate the differences to the podcast.

Automatic:

  • From the podcast version extract a ~20s sample each 55min (this has to be <1h since we generated the fingerprints for ~1h slices). This gives roughly 4-5 samples per episode.

  • Match the podcast samples against the YouTube fingerprints with dejavu. This will result in two matches - one for the first half and one for the second. Each match contains also the relative timestamp offset to the YouTube video. (Matching roughly takes 30% of the sample length so ~30s per episode)

  • Obtain the absolute offsets to the YouTube video with the correct timings that were extracted manually.

All in all the initial timestamp and fingerprint generation will take some time. Though this only has to happen once. However one needs to store those 400MB of fingerprints per episode (56GB for Campaign 2). The "realignment" should take 30s per episode (1h for Campaign 2). One can tweak this. Less fingerprints (less disk space) requires more dejavu matching (and therefore more time for each "realignment").

All estimates are done with respect to the dejavu docs.

I have some suggestions for improvements. I think we can get away with fingerprinting much less, saving time and disk space.

For at least most recent episodes, including those with dynamic ads, some differences in timing between the podcast and YouTube are predictable. The intro ads at the start of the podcast episode have never exceeded 4 minutes. The sign-off added to the end of the podcast episode is also very short, about 1 minute.

I think this means we can just fingerprint the first and last ~10 minutes of the YouTube audio, and take just two ~20 second samples from the podcast episode, ~5 minutes from the start and from the end. If dejavu can find the alignment between the first podcast sample and the first YouTube fingerprint, and the same for the later pair, we're done.
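
To spell out the arithmetic I have in mind (variable names are just for illustration, and the numbers are only an example):

    # Example numbers, just to illustrate the offset arithmetic.
    podcast_sample_time = 5 * 60    # sample pulled 5 min into the podcast (seconds)
    fingerprint_start = 0           # the YouTube fingerprint covers the first ~10 min
    match_offset = 191.7            # dejavu: sample matches 191.7 s into that segment

    # Where the sampled moment occurs in the full YouTube video:
    youtube_time = fingerprint_start + match_offset

    # Constant shift between podcast and YouTube for this half of the episode
    # (positive means the podcast runs later, e.g. because of intro ads):
    shift = podcast_sample_time - youtube_time   # 300 - 191.7 = 108.3 s, i.e. ~1:48 of ads

    # Any podcast timestamp t in this half then maps to YouTube time t - shift.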

Does that make sense?

jpgill86 commented 3 years ago

Alright here is my first take on the realignment algorithm: bauersimon@51e0531 😅. I haven't tested it yet (also because I'm lacking the respective files - need to check out the download.py) and it probably needs some tweaking but it contains the rough ideas of how it would work.

EDIT: But that's as much as I can do for the week 😓 because I'll be gone for a few days. Will probably continue next Tuesday.

Again, amazing work! I'll have to see if in the next few days I can catch up to you! 😂

bauersimon commented 3 years ago

Thanks! Well it's untested and probably not functional yet... 😅

Awesome, I get what you mean by the more optimized realignment algorithm! Maybe we can keep both and use the more involved one as a fallback in case the easier one fails?

Vagrant is pretty straightforward. Just install it (and some virtual machine manager like VirtualBox), change to the directory containing the Vagrantfile, and vagrant up. That should build everything you need. You can vagrant ssh to get into the machine and cd /vagrant to get to the folder that's shared with the machine (i.e. autosync currently). Use vagrant destroy (from outside) in case you broke something, or move the Vagrantfile, bootstrap.sh, and requirements.txt to another folder in case you want to change what the "root" inside vagrant is.

jpgill86 commented 3 years ago

Hey! I got Vagrant working and your playground.py to run, giving the correct result of a 300 second offset with 0.12 confidence! I'm so happy, haha.

(I'm also amazed by Vagrant. I've used VMs for many things, and this was so much easier than setting them up manually. Never again!)

Regarding this:

  • dejavu seems to have problems with large audio files, i.e. fingerprinting a single 10min audio file only took a few seconds but trying a single 30min file didn't terminate even after letting it run >20min, so we probably need to "slice" up our YouTube sections before fingerprinting them

Looks like the VM is configured to have only 4 GB of memory. Perhaps fingerprinting the larger audio file is memory-limited, and increasing the allotted memory will speed it up / let it actually finish. I haven't tested this yet. I still need to study your code to understand how it works. 😄

jpgill86 commented 3 years ago

@bauersimon, I did some reorganizing of the directories and modules. In particular, I moved my Python code into one importable directory. I suggest you move your autosync directory into the new critrolesync directory under src. Eventually, we can make this into one cohesive package.

jpgill86 commented 3 years ago

@bauersimon, I created a new branch called autosync and rebased your commits onto my latest changes.

I also reorganized your code into a subpackage, located at src/critrolesync/autosync. It can be imported using import critrolesync.autosync from the src directory. For now at least, the autosync subpackage is not imported directly when the top-level critrolesync package is imported, since this would require dejavu, docker, etc. every time, and I'd like to be able to use critrolesync outside the Vagrant VM.

playground.py and testdata were moved outside the package since they are currently being used for informal testing. Running sudo python3 playground.py still works, of course.

The Vagrantfile was moved to the root directory so that the whole repo is available in the VM. This is necessary since a file in the docs directory (data.json), as well as the podcast feed archive files located in feed-archive, are needed by the critrolesync package.

If you think this should be organized differently, I'm open to suggestions.

bauersimon commented 3 years ago

I'm quite busy till Tuesday but I'll probably manage to have a quick look at your work later to provide some feedback 😀

Thanks for continuing on - amazing stuff. I'm eager to see everything working - also automated with GitHub Workflows (did some research on that - it should definitely be doable)!

bauersimon commented 3 years ago

Perhaps fingerprinting the larger audio file is memory-limited...

I monitored the resources inside the VM, and also those of the database inside Docker, and I wasn't able to find the bottleneck :disappointed:. But maybe we can somehow fix it by playing with the values - it wouldn't surprise me if there was some low-level magic that would change things. Though just slicing the audio is also doable, and with your optimization in mind we really don't have to fingerprint a lot anyway.

I did some reorganizing of the directories and modules

Just had a quick look - looks nice and clean! I'd only move the bootstrap.sh and requirements.txt to the root directory with Vagrantfile since they kinda belong together and define the environment.

Also something we might want to do (before we "officially" merge everything to master) is reverse the testdata audio just to obscure it, as you don't really have "copyright" on that. Just so we don't have official Critical Role content in the repository. It shouldn't change how the tests behave anyways.


As I said, I can only continue work on the matching algorithm and look into GitHub Workflows next week. Though I'm always happy to follow your progress and give feedback when I have the time!

Also glad to see you like vagrant :wink:. I've only started using it recently and yeah it's pretty awesome.

jpgill86 commented 3 years ago

Just had a quick look - looks nice and clean! I'd only move the bootstrap.sh and requirements.txt to the root directory with Vagrantfile since they kinda belong together and define the environment.

I see your point and considered doing that myself. However, to some extent, I'm trying to keep the website side of the repo (docs) and the Python code (src) separate. In particular, the dependencies listed in requirements.txt apply only to the Python code, so I thought putting it in src made some sense. I would have kept Vagrantfile in src as well if I could, but my Python code depends on docs/data.json as well as feed-archive, so I needed to go up one level. I'm not super happy with that solution and am wondering if I could have done something with symbolic links or Vagrant configuration options instead.

I can see the argument for keeping bootstrap.sh together with Vagrantfile. If I can't find a way to move Vagrantfile into src, I may move bootstrap.sh up to the root directory.

Also something we might want to do (before we "officially" merge everything to master) is reverse the testdata audio just to obscure it, as you don't really have "copyright" on that. Just so we don't have official Critical Role content in the repository. It shouldn't change how the tests behave anyways.

Very, very good point!

jpgill86 commented 3 years ago

Also something we might want to do (before we "officially" merge everything to master) is reverse the testdata audio just to obscure it, as you don't really have "copyright" on that. Just so we don't have official Critical Role content in the repository. It shouldn't change how the tests behave anyways.

I wanted to completely remove the testdata from the history of the repo (you can never be too careful! -- well, maybe this is overkill), rather than just revert the commit later, so instead I did some rewriting of the commit history, which I force pushed to the branch. We could have squashed commits later when merging instead, but I generally prefer to preserve commits to make detective work easier.

So that we still have the test files you made, I put them on Google Drive and shared the folder with you using your Salzburg email. Let me know if some other address would be better.

bauersimon commented 3 years ago

You can of course freely configure which folders are shared with the vagrant machine. So it would be possible to put the Vagrantfile into src and share its "parent" folder in the VM. Though I would not advise that because, from what little experience I have with vagrant, the convention is always to have the Vagrantfile in the project root.

One other idea I had would be to put bootstrap.sh and requirements.txt into a separate folder called environment/ or environment/python/. This way we have all files that "define" the environment within that directory. I need those as well to set up the cloud environment within the GitHub Workflows, and with this setup they are nicely separated from the python code.


About the testdata: I think we had a misunderstanding. Rather than completely removing the data, we could just literally "reverse" the audio data (the waveform) such that we can still keep it in the repository and use it for testing, but you only hear gibberish when listening to it :wink:. I think that should be enough to obscure what it actually is. I'm also fine with removing it once everything works, though. Keep in mind that having the files in Google Drive works for us now, but not for people forking the repository in the future and wondering where all these files are that playground.py needs.
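
Reversing a clip would be a one-liner with something like pydub (just a sketch; file names are placeholders):

    from pydub import AudioSegment

    # Load a test clip, reverse the waveform so it only sounds like gibberish,
    # and write it back out. File names here are placeholders.
    clip = AudioSegment.from_file("testdata/C2E100-sample.m4a")
    clip.reverse().export("testdata/C2E100-sample-reversed.wav", format="wav")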

jpgill86 commented 3 years ago

Sweet success!

With https://github.com/critrolesync/critrolesync.github.io/commit/cba4d10662f3999b3d08c6155e06a889a849ea84:

vagrant@ubuntu-focal:/vagrant/src$ time sudo python3 playground2.py
testdata/youtube-slices/C2E100 YouTube - Beginning.m4a already fingerprinted, continuing...
testdata/youtube-slices/C2E101 YouTube - Beginning.m4a already fingerprinted, continuing...
testdata/youtube-slices/C2E102 YouTube - Beginning.m4a already fingerprinted, continuing...
testdata/youtube-slices/C2E103 YouTube - Beginning.m4a already fingerprinted, continuing...
testdata/youtube-slices/C2E104 YouTube - Beginning.m4a already fingerprinted, continuing...

C2E100: Podcast opening ads end at 0:01:48
    All matches:
         Result(name=b'C2E100 YouTube - Beginning', offset=191.70395, confidence=0.27)
         Result(name=b'C2E100 YouTube - Beginning', offset=191.70395, confidence=0.27)

C2E101: Podcast opening ads end at 0:01:23
    All matches:
         Result(name=b'C2E101 YouTube - Beginning', offset=216.87438, confidence=0.16)
         Result(name=b'C2E100 YouTube - Beginning', offset=50.52662, confidence=0.2)

C2E102: Podcast opening ads end at 0:03:05
    All matches:
         Result(name=b'C2E102 YouTube - Beginning', offset=115.17098, confidence=0.14)
         Result(name=b'C2E104 YouTube - Beginning', offset=261.45669, confidence=0.11)

C2E103: Podcast opening ads end at 0:03:14
    All matches:
         Result(name=b'C2E103 YouTube - Beginning', offset=105.88299, confidence=0.18)
         Result(name=b'C2E101 YouTube - Beginning', offset=175.96082, confidence=0.14)

C2E104: Podcast opening ads end at 0:01:41
    All matches:
         Result(name=b'C2E104 YouTube - Beginning', offset=199.36653, confidence=0.28)
         Result(name=b'C2E100 YouTube - Beginning', offset=368.50068, confidence=0.23)

real    1m29.153s
user    0m1.716s
sys     0m1.040s

Focusing on just the opening ads and using episodes C2E100-C2E104 as guinea pigs, I fingerprinted the first 6 minutes of the YouTube audio, and pulled a 20 second sample from 5 minutes into the podcast, after I knew ads would be over. All of the calculated times when the ads end are correct!

Fingerprinting these 5 episodes took some minutes and produced a 60 MB tar file. As you can see from the printout, actually performing the matching against a database already prepared with 5 fingerprints took about 18 seconds per episode.

There are some odd things here, though. Two matches were returned in each case. For C2E100, the second was identical to the first match (both correct). For C2E102-104, the second match is incorrect and has only slightly lower confidence than the first match. Strangest of all, C2E101 actually had greater confidence in an incorrect match, but because that match was returned second (dunno why) and my code assumed matches would be sorted by confidence, I got the right answer. I listened to the segment in the wrong episode that the clip supposedly matched better, and I couldn't hear any resemblance.

By increasing the podcast sample from 20 seconds to 30 or more, perhaps we can ensure there is only one, clearly best match.

My playground2.py has the process of downloading and slicing files completely automated. For slicing, I used ffmpeg, which is very fast and allows the stereo audio to be converted to mono, like you suggested. I initially tried to use pydub.AudioSegment, but it would always crash when trying to merely initialize with a full length episode.
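
The slicing call is roughly of this shape (the exact paths, times, and output format in playground2.py may differ):

    import subprocess

    # Take a 20-second mono slice starting 5 minutes into the podcast audio.
    # Paths and times here are examples, not the exact values used in playground2.py.
    subprocess.run([
        "ffmpeg", "-y",
        "-ss", "300",                      # start 300 s (5 min) into the file
        "-t", "20",                        # keep 20 s
        "-i", "C2E100 Podcast.m4a",
        "-ac", "1",                        # downmix stereo to mono
        "C2E100 Podcast - sample.wav",
    ], check=True)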

This is exciting progress!

jpgill86 commented 3 years ago

One other idea I had would be to put bootstrap.sh and requirements.txt into a separate folder called environment/ or environment/python/.

I like this. I'll do it now.

About the testdata: I think we had a misunderstanding.

You've got that right! I totally misunderstood. 🤣 But with the new code in playground2.py, which retrieves any episode on demand, I think having audio files checked into git isn't necessary.

bauersimon commented 3 years ago

This is exciting progress!

Indeed amazing work! 🤗

There are some odd things here, though. Two matches were returned in each case. For C2E100, the second was identical to the first match (both correct). For C2E102-104, the second match is incorrect and has barely less confidence than the first match.

We can circumvent this very easily. Just use one database (i.e. Matcher) instance per episode and store the fingerprints per episode in separate files. This way there is only one correct match in the database and we will only be interested in the offset. Then matching should also be quicker because there is only one fingerprint candidate to be matched 😇

My version of the complex matching algorithm should be doing this already. So feel free to have a look at that.

bauersimon commented 3 years ago

We already know which podcast episode corresponds to which YouTube video, so we don't need the matching to solve that problem as well, is what I'm trying to say 😄

jpgill86 commented 3 years ago

We can circumvent this very easily. Just use one database (i.e. Matcher) instance per episode and store the fingerprints per episode in separate files. This way there is only one correct match in the database and we will only be interested in the offset. Then matching should also be quicker because there is only one fingerprint candidate to be matched 😇

Yes, great idea!

I just re-ran the whole thing from scratch, including first tearing down the VM and deleting the database archive and every audio file. Once I had the VM up again, I ran the whole of playground2.py in one go, and it completed without any problems. Very gratifying. Downloading the files, slicing them, fingerprinting them, and matching took a total of 11 minutes. About half that time was downloading.

I got slightly different results in a few cases. I'm not aware of any stochastic aspects to the algorithm, so I will guess something was a little out of tune in my last VM. This isn't too surprising since I tried a lot of things I haven't even mentioned, including building ffmpeg from scratch because I thought I needed a different encoder... all that was unnecessary.

The new results are better vis-a-vis confidence on first matches versus second. Plus, the C2E101 result makes more sense. Here's a diff:

 C2E100: Podcast opening ads end at 0:01:48
     All matches:
-         Result(name=b'C2E100 YouTube - Beginning', offset=191.70395, confidence=0.27)
-         Result(name=b'C2E100 YouTube - Beginning', offset=191.70395, confidence=0.27)
+         Result(name=b'C2E100 YouTube - Beginning', offset=191.70395, confidence=0.19)
+         Result(name=b'C2E103 YouTube - Beginning', offset=260.99229, confidence=0.11)

 C2E101: Podcast opening ads end at 0:01:23
     All matches:
          Result(name=b'C2E101 YouTube - Beginning', offset=216.87438, confidence=0.16)
-         Result(name=b'C2E100 YouTube - Beginning', offset=50.52662, confidence=0.2)
+         Result(name=b'C2E100 YouTube - Beginning', offset=50.52662, confidence=0.12)

 C2E102: Podcast opening ads end at 0:03:05
     All matches:
          Result(name=b'C2E102 YouTube - Beginning', offset=115.17098, confidence=0.14)
          Result(name=b'C2E104 YouTube - Beginning', offset=261.45669, confidence=0.11)

 C2E103: Podcast opening ads end at 0:03:14
     All matches:
          Result(name=b'C2E103 YouTube - Beginning', offset=105.88299, confidence=0.18)
          Result(name=b'C2E101 YouTube - Beginning', offset=175.96082, confidence=0.14)

 C2E104: Podcast opening ads end at 0:01:41
     All matches:
          Result(name=b'C2E104 YouTube - Beginning', offset=199.36653, confidence=0.28)
-         Result(name=b'C2E100 YouTube - Beginning', offset=368.50068, confidence=0.23)
+         Result(name=b'C2E103 YouTube - Beginning', offset=28.65342, confidence=0.14)
jpgill86 commented 3 years ago

dejavu is quite picky when it comes to its dependencies. With my current setup I get warnings during the installation even though everything works in the end.

I wonder if dejavu's dependencies are pinned because any one of them changing has the potential to invalidate existing fingerprint databases. For instance, a slight change in the way Matplotlib renders the spectrogram might lead to entirely different hashes. If this is possible, we should probably pin some things too, or use dejavu's pins, or keep a record of pip freeze as insurance.

jpgill86 commented 3 years ago

I just pushed 0db054d, which updates playground2.py with calculations of all four timestamps for each podcast. My toy code is pretty ugly, but it's a working demo that gives the right answers!

C2E100
['youtube', 'podcast', 'comment']        confidence
('0:00:00', '0:01:48', 'hello')          0.2
('1:53:50', '1:55:38', 'break start')
('2:18:34', '1:57:55', 'break end')
('3:37:44', '3:17:05', 'IITY?')          0.25

C2E101
['youtube', 'podcast', 'comment']        confidence
('0:00:01', '0:01:23', 'hello')          0.26
('2:26:16', '2:27:38', 'break start')
('2:45:33', '2:29:56', 'break end')
('3:38:36', '3:22:59', 'IITY?')          0.16

C2E102
['youtube', 'podcast', 'comment']        confidence
('0:00:00', '0:03:05', 'hello')          0.14
('1:39:20', '1:42:25', 'break start')
('1:59:24', '1:44:42', 'break end')
('3:26:44', '3:12:02', 'IITY?')          0.28

C2E103
['youtube', 'podcast', 'comment']        confidence
('0:00:00', '0:03:14', 'hello')          0.2
('1:50:29', '1:53:43', 'break start')
('2:07:01', '1:56:00', 'break end')
('3:31:15', '3:20:14', 'IITY?')          0.2

C2E104
['youtube', 'podcast', 'comment']        confidence
('0:00:00', '0:01:41', 'hello')          0.29
('1:14:11', '1:15:52', 'break start')
('1:31:04', '1:18:10', 'break end')
('3:55:19', '3:42:25', 'IITY?')          0.15

This is very exciting!

I tried varying the duration of the samples from the podcasts (one taken at the start of each episode, and one from the end) to see how it affected accuracy, confidence, and speed. I expected the accuracy and confidence to go down with significantly shorter sample durations, and for the confidence to soar up to 100% with longer samples, but I was surprised by the result:

[image: plot of matching accuracy, confidence, and total matching time vs. podcast sample duration]

The total matching times plotted in blue do not include fingerprinting or slicing. Slicing takes just a few seconds regardless of the sample duration. I only tried fingerprinting 6-minute segments of the YouTube audio, one from the beginning and one from the end of each episode.

I was surprised to find that the perfect accuracy never wavered, until I tried 2-second samples (not shown), which gave bad results. Confidence actually decreased for some reason when the sample duration was longer than 10 seconds. Total matching time increased steadily with sample duration, as expected.

From this little test, I would say that, technically, 5-second samples win, with 10-second samples tying for confidence, but taking about 56% longer. However, I only tested 5 episodes, and I am guessing the 10-second sample duration may be more robust to rare occurrences where a 5-second sample would not be unique enough. So, I favor 10-second samples, even if they take longer to match. It's still pretty fast, taking only about 15 seconds per episode!

Note that I had fingerprints from all 10 YouTube audio segments (beginnings and endings of 5 episodes) stored within one database. I think this could be easier for file management, backups, and sharing the database with GitHub Actions. It likely makes matching a bit slower, but it also avoids the overhead of starting up a new Docker container and database, optionally writing an exported backup to disk, and tearing down for each match. I'm not sure which approach (one database .tar per match/episode, or one .tar containing all episodes) would ultimately be faster, but I think the simplicity of one file is pretty valuable. A single file could be preserved as a GitHub Actions workflow artifact pretty easily. (Of course, the many .tar files of the one-database-per-episode could be bundled together too as a last step, but that's even more work.)

jpgill86 commented 3 years ago

I started going through the episodes in order, beginning with C2E20, the first episode to have dynamic ads. I added the ability to programmatically write autosync-ed timestamps to data.json, and I have been saving and manually verifying the autosync-ed timestamps as I go. So far, I've gone up to C2E54. Right now the fixed timestamps are on the autosync branch only, but I will probably copy them over to master soon so that they can go live on the website.

I encountered just three problematic episodes so far:

I also found that I could get away with shortening the segment fingerprinted at the end of the YouTube audio, from 6 minutes down to 2, which saves time and space.

All of this is still being done in playground2.py, which is a terrible, ugly mess compared to your carefully crafted code! 🤣

jpgill86 commented 3 years ago

Alright, I'll concede that trying to keep all fingerprints in a single database was a Bad Idea. 😅 I fingerprinted all of C2E20-C2E140 (which produced a ~1 GB .tar file); attempts to match using it frequently led to "soft locks", and when that didn't happen, performing a single match took ages.

Now I'm trying what you suggested before, having a separate .tar file for each episode. Fingerprinting seems to be going somewhat slower, but matching is super fast. Since fingerprinting only needs to happen once, this seems to be the way to go. 👍

jpgill86 commented 3 years ago

OK, 365d945 is working pretty well with each episode's fingerprints saved to its own .tar file.

Like the other live episodes I mentioned, C2E73 & C2E97 also have trouble finding the first timestamp thanks to the crowd.

I think I'd like to implement a system where the slice times can be configured differently for individual episodes. Those parameters could be stored in another JSON file. This will allow us to work around things like the live episodes, or other idiosyncrasies that might make one or a handful of episodes fail with slice times that otherwise work for most episodes. Campaign 1 is likely to need different slice times anyway.
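
Something along these lines is what I'm picturing (the file doesn't exist yet; the keys and values below are only illustrative):

    # Illustrative only -- defaults plus per-episode overrides, e.g. loaded from
    # a hypothetical slice-times.json sitting next to data.json.
    DEFAULT_SLICE_TIMES = {
        "beginning_fingerprint_duration": 6 * 60,   # seconds
        "ending_fingerprint_duration": 2 * 60,
        "podcast_sample_duration": 10,
    }

    EPISODE_OVERRIDES = {
        "C2E73": {"podcast_sample_start": 10 * 60},  # live show: skip the cheering crowd
    }

    def slice_times_for(episode_id):
        params = dict(DEFAULT_SLICE_TIMES)
        params.update(EPISODE_OVERRIDES.get(episode_id, {}))
        return params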

jpgill86 commented 3 years ago

By the way, the total size of the fingerprints for C2E20-C2E140 when stored in separate files is very similar to (actually a little smaller than) when all fingerprints were stored in a single .tar. Also, the directory of fingerprints for the 121 episodes can be compressed down to a 165 MB .zip file, which is pretty efficient.

jpgill86 commented 3 years ago

I fixed every timestamp for C2E20 and later, thanks to the power of (semi-)automation! I'm so happy. I also cleaned up playground2.py a bit. It's still a work in progress.

I want to work on older episodes, but I encountered a new problem. When I try to auto-sync C2E1, I get nonsense. My hunch is that this is caused by the YouTube audio having one sample rate (44.1 kHz, the default in dejavu, and used for all C2E20+) and the podcast audio having another (48 kHz).

C2E1-C2E19 are part of the old Nerdist podcast feed for Campaign 1, and the videos belong to the Geek & Sundry YouTube channel rather than Critical Role. These differences in ownership seem to also come with differences in file format, as well as inconsistencies. C2E1 and C2E15 are different from the others in this set in that the YouTube audio files downloaded by youtube_dl with default settings have the sample rate used by later episodes (44.1 kHz). In contrast, the others have the less common sample rate (48 kHz), which they share with all podcasts in this set; these others also save with an .opus file extension when downloaded by youtube_dl with default settings, rather than an .m4a extension. I can probably request an .m4a version instead, and perhaps it will have the more common sample rate.

The problem remains that we may have some episodes that differ in sample rate between YouTube and podcast. I haven't started looking yet to see if dejavu can handle this.
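
If it turns out dejavu can't cope with mixed sample rates, one workaround might be to force everything to 44.1 kHz during the slicing step, e.g. (untested):

    import subprocess

    # Untested idea: resample while slicing so dejavu always sees 44.1 kHz mono audio.
    subprocess.run([
        "ffmpeg", "-y",
        "-i", "C2E1 Podcast.m4a",
        "-ar", "44100",        # resample to 44.1 kHz
        "-ac", "1",
        "C2E1 Podcast - 44k.wav",
    ], check=True)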

bauersimon commented 3 years ago

Wooow you've been crazy busy! (For some reason I didn't catch the latest notifications on my phone). I'm amazed with the progress. Good stuff - good stuff! I'm still quite busy with other things so sorry for not being that helpful the last few days.

From this little test, I would say that, technically, 5-second samples win, with 10-second samples tying for confidence, but taking about 56% longer.

I would've really thought that longer samples would mean higher confidence but I guess not :sweat_smile:. Though nice that you tested that!

I fixed every timestamp for C2E20 and later, thanks to the power of (semi-)automation!

Awesome :relaxed:

When I try to auto-sync C2E1, I get nonsense. My hunch is that this is caused by the YouTube audio having one sample rate.

Weird... I really thought dejavu could handle that :thinking:. Might need some different configuration?


I'd like to get going with the GitHub Actions stuff. Not sure how that works yet if I try this on my fork, but I guess there's only one way to find out :wink:.

Could you give me some hints on how to use the tooling you've built already? I'd like to propose the following workflows (run from within the cloud instance of GitHub Actions):

Realignment Workflow (run every week/month?):

New Episode Workflow (run when specifically triggered with new json containing the YouTube offsets and slicing instructions)

jpgill86 commented 3 years ago

I'd like to get going with the GitHub Actions stuff. Not sure how that works yet if I try this on my fork, but I guess there's only one way to find out 😉.

Wonderful! I've used GitHub Actions quite a bit for other projects, and, yes, it should be possible to get workflows to run on your fork.

Could you give me some hints on how to use what tooling you built already?

This should work, I hope:

cd critrolesync.github.io
vagrant up
vagrant ssh
cd /vagrant/src
sudo python3 playground2.py

Changing this line should allow you to control which episodes the code runs on. Uncommenting this line will give more information about what is happening.

Random hanging has been an annoying problem for me from the beginning. The script will just get stuck in random places at random times. Pressing Ctrl+c (multiple times if necessary) can kill the process. Generally, rerunning the script works fine, but occasionally a Docker container is left running, and then Python will complain about a port already being in use. When that happens, you need to kill the Docker container manually. Use sudo docker container ls to get the IDs of running containers, and then sudo docker container kill <id>.

Even with this intervention, I've had Docker processes running in the background, hogging my CPU. Restarting the VM helps then:

logout
vagrant halt
vagrant up

By far, the slowest step of fingerprinting (besides downloading -- sometimes a YouTube download will move at a snail's pace, and quitting and restarting will get you back to high speed) is inserting the hashes into the Postgres database managed by Docker. Actually creating the hashes for the audio is pretty fast. Furthermore, my impression is that hangups tend to happen on steps involving interaction with the database. Overall, it feels like the database is unstable and somewhat slower than it needs to be (though some of my recent changes made slowness less of an issue -- it's not too bad now). I'd love to find some performance improvements here.

I'd like to propose the following workflow (from within the cloud instance of GitHub Actions):

I like these a lot!

I'm not sure that GitHub Actions can commit to the same repo it's working on, but even if it can, my preference for now would be to have these workflows do no committing, at least not to master (opening a pull request would be pretty cool, though). Rather, once they finish, workflows can make files produced during the process available for manual download as artifacts. I want to inspect every change to data.json before it is published, and I'd like the workflow to generate reports for me to make inspection easier:

I've noticed that when there is a change to the dynamic ads, all podcasts tend to receive the same change. All of their total durations (info contained in the feeds) will increase or decrease by a fixed amount due to the ads changing in the same way for all episodes (e.g., in this feed diff, each of the "Seconds" increased by 12-13 seconds). I want to see that dejavu came up with timestamp changes that agree with this fixed amount. If I see that, I won't feel the need to manually double check every timestamp (which is very time consuming!).
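
Roughly the kind of check I have in mind (the data structures here are invented just to illustrate it):

    # Invented example data: change in each episode's feed duration vs. the change
    # in its autosynced ad offset since the last run (both in seconds).
    feed_duration_change = {"C2E100": 12.4, "C2E101": 12.6, "C2E102": 12.5}
    autosync_offset_change = {"C2E100": 12.4, "C2E101": 12.6, "C2E102": 31.0}

    TOLERANCE = 2.0  # seconds

    for episode, feed_delta in feed_duration_change.items():
        dejavu_delta = autosync_offset_change[episode]
        if abs(dejavu_delta - feed_delta) > TOLERANCE:
            print(f"{episode}: feed grew by {feed_delta} s but autosync shifted by "
                  f"{dejavu_delta} s -- flag for manual review")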

Since the fingerprint database is a large binary file, I want to store that somewhere other than in the GitHub repo. git is not the ideal tool for backing up large binary files. Rather, we may be able to find a solution where the fingerprints are saved on a storage provider like Google Drive, and the GitHub Actions workflow has access to it.

jpgill86 commented 3 years ago

Great progress today:

On to Campaign 1!

jpgill86 commented 3 years ago

#5 is rearing its ugly head for Campaign 1, ugh.

bauersimon commented 3 years ago

Furthermore, my impression is that hangups tend to happen on steps involving interaction with the database.

I can try to get rid of the docker container altogether and install postgres natively in vagrant. Maybe that helps with the hangups.

jpgill86 commented 3 years ago

#5 is rearing its ugly head for Campaign 1, ugh.

Fixed by 3a81aecf. Autosyncing is working for Campaign 1, and I am slowly working my way through it.

jpgill86 commented 3 years ago

With aceda64, playground2.py will create just one database Docker container and reuse it for each Matcher object, so that only one Docker container is created for the entire script, rather than setting up and tearing down a container for each episode every few seconds. This greatly speeds up matching if downloading and fingerprinting are already done.

This change also seems to have improved stability somewhat, though not 100%. Coincidentally, I discovered that if I SSH into the VM using a second terminal while it's stuck and type a command like top, it will get unstuck. I'm not sure what's going on there, but I'm happy to have a workaround that lets it continue, rather than needing to start over!

I've been making more progress on cataloging episodes from Campaign 1, as well as all of Exandria Unlimited.

bauersimon commented 3 years ago

Ahhh perfect! I was halfway into the native installation inside vagrant but ran into some complications that I haven't managed to solve yet. But just running the database container continuously seems like a great solution.

jpgill86 commented 3 years ago

Commit 55607d2 now makes it possible to run the script without Vagrant, using only Docker. This seems to be much more stable and even faster than before (after initial container images are downloaded).

With Docker and Docker Compose installed on your machine, just run

cd critrolesync.github.io
docker compose build
docker compose run python /bin/bash
python playground2.py

The original method using Vagrant still works:

cd critrolesync.github.io
vagrant up
vagrant ssh
cd /vagrant/src
sudo python3 playground2.py

I think that if I understood Docker container networking slightly better, the new code would have been quite a bit simpler (e.g., all of this could be removed, and Database could handle the container creation just as it does with Vagrant; perhaps Docker Compose could be eliminated altogether). However, after banging my head against the keyboard for several hours, I'm settling for something that works, even if it's a little more complicated. (I felt like I was probably just one or two parameter settings away from it working. Grrr, frustrating, haha!)

jpgill86 commented 3 years ago

I think that if I understood Docker container networking slightly better, the new code would have been quite a bit simpler [...] (I felt like I was probably just one or two parameter settings away from it working. Grrr, frustrating, haha!)

Turns out my settings were all correct! I just needed to introduce a delay with sleep to give the spawned container enough time to connect to the network. After fixing that (bae02cf), I was able to make the simplifications I wanted (0ee31ad).

I added an entrypoint command to the Dockerfile (caef05b), and I changed the name of the main container, so the commands for launching via Docker only (without Vagrant) have changed a bit:

cd critrolesync.github.io
docker compose build
docker compose run autosync

I also decided to pin Python dependencies to versions known today to give correct results; we don't want Matplotlib changing its colormaps or whatever and suddenly all our archived fingerprints are no longer accurate! I tested that everything works with the latest versions of packages in Python 3.9 today, and pinned to those versions (e701727).

jpgill86 commented 3 years ago

Hi @bauersimon,

I renamed playground2.py to __main__.py and moved it into the autosync subpackage. This means it can now be invoked using python -m critrolesync.autosync. I updated the Dockerfile to use this command, so nothing really changes if you are running it through Docker.

I merged all of the changes so far into the master branch (#18). Of course, I still think there is more to do. In particular, I want to be able to pass command line arguments to the script so that one does not need to edit the file to configure which episodes it runs on or with what settings (e.g., whether or not to re-fingerprint). Doing that would allow a GitHub Actions workflow to be built which accepts these parameters and runs synchronization in the cloud. There is certainly more refactoring that could be done too. I'm going to leave this issue open since I don't consider it completely resolved yet.

By the way, I built a GitHub Actions workflow, archive-podcast-feeds.yml, which checks for podcast feed changes every hour and opens a pull request automatically when it finds any. It would be wonderful if we could bring a similar level of automation to the execution of autosync. I foresee some difficulties there, since a certain level of human intervention may always be needed (e.g., initial determination of the YouTube timestamps, documentation of YouTube URLs), but I am hopeful we can do more.

bauersimon commented 3 years ago

Awesome progress! I'm happy that you got this far! Glad I could be of some help. I'm still quite limited on time, so I'm not sure when I'll manage to get some more work done on this.

jpgill86 commented 3 years ago

It's DONE! I've finally finished cataloging all of the C1 episodes, and autosync was a huge help. Ready for Campaign 3!

Since we last communicated, I added command line argument parsing (b9a8179), so now it will be possible to run docker compose run autosync C3E1 (after manually cataloging the YouTube timestamps, of course) to determine the podcast timestamps. It accepts flags for re-downloading audio sources, re-slicing, and re-fingerprinting.
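
For anyone scripting around it, the interface is along these lines (the flag names here are paraphrased, not necessarily the real ones; see __main__.py for the actual arguments):

    import argparse

    # Paraphrased sketch of the CLI -- the actual argument names live in
    # src/critrolesync/autosync/__main__.py.
    parser = argparse.ArgumentParser(prog="python -m critrolesync.autosync")
    parser.add_argument("episodes", nargs="+", help="episode IDs, e.g. C3E1")
    parser.add_argument("--redownload", action="store_true", help="re-download audio sources")
    parser.add_argument("--reslice", action="store_true", help="re-slice audio samples")
    parser.add_argument("--refingerprint", action="store_true", help="rebuild fingerprints")
    args = parser.parse_args()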