This is a very cool idea! I'd like to help with that. There is a library that helps with audio processing from my university: CPJKU/madmom. Though one would need to put more thought into how to design an algorithm that automatically aligns both versions (though I'm happy to help with that).
Some other idea I had would be using speech recognition to find the timestamps (similar to the manual process). Something like Uberi/speech_recognition. Though running this over several hours of audio is probably very costly, so one would need to design this more efficiently.
Hit me up if this is something you'd still like to pursue :wink: (now that Campaign 3 is right around the corner, there is some more syncing to be done). Anyways thanks for your work with this project - awesome stuff!
Edit: Now that I think of it, we also use librosa/librosa for such tasks at our university. Probably also worth a shot.
Hi Simon!
I love these ideas! CritRoleSync has been a bit on the back burner for me for a while, but I hope to get back to it before Campaign 3 starts. A (semi-)automated solution like this would be extremely helpful, since #6 has turned out to be more and more of a problem. For some time now, many of the Campaign 2 timestamps have been incorrect, even after fixing them once or twice before. It's a bit of a mess.
I have not begun exploring audio processing solutions, so you are very welcome to dive into it without fear that it's already a solved problem. The approach I might take would involve creating a timestamped transcript from the podcast audio and trying to align it with the already existing timestamped YouTube transcripts / closed captioning data. For example, see this, this, and this; note that the latter two are incomplete lists, perhaps because of this change in 2019, but the full timestamped caption data must be out there somewhere, or else retrievable from YouTube directly.
I'm a bit limited on (coding) time since I'm still recovering from a hand injury but I'd really love to help!
Giving it a bit more thought, I should be able to find similar sections in two audio streams with some help from librosa/librosa. I've had some experience with this so this is possibly easier for me than the speech detection solution. Also if Matt says "last we left off" multiple times the speech detection solution would need some more tweaking.
The only thing I cannot really gauge yet is how fast it'll be. But I guess "slow" is always better than "manual" :smile:. And I also have some optimizations in mind already.
The CC idea is also nice! Though it relies on "external data" (which is always a bit tricky) so maybe it would be good to keep this as a backup plan in case the audio analysis is too slow?
I did a few experiments over the weekend. Loading an audio file of several minutes into librosa already takes a few seconds on my machine. Hence loading ~8h per episode is not going to work (let alone comparing the waveforms).

Though I did some research and found the perfect solution: worldveil/dejavu. This is an open source implementation of the "Shazam algorithm" (hope you're familiar with that). It can generate fingerprints of audio. When provided with a snippet of an audio file, it can find the "original" and the timestamp the snippet is from. So the basic approach would be:

Manual:
1. Since the YouTube videos never change, we can fingerprint them in advance (one-time). It is enough to fingerprint ~1h of the first half of an episode (before the break) and ~1h of the second half. This produces ~400MB of fingerprints per episode.
2. We need the correct timings from the YouTube videos (so this part stays a manual process) to calculate the differences to the podcast.

Automatic:
1. From the podcast version, extract a ~20s sample every 55min (this has to be <1h since we generated the fingerprints for ~1h slices). This gives roughly 4-5 samples per episode.
2. Match the podcast samples against the YouTube fingerprints with dejavu. This will result in two matches - one for the first half and one for the second. Each match also contains the relative timestamp offset into the YouTube video. (Matching roughly takes 30% of the sample length, so ~30s per episode.)
3. Obtain the absolute offsets to the YouTube video using the correct timings that were extracted manually.

All in all, the initial timestamp and fingerprint generation will take some time, though this only has to happen once. However, one needs to store those ~400MB of fingerprints per episode (56GB for Campaign 2). The "realignment" should take ~30s per episode (1h for Campaign 2). One can tweak this: fewer fingerprints (less disk space) requires more dejavu matching (and therefore more time for each "realignment").

All estimates are based on the dejavu docs.
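Step 3 of the Automatic part above boils down to simple offset arithmetic. Here is a minimal sketch of that bookkeeping (the function and variable names are made up for illustration, not taken from any existing code), assuming dejavu reports the match offset relative to the start of the fingerprinted slice:

def youtube_to_podcast(youtube_time, sample_start_in_podcast,
                       slice_start_in_youtube, match_offset):
    """Map a manually determined YouTube timestamp onto the podcast timeline.

    sample_start_in_podcast: seconds into the podcast where the ~20 s sample was cut
    slice_start_in_youtube:  seconds into the YouTube video where the fingerprinted slice begins
    match_offset:            dejavu's reported offset of the sample within that slice (seconds)
    """
    # The podcast sample corresponds to this absolute position in the YouTube video...
    sample_position_in_youtube = slice_start_in_youtube + match_offset
    # ...so the podcast runs ahead of (or behind) YouTube by this constant shift:
    shift = sample_start_in_podcast - sample_position_in_youtube
    return youtube_time + shift

# Example: a sample cut 55 min into the podcast matches 2 min into the slice
# that starts 50 min into the YouTube video -> the podcast runs 3 min "late".
assert youtube_to_podcast(60 * 60, 55 * 60, 50 * 60, 2 * 60) == 63 * 60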
Wow! That's a great idea!
You've looked at the docs more than I have, but my sense is that dejavu is designed to identify a short music sample from a large library of songs. I think the scale of what we want to do is much, much smaller, so it should hopefully be less resource-demanding. Instead of searching a library of dozens or thousands of songs, for each episode we would be searching for the right timestamp within a single audio file (or maybe a few) pulled from YouTube. In fact, we may be able to just fingerprint a few seconds at the start, at the end, and before and after the break for the YouTube videos. No need to fingerprint the entire YouTube episode, I hope.
This sounds like a lot of fun to hack on, and I hope I can find some time to do so! Of course, you're welcome to start. I wonder if we could run all this in the cloud using Google Colab...
The dejavu algorithm uses landmarks in the spectrogram of the audio to create the fingerprints. Just so that I could get a sense of the potential here, I created spectrograms using Mathematica for episode C2E20 (over 3.5 hours in length) of both the podcast audio (top) and the YouTube audio (bottom):
Because of differences in volume and colormap normalization, the absolute darkness of colors differs between the two plots. However, if you focus on relative darkness, we can spot several shared landmarks, as well as differences that make sense.
Pretty cool!
Spectrograms of musical audio, which dejavu was designed to work with, have much more structure than these. I hope there is still enough for it to latch onto.
EDIT: Documenting how I made this:
conda create -n youtube-dl youtube-dl ffmpeg
conda activate youtube-dl
youtube-dl --extract-audio jyCoCqhsFp4
which generates another M4A audio file.
podcastAudioFile = "C:\\Users\\jeffr\\Desktop\\spectrograms\\Podcast Audio C2E20.m4a";
youtubeAudioFile = "C:\\Users\\jeffr\\Desktop\\spectrograms\\YouTube Audio C2E20.m4a";
podcastSpectrogram = Spectrogram[Audio[podcastAudioFile], PlotRange -> {0, 2000}];
youtubeSpectrogram = Spectrogram[Audio[youtubeAudioFile], PlotRange -> {0, 2000}];
GraphicsColumn[{
Style[Text["C2E20"], FontSize -> Large],
Show[podcastSpectrogram, FrameLabel -> {"Time (s)", "Frequency (Hz)"}, PlotLabel -> "Podcast Audio"],
Show[youtubeSpectrogram, FrameLabel -> {"Time (s)", "Frequency (Hz)"}, PlotLabel -> "YouTube Audio"]
}, ImageSize -> 800]
Glad you like the idea!
It should work pretty well since we don't have a huge library of songs, as you pointed out. But keep in mind that there is a tradeoff between how many fingerprints need to be stored from the YouTube version and how many lookups we need with dejavu on each automatic "realignment" run. If we fingerprint a smaller section from YouTube, we need more queries to make sure one of them is successful.
Not sure if one can run this entirely in Colab, since dejavu needs a database to function (it's a key aspect of its performance). Its default setup works with Docker, and I believe you cannot run containers from within Colab.
Very cool seeing the spectrograms like this and the differences between the two versions! The dejavu docs state that 6s samples were enough to identify music pieces with a 100% success rate. So I believe we should be fine with longer samples (i.e. ~20s) and be able to match non-music audio (which has "fewer" features than music).
I'm also a bit limited on time but can't wait to get going or at least try it out manually to verify it works... or see it working in case you get to it sooner :wink:. I'll start with a little "proof of concept" tonight. My general plan would be to code a little tool that uses the Python Docker API to manage a dejavu instance and perform the analysis.
Thinking about it, one could probably run the realignment as a GitHub Action. There are ~30h per month included for public repositories and one can run containers there.
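For what it's worth, a wrapper like that around the Python Docker SDK could start the database along these lines (the image tag, credentials, and port here are placeholders, not the project's actual configuration):

import docker

client = docker.from_env()

# Start a throwaway Postgres container for dejavu to use as its fingerprint store.
container = client.containers.run(
    'postgres:13',
    name='dejavu-db',
    environment={
        'POSTGRES_USER': 'postgres',
        'POSTGRES_PASSWORD': 'password',
        'POSTGRES_DB': 'dejavu',
    },
    ports={'5432/tcp': 5432},
    detach=True,
)

try:
    pass  # ... fingerprint and match with dejavu against this database ...
finally:
    container.stop()
    container.remove()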
I love these ideas!
I just added a Python module called download.py. It implements two main functions that will make it much easier to fetch audio files for this project: download_youtube_audio and download_podcast_audio.

- download_youtube_audio: Pass the episode ID, in the form 'C#E#', e.g., download_youtube_audio('C2E20'). This uses youtube_dl to extract the audio from the YouTube video and saves it locally.
- download_podcast_audio: Pass the title or the beginning of the title of the podcast episode. For all Campaign 2 episodes, the episode ID in the form 'C2E#' should suffice. For Campaign 1, you will need 'Vox Machina Ep. #'. For example, download_podcast_audio('C2E20') or download_podcast_audio('Vox Machina Ep. 99').

EDIT: download_podcast_audio relies upon the most recent snapshot of the podcast feeds stored in feed-archive. For the "critical-role" podcast feed, which covers everything since C2E20, this will become out of date as soon as the ads change next. New snapshots can be obtained by running the first two scripts in that directory.
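Under the hood, the YouTube side presumably amounts to a youtube_dl call roughly like the following (an illustrative sketch only, not the actual contents of download.py; the video ID is the C2E20 one used earlier):

import youtube_dl

# Illustrative only: the real implementation and options in download.py may differ.
ydl_opts = {
    'format': 'bestaudio/best',
    'outtmpl': 'C2E20 YouTube Audio.%(ext)s',
    'postprocessors': [{'key': 'FFmpegExtractAudio'}],  # drop the video, keep the audio
}

with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(['https://www.youtube.com/watch?v=jyCoCqhsFp4'])  # C2E20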
Very cool! I wanted to try out dejavu yesterday but I couldn't get it to run. Not entirely sure why, but I suspect my conda might be broken. Need to check this again at some point.
Though I started a tiny little database wrapper that manages a postgres Docker container for dejavu via the docker SDK.
EDIT: Pushed to bauersimon/critrolesync.github.io.
I'll probably try to set up a vagrant environment to circumvent the installation struggles. This should also help with getting this into a GitHub Action.
Managed to create a vagrant setup (again @ bauersimon/critrolesync.github.io). dejavu is quite picky when it comes to its dependencies. With my current setup I get warnings during the installation even though everything works in the end.
I was finally able to do some testing with dejavu, here are my learnings:

- Even after reducing a test sample's volume (-3dB) and adding some very noticeable 0.02dB white noise, dejavu was still able to find the correct spots with 13% confidence (which doesn't sound like much but is pretty solid - also because the podcasts won't be that "ugly").
- Converting the audio to mono before passing it to dejavu cuts processing time in half, since each channel is handled separately.
- dejavu seems to have problems with large audio files, i.e. fingerprinting a single 10min audio file only took a few seconds, but a single 30min file didn't terminate even after letting it run >20min, so we probably need to "slice" up our YouTube sections before fingerprinting them.

Alright, here is my first take on the realignment algorithm: https://github.com/bauersimon/critrolesync.github.io/commit/51e05314252e148caede652ebf43d07f4d28473d :sweat_smile:. I haven't tested it yet (also because I'm lacking the respective files - need to check out download.py) and it probably needs some tweaking, but it contains the rough ideas of how it would work.
EDIT: But that's as much as I can do for the week :sweat: because I'll be gone for a few days. Will probably continue next Tuesday.
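For anyone following along, the core dejavu calls look roughly like this (based on the current dejavu README; import paths and config keys vary between dejavu versions, and the database credentials here are placeholders):

from dejavu import Dejavu
from dejavu.logic.recognizer.file_recognizer import FileRecognizer  # path differs in older releases

# Placeholder credentials; in practice these point at the Postgres container.
config = {
    "database": {
        "host": "127.0.0.1",
        "user": "postgres",
        "password": "password",
        "database": "dejavu",
    },
    "database_type": "postgres",
}

djv = Dejavu(config)

# One-time: fingerprint a sliced-up YouTube section.
djv.fingerprint_file("C2E20 YouTube - Beginning.m4a")

# Later: match a short podcast sample; the result includes the offset into the fingerprinted file.
matches = djv.recognize(FileRecognizer, "C2E20 Podcast - sample.m4a")
print(matches)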
You are the man! Great progress, that's awesome!
I haven't used Vagrant or Docker before, but it's high time I learn how. Running this via GitHub Actions would also be fantastic.
Regarding the matching strategy you outlined before:
So the basic approach would be:
Manual:
1. Since the YouTube videos never change, we can fingerprint them in advance (one-time). It is enough to fingerprint ~1h of the first half of an episode (before the break) and ~1h of the second half. This produces ~400MB of fingerprints per episode.
2. We need the correct timings from the YouTube videos (so this part stays a manual process) to calculate the differences to the podcast.
Automatic:
1. From the podcast version, extract a ~20s sample every 55min (this has to be <1h since we generated the fingerprints for ~1h slices). This gives roughly 4-5 samples per episode.
2. Match the podcast samples against the YouTube fingerprints with dejavu. This will result in two matches - one for the first half and one for the second. Each match also contains the relative timestamp offset into the YouTube video. (Matching roughly takes 30% of the sample length, so ~30s per episode.)
3. Obtain the absolute offsets to the YouTube video using the correct timings that were extracted manually.
All in all, the initial timestamp and fingerprint generation will take some time, though this only has to happen once. However, one needs to store those ~400MB of fingerprints per episode (56GB for Campaign 2). The "realignment" should take ~30s per episode (1h for Campaign 2). One can tweak this: fewer fingerprints (less disk space) requires more dejavu matching (and therefore more time for each "realignment").
All estimates are based on the dejavu docs.
I have some suggestions for improvements. I think we can get away with fingerprinting much less, saving time and disk space.
For at least the most recent episodes, including those with dynamic ads, some differences in timing between the podcast and YouTube are predictable. The intro ads at the start of the podcast episode have never exceeded 4 minutes. The sign-off added to the end of the podcast episode is also very short, about 1 minute.
I think this means we can just fingerprint the first and last ~10 minutes of the YouTube audio, and take just two ~20 second samples from the podcast episode, one ~5 minutes from the start and one ~5 minutes from the end. If dejavu can find the alignment between the first podcast sample and the first YouTube fingerprint, and the same for the later pair, we're done.
Does that make sense?
Alright, here is my first take on the realignment algorithm: bauersimon@51e0531 😅. I haven't tested it yet (also because I'm lacking the respective files - need to check out download.py) and it probably needs some tweaking, but it contains the rough ideas of how it would work.
EDIT: But that's as much as I can do for the week 😓 because I'll be gone for a few days. Will probably continue next Tuesday.
Again, amazing work! I'll have to see if in the next few days I can catch up to you! 😂
Thanks! Well it's untested and probably not functional yet... 😅
Awesome, I get what you mean about the more optimized realignment algorithm! Maybe we can keep both and use the more involved one as a fallback in case the simpler one fails?
Vagrant is pretty straightforward. Just install it (and some virtual machine manager like VirtualBox), change to the directory containing the Vagrantfile and run vagrant up. That should build everything you need. You can vagrant ssh to get into the machine and cd /vagrant to reach the folder that's shared with the machine (i.e. autosync currently). Use vagrant destroy (from outside the VM) in case you broke something, or move the Vagrantfile, bootstrap.sh and requirements.txt to another folder in case you want to change what the "root" inside Vagrant is.
Hey! I got Vagrant working and your playground.py to run, giving the correct result of a 300 second offset with 0.12 confidence! I'm so happy, haha.
(I'm also amazed by Vagrant. I've used VMs for many things, and this was so much easier than setting them up manually. Never again!)
Regarding this:
dejavu seems to have problems with large audio files, i.e. fingerprinting a single 10min audio file only took a few seconds but trying a single 30min file didn't terminate even after letting it run >20min, so we probably need to "slice" up our YouTube sections before fingerprinting them
Looks like the VM is configured to have only 4 GB of memory. Perhaps fingerprinting the larger audio file is memory-limited, and increasing the allotted memory will speed it up / let it actually finish. I haven't tested this yet. I still need to study your code to understand how it works. 😄
@bauersimon, I did some reorganizing of the directories and modules. In particular, I moved my Python code into one importable directory. I suggest you move your autosync directory into the new critrolesync directory under src. Eventually, we can make this into one cohesive package.
@bauersimon, I created a new branch called autosync and rebased your commits onto my latest changes.
I also reorganized your code into a subpackage, located at src/critrolesync/autosync. It can be imported using import critrolesync.autosync from the src directory. For now at least, the autosync subpackage is not imported directly when the top-level critrolesync package is imported, since this would require dejavu, docker, etc. every time, and I'd like to be able to use critrolesync outside the Vagrant VM.
playground.py and testdata were moved outside the package since they are currently being used for informal testing. Running sudo python3 playground.py still works, of course.
The Vagrantfile was moved to the root directory so that the whole repo is available in the VM. This is necessary since a file in the docs directory (data.json), as well as the podcast feed archive files located in feed-archive, are needed by the critrolesync package.
If you think this should be organized differently, I'm open to suggestions.
I'm quite busy till Tuesday but I'll probably manage to have a quick look at your work later to provide some feedback 😀
Thanks for continuing on - amazing stuff. I'm eager to see everything working - also automated with GitHub Workflows (did some research on that - it should definitely be doable)!
Perhaps fingerprinting the larger audio file is memory-limited...
I monitored the resources inside the VM and also of the database container inside Docker, and I wasn't able to find the bottleneck :disappointed:. But maybe we can somehow fix it by playing with the values - it wouldn't surprise me if there was some low-level magic that would change things. Though just slicing the audio is also doable, and with your optimization in mind we really don't have to fingerprint a lot.
I did some reorganizing of the directories and modules
Just had a quick look - looks nice and clean! I'd only move the bootstrap.sh and requirements.txt to the root directory with the Vagrantfile since they kinda belong together and define the environment.
Also something we might want to do (before we "officially" merge everything to master) is reverse the testdata audio just to obscure it, as you don't really have "copyright" on that. Just so we don't have official Critical Role content in the repository. It shouldn't change how the tests behave anyways.
As I said, I can only continue work on the matching algorithm and look into GitHub Workflows next week. Though I'm always happy to follow your progress and give feedback when I have the time!
Also glad to see you like vagrant :wink:. I've only started using it recently and yeah it's pretty awesome.
Just had a quick look - looks nice and clean! I'd only move the bootstrap.sh and requirements.txt to the root directory with the Vagrantfile since they kinda belong together and define the environment.
I see your point and considered doing that myself. However, to some extent, I'm trying to keep the website side of the repo (docs) and the Python code (src) separate. In particular, the dependencies listed in requirements.txt apply only to the Python code, so I thought putting it in src made some sense. I would have kept Vagrantfile in src as well if I could, but my Python code depends on docs/data.json as well as feed-archive, so I needed to go up one level. I'm not super happy with that solution and am wondering if I could have done something with symbolic links or Vagrant configuration options instead.
I can see the argument for keeping bootstrap.sh together with Vagrantfile. If I can't find a way to move Vagrantfile into src, I may move bootstrap.sh up to the root directory.
Also something we might want to do (before we "officially" merge everything to master) is reverse the testdata audio just to obscure it, as you don't really have "copyright" on that. Just so we don't have official Critical Role content in the repository. It shouldn't change how the tests behave anyways.
Very, very good point!
Also something we might want to do (before we "officially" merge everything to master) is reverse the testdata audio just to obscure it, as you don't really have "copyright" on that. Just so we don't have official Critical Role content in the repository. It shouldn't change how the tests behave anyways.
I wanted to completely remove the testdata from the history of the repo (you can never be too careful! -- well, maybe this is overkill), rather than just revert the commit later, so instead I did some rewriting of the commit history, which I force pushed to the branch. We could have squashed commits later when merging instead, but I generally prefer to preserve commits to make detective work easier.
So that we still have the test files you made, I put them on Google Drive and shared the folder with you using your Salzburg email. Let me know if some other address would be better.
You can of course freely configure which folders are shared with the vagrant machine. So it would be possible to put the Vagrantfile into src and share its "parent" folder in the VM. Though I would not advise that, because from what little experience I have with vagrant, it's always convention to have the Vagrantfile in the project root.
One other idea I had would be to put bootstrap.sh and requirements.txt into a separate folder called environment/ or environment/python/. This way we have all files that "define" the environment within that directory. I need those as well to set up the cloud environment within the GitHub Workflows, and with this setup this is nicely separated from the Python code.
About the testdata: I think we had a misunderstanding. Rather than completely removing the data, we could just literally "reverse" the audio data (the waveform) such that we can still keep it in the repository and use it for testing, but you only hear gibberish when listening to it :wink:. I think that should be enough to obscure what it actually is. I'm also fine with removing it once everything works, though. Keep in mind that having the files in Google Drive works for us now, but not for people forking the repository in the future, wondering where all these files are that playground.py needs.
Sweet success!
vagrant@ubuntu-focal:/vagrant/src$ time sudo python3 playground2.py
testdata/youtube-slices/C2E100 YouTube - Beginning.m4a already fingerprinted, continuing...
testdata/youtube-slices/C2E101 YouTube - Beginning.m4a already fingerprinted, continuing...
testdata/youtube-slices/C2E102 YouTube - Beginning.m4a already fingerprinted, continuing...
testdata/youtube-slices/C2E103 YouTube - Beginning.m4a already fingerprinted, continuing...
testdata/youtube-slices/C2E104 YouTube - Beginning.m4a already fingerprinted, continuing...
C2E100: Podcast opening ads end at 0:01:48
All matches:
Result(name=b'C2E100 YouTube - Beginning', offset=191.70395, confidence=0.27)
Result(name=b'C2E100 YouTube - Beginning', offset=191.70395, confidence=0.27)
C2E101: Podcast opening ads end at 0:01:23
All matches:
Result(name=b'C2E101 YouTube - Beginning', offset=216.87438, confidence=0.16)
Result(name=b'C2E100 YouTube - Beginning', offset=50.52662, confidence=0.2)
C2E102: Podcast opening ads end at 0:03:05
All matches:
Result(name=b'C2E102 YouTube - Beginning', offset=115.17098, confidence=0.14)
Result(name=b'C2E104 YouTube - Beginning', offset=261.45669, confidence=0.11)
C2E103: Podcast opening ads end at 0:03:14
All matches:
Result(name=b'C2E103 YouTube - Beginning', offset=105.88299, confidence=0.18)
Result(name=b'C2E101 YouTube - Beginning', offset=175.96082, confidence=0.14)
C2E104: Podcast opening ads end at 0:01:41
All matches:
Result(name=b'C2E104 YouTube - Beginning', offset=199.36653, confidence=0.28)
Result(name=b'C2E100 YouTube - Beginning', offset=368.50068, confidence=0.23)
real 1m29.153s
user 0m1.716s
sys 0m1.040s
Focusing on just the opening ads and using episodes C2E100-C2E104 as guinea pigs, I fingerprinted the first 6 minutes of the YouTube audio, and pulled a 20 second sample from 5 minutes into the podcast, after I knew ads would be over. All of the calculated times when the ads end are correct!
Fingerprinting these 5 episodes took some minutes and produced a 60 MB tar file. As you can see from the printout, actually performing the matching against a database already prepared with 5 fingerprints took about 18 seconds per episode.
There are some odd things here, though. Two matches were returned in each case. For C2E100, the second was identical to the first match (both correct). For C2E102-104, the second match is incorrect and has barely less confidence than the first match. Strangest of all, C2E101 actually had greater confidence in an incorrect match, but because that match was returned second (dunno why) and my code assumed matches would be sorted by confidence, I got the right answer. I listened to the segment in the wrong episode that the clip supposedly matched better, and I couldn't hear any resemblance.
By increasing the podcast sample from 20 seconds to 30 or more, perhaps we can ensure there is only one, clearly best match.
My playground2.py has the process of downloading and slicing files completely automated. For slicing, I used ffmpeg, which is very fast and allows the stereo audio to be converted to mono, like you suggested. I initially tried to use pydub.AudioSegment, but it would always crash when trying to merely initialize with a full-length episode.
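For reference, the kind of ffmpeg call I mean looks roughly like this (the exact flags used in playground2.py may differ):

import subprocess

def slice_audio(src, dst, start_seconds, duration_seconds):
    """Cut a short mono sample out of a (possibly hours-long) audio file with ffmpeg."""
    subprocess.run([
        'ffmpeg', '-y',
        '-ss', str(start_seconds),    # seek to the start of the sample
        '-t', str(duration_seconds),  # keep only this many seconds
        '-i', src,
        '-ac', '1',                   # downmix stereo to mono
        dst,
    ], check=True)

# e.g., a 20-second sample starting 5 minutes into the podcast audio
slice_audio('C2E100 Podcast.m4a', 'C2E100 Podcast - sample.m4a', 5 * 60, 20)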
This is exciting progress!
One other idea I had would be to put bootstrap.sh and requirements.txt into a separate folder called environment/ or environment/python/.
I like this. I'll do it now.
About the testdata: I think we had a misunderstanding.
You've got that right! I totally misunderstood. 🤣 But with the new code in playground2.py, which retrieves any episode on demand, I think having audio files checked into git isn't necessary.
This is exciting progress!
Indeed amazing work! 🤗
There are some odd things here, though. Two matches were returned in each case. For C2E100, the second was identical to the first match (both correct). For C2E102-104, the second match is incorrect and has barely less confidence than the first match.
We can circumvent this very easily. Just use one database (i.e. Matcher) instance per episode and store the fingerprints per episode in separate files. This way there is only one correct match in the database and we will only be interested in the offset. Then matching should also be quicker because there is only one fingerprint candidate to be matched 😇
My version of the complex matching algorithm should be doing this already. So feel free to have a look at that.
We already know which podcast episode corresponds to which YouTube video, so we don't need the matching to solve this problem as well, is what I'm trying to say 😄
We can circumvent this very easily. Just use one database (i.e. Matcher) instance per episode and store the fingerprints per episode in separate files. This way there is only one correct match in the database and we will only be interested in the offset. Then matching should also be quicker because there is only one fingerprint candidate to be matched 😇
Yes, great idea!
I just re-ran the whole thing from scratch, including first tearing down the VM and deleting the database archive and every audio file. Once I had the VM up again, I ran the whole of playground2.py in one go, and it completed without any problems. Very gratifying. Downloading the files, slicing them, fingerprinting them, and matching took a total of 11 minutes. About half that time was downloading.
I got slightly different results in a few cases. I'm not aware of any stochastic aspects to the algorithm, so I will guess something was a little out of tune in my last VM. This isn't too surprising, since I tried a lot of things I haven't even mentioned, including building ffmpeg from scratch because I thought I needed a different encoder... all that was unnecessary.
The new results are better vis-a-vis confidence on first matches versus second. Plus, the C2E101 result makes more sense. Here's a diff:
C2E100: Podcast opening ads end at 0:01:48
All matches:
- Result(name=b'C2E100 YouTube - Beginning', offset=191.70395, confidence=0.27)
- Result(name=b'C2E100 YouTube - Beginning', offset=191.70395, confidence=0.27)
+ Result(name=b'C2E100 YouTube - Beginning', offset=191.70395, confidence=0.19)
+ Result(name=b'C2E103 YouTube - Beginning', offset=260.99229, confidence=0.11)
C2E101: Podcast opening ads end at 0:01:23
All matches:
Result(name=b'C2E101 YouTube - Beginning', offset=216.87438, confidence=0.16)
- Result(name=b'C2E100 YouTube - Beginning', offset=50.52662, confidence=0.2)
+ Result(name=b'C2E100 YouTube - Beginning', offset=50.52662, confidence=0.12)
C2E102: Podcast opening ads end at 0:03:05
All matches:
Result(name=b'C2E102 YouTube - Beginning', offset=115.17098, confidence=0.14)
Result(name=b'C2E104 YouTube - Beginning', offset=261.45669, confidence=0.11)
C2E103: Podcast opening ads end at 0:03:14
All matches:
Result(name=b'C2E103 YouTube - Beginning', offset=105.88299, confidence=0.18)
Result(name=b'C2E101 YouTube - Beginning', offset=175.96082, confidence=0.14)
C2E104: Podcast opening ads end at 0:01:41
All matches:
Result(name=b'C2E104 YouTube - Beginning', offset=199.36653, confidence=0.28)
- Result(name=b'C2E100 YouTube - Beginning', offset=368.50068, confidence=0.23)
+ Result(name=b'C2E103 YouTube - Beginning', offset=28.65342, confidence=0.14)
dejavu is quite picky when it comes to its dependencies. With my current setup I get warnings during the installation even though everything works in the end.
I wonder if dejavu's dependencies are pinned because any one of them changing has the potential to invalidate existing fingerprint databases. For instance, a slight change in the way Matplotlib renders the spectrogram might lead to entirely different hashes. If this is possible, we should probably pin some things too, or use dejavu's pins, or keep a record of pip freeze as insurance.
I just pushed 0db054d, which updates playground2.py with calculations of all four timestamps for each podcast. My toy code is pretty ugly, but it's a working demo that gives the right answers!
C2E100
['youtube', 'podcast', 'comment'] confidence
('0:00:00', '0:01:48', 'hello') 0.2
('1:53:50', '1:55:38', 'break start')
('2:18:34', '1:57:55', 'break end')
('3:37:44', '3:17:05', 'IITY?') 0.25
C2E101
['youtube', 'podcast', 'comment'] confidence
('0:00:01', '0:01:23', 'hello') 0.26
('2:26:16', '2:27:38', 'break start')
('2:45:33', '2:29:56', 'break end')
('3:38:36', '3:22:59', 'IITY?') 0.16
C2E102
['youtube', 'podcast', 'comment'] confidence
('0:00:00', '0:03:05', 'hello') 0.14
('1:39:20', '1:42:25', 'break start')
('1:59:24', '1:44:42', 'break end')
('3:26:44', '3:12:02', 'IITY?') 0.28
C2E103
['youtube', 'podcast', 'comment'] confidence
('0:00:00', '0:03:14', 'hello') 0.2
('1:50:29', '1:53:43', 'break start')
('2:07:01', '1:56:00', 'break end')
('3:31:15', '3:20:14', 'IITY?') 0.2
C2E104
['youtube', 'podcast', 'comment'] confidence
('0:00:00', '0:01:41', 'hello') 0.29
('1:14:11', '1:15:52', 'break start')
('1:31:04', '1:18:10', 'break end')
('3:55:19', '3:42:25', 'IITY?') 0.15
This is very exciting!
I tried varying the duration of the samples from the podcasts (one taken at the start of each episode, and one from the end) to see how it affected accuracy, confidence, and speed. I expected the accuracy and confidence to go down with significantly shorter sample durations, and for the confidence to soar up to 100% with longer samples, but I was surprised by the result:
The total matching times plotted in blue do not include fingerprinting or slicing. Slicing takes just a few seconds regardless of the sample duration. I only tried fingerprinting 6-minute segments of the YouTube audio, one from the beginning and one from the end of each episode.
I was surprised to find that the perfect accuracy never wavered, until I tried 2-second samples (not shown), which gave bad results. Confidence actually decreased for some reason when the sample duration was longer than 10 seconds. Total matching time increased steadily with sample duration, as expected.
From this little test, I would say that, technically, 5-second samples win, with 10-second samples tying for confidence, but taking about 56% longer. However, I only tested 5 episodes, and I am guessing the 10-second sample duration may be more robust to rare occurrences where a 5-second sample would not be unique enough. So, I favor 10-second samples, even if they take longer to match. It's still pretty fast, taking only about 15 seconds per episode!
Note that I had fingerprints from all 10 YouTube audio segments (beginnings and endings of 5 episodes) stored within one database. I think this could be easier for file management, backups, and sharing the database with GitHub Actions. It likely makes matching a bit slower, but it also avoids the overhead of starting up a new Docker container and database, optionally writing an exported backup to disk, and tearing down for each match. I'm not sure which approach (one database .tar per match/episode, or one .tar containing all episodes) would ultimately be faster, but I think the simplicity of one file is pretty valuable. A single file could be preserved as a GitHub Actions workflow artifact pretty easily. (Of course, the many .tar files of the one-database-per-episode could be bundled together too as a last step, but that's even more work.)
I started going through the episodes in order, beginning with C2E20, the first episode to have dynamic ads. I added the ability to programmatically write autosync-ed timestamps to data.json, and I have been saving and manually verifying the autosync-ed timestamps as I go. So far, I've gone up to C2E54. Right now the fixed timestamps are on the autosync branch only, but I will probably copy them over to master soon so that they can go live on the website.
I encountered just three problematic episodes so far:
I also found that I could get away with shortening the segment fingerprinted at the end of the YouTube audio, from 6 minutes down to 2, which saves time and space.
All of this is still being done in playground2.py, which is a terrible, ugly mess compared to your carefully crafted code! 🤣
Alright, I'll concede that trying to keep all fingerprints in a single database was a Bad Idea. 😅 I fingerprinted all of C2E20-C2E140 (which produced a ~1 GB .tar file); attempts to match using it frequently lead to "soft locks", and when that doesn't happen, performing a single match takes ages.
Now I'm trying what you suggested before, having a separate .tar file for each episode. Fingerprinting seems to be going somewhat slower, but matching is super fast. Since fingerprinting only needs to happen once, this seems to be the way to go. 👍
OK, 365d945 is working pretty well with each episode's fingerprints saved to its own .tar file.
Like the other live episodes I mentioned, C2E73 & C2E97 also have trouble finding the first timestamp thanks to the crowd.
I think I'd like to implement a system where the slice times can be configured differently for individual episodes. Those parameters could be stored in another JSON file. This will allow us to work around things like the live episodes, or other idiosyncrasies that might make one or a handful of episodes fail with slice times that otherwise work for most episodes. Campaign 1 is likely to need different slice times anyway.
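As a strawman for that configuration (the file name and keys below are hypothetical, nothing like this exists in the repo yet), per-episode overrides could sit on top of a set of defaults:

import json

# Hypothetical slice-time configuration: episodes not listed under "overrides" use the defaults.
slice_times = {
    "defaults": {
        "youtube_beginning_minutes": 6,    # fingerprint this much of the start of the YouTube audio
        "youtube_ending_minutes": 2,       # ...and this much of the end
        "podcast_sample_seconds": 10,      # length of each podcast sample to match
        "podcast_sample_start_minutes": 5  # where to cut the first podcast sample
    },
    "overrides": {
        # live shows with long pre-shows could fingerprint a larger beginning segment
        "C2E73": {"youtube_beginning_minutes": 12},
        "C2E97": {"youtube_beginning_minutes": 12}
    }
}

with open("slice-times.json", "w") as f:
    json.dump(slice_times, f, indent=4)

def get_slice_times(episode_id, config=slice_times):
    """Merge per-episode overrides onto the defaults."""
    return {**config["defaults"], **config["overrides"].get(episode_id, {})}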
By the way, the total size for the fingerprints of C2E20-C2E140 when stored in separate files is very similar to (actually a little smaller than) when all fingerprints were stored in a single .tar. Also, the directory of fingerprints for the 121 episodes can be compressed down to a 165 MB .zip file, which is pretty efficient.
I fixed every timestamp for C2E20 and later, thanks to the power of (semi-)automation! I'm so happy. I also cleaned up playground2.py a bit. It's still a work in progress.
I want to work on older episodes, but I encountered a new problem. When I try to auto-sync C2E1, I get nonsense. My hunch is that this is caused by the YouTube audio having one sample rate (44.1 kHz, the default in dejavu, and used for all C2E20+) and the podcast audio having another (48 kHz).
C2E1-C2E19 are part of the old Nerdist podcast feed for Campaign 1, and the videos belong to the Geek & Sundry YouTube channel rather than Critical Role. These differences in ownership seem to also come with differences in file format, as well as inconsistencies. C2E1 and C2E15 are different from the others in this set in that the YouTube audio files downloaded by youtube_dl with default settings have the sample rate used by later episodes (44.1 kHz). In contrast, the others have the less common sample rate (48 kHz), which they share with all podcasts in this set; these others also save with an .opus file extension when downloaded by youtube_dl with default settings, rather than an .m4a extension. I can probably request an .m4a version instead, and perhaps it will have the more common sample rate.
The problem remains that we may have some episodes that differ in sample rate between YouTube and podcast. I haven't started looking yet to see if dejavu can handle this.
Wooow you've been crazy busy! (For some reason I didn't catch the latest notifications on my phone). I'm amazed with the progress. Good stuff - good stuff! I'm still quite busy with other things so sorry for not being that helpful the last few days.
From this little test, I would say that, technically, 5-second samples win, with 10-second samples tying for confidence, but taking about 56% longer.
I would've really thought that longer samples would mean higher confidence but I guess not :sweat_smile:. Though nice that you tested that!
I fixed every timestamp for C2E20 and later, thanks to the power of (semi-)automation!
Awesome :relaxed:
When I try to auto-sync C2E1, I get nonsense. My hunch is that this is caused by the YouTube audio having one sample rate.
Weird... I really thought dejavu could handle that :thinking:. Might need some different configuration?
I'd like to get going with the GitHub Actions stuff. I'm not sure yet how that works if I try this on my fork, but I guess there's only one way to find out :wink:.
Could you give me some hints on how to use what tooling you built already? I'd like to propose the following workflows (run from within the cloud instance of GitHub Actions):
Realignment Workflow (run every week/month?): fetch the xml files from Anker.fm that you talked about, set up the dejavu dependencies, and run the realignment.
New Episode Workflow (run when specifically triggered with a new json containing the YouTube offsets and slicing instructions): set up the dejavu dependencies and process the new episode.

I'd like to get going with the GitHub Actions stuff. Not sure yet how that works if I try this on my fork but I guess there's only one way to find out 😉.
Wonderful! I've used GitHub Actions quite a bit for other projects, and, yes, it should be possible to get workflows to run on your fork.
Could you give me some hints on how to use what tooling you built already?
This should work, I hope:
cd critrolesync.github.io
vagrant up
vagrant ssh
cd /vagrant/src
sudo python3 playground2.py
Changing this line should allow you to control which episodes the code runs on. Uncommenting this line will give more information about what is happening.
Random hanging has been an annoying problem for me from the beginning. The script will just get stuck in random places at random times. Pressing Ctrl+c (multiple times if necessary) can kill the process. Generally, rerunning the script works fine, but occasionally a Docker container is left running, and then Python will complain about a port already being in use. When that happens, you need to kill the Docker container manually. Use sudo docker container ls to get the IDs of running containers, and then sudo docker container kill <id>.
Even with this intervention, I've had Docker processes running in the background, hogging my CPU. Restarting the VM helps then:
logout
vagrant halt
vagrant up
By far, the slowest step of fingerprinting (besides downloading -- sometimes a YouTube download will move at a snail's pace, and quitting and restarting will get you back to high speed) is inserting the hashes into the Postgres database managed by Docker. Actually creating the hashes for the audio is pretty fast. Furthermore, my impression is that hangups tend to happen on steps involving interaction with the database. Overall, it feels like the database is unstable and somewhat slower than it needs to be (though some of my recent changes made slowness less of an issue -- it's not too bad now). I'd love to find some performance improvements here.
I'd like to propose the following workflow (from within the cloud instance of GitHub Actions):
I like these a lot!
I'm not sure that GitHub Actions can commit to the same repo it's working on, but even if it can, my preference for now would be to have these workflows do no committing, at least not to master (opening a pull request would be pretty cool, though). Rather, once they finish, workflows can make files produced during the process available for manual download as artifacts. I want to inspect every change to data.json before it is published, and I'd like the workflow to generate reports for me to make inspection easier:
I've noticed that when there is a change to the dynamic ads, all podcasts tend to receive the same change. All of their total durations (info contained in the feeds) will increase or decrease by a fixed amount due to the ads changing in the same way for all episodes (e.g., in this feed diff, each of the "Seconds" increased by 12-13 seconds). I want to see that dejavu came up with timestamp changes that agree with this fixed amount. If I see that, I won't feel the need to manually double check every timestamp (which is very time consuming!).
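A report along those lines could be as simple as comparing the two shifts per episode (the function below is only a sketch with made-up names, not something in the repo):

def shifts_agree(old_feed_seconds, new_feed_seconds,
                 old_ads_end_seconds, new_ads_end_seconds, tolerance=2.0):
    """Return True if the change in feed duration matches the change in the
    dejavu-derived ads-end timestamp to within `tolerance` seconds."""
    duration_shift = new_feed_seconds - old_feed_seconds
    timestamp_shift = new_ads_end_seconds - old_ads_end_seconds
    return abs(duration_shift - timestamp_shift) <= tolerance

# e.g., the feed says the episode grew by 12 s, and autosync moved the ads-end
# timestamp from 1:48 to 2:00 (also 12 s): no manual check needed.
assert shifts_agree(14280, 14292, 108, 120)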
Since the fingerprint database is a large binary file, I want to store that somewhere other than in the GitHub repo. git is not the ideal tool for backing up large binary files. Rather, we may be able to find a solution where the fingerprints are saved on a storage provider like Google Drive, and the GitHub Actions workflow has access to it.
Great progress today:
- ffmpeg can easily convert the output to the standard 44.1 kHz. Very simple, with no noticeable speed cost.
- This also takes care of the .opus files I mentioned yesterday.
- Whenever the first YouTube timestamp was not 0:00:00, it was throwing things off. This was the real cause of the live shows not syncing properly, rather than the white noise of the crowd cheering. I made an incorrect assumption.

On to Campaign 1!
Furthermore, my impression is that hangups tend to happen on steps involving interaction with the database.
I can try to get rid of the Docker container altogether and install postgres natively in vagrant. Maybe that helps with the hangups.
#5 is rearing its ugly head for Campaign 1, ugh.
Fixed by 3a81aecf. Autosyncing is working for Campaign 1, and I am slowly working my way through it.
With aceda64, playground2.py will create just one database Docker container and reuse it for each Matcher object, so that only one Docker container is created for the entire script, rather than setting up and tearing down a container for each episode every few seconds. This greatly speeds up matching if downloading and fingerprinting are already done.
This change also seems to have improved stability somewhat, though not 100%. Coincidentally, I discovered that if I SSH into the VM using a second terminal while it's stuck and type a command like top, it will get unstuck. I'm not sure what's going on there, but I'm happy to have a workaround that lets it continue, rather than needing to start over!
I've been making more progress on cataloging episodes from Campaign 1, as well as all of Exandria Unlimited.
Ahhh perfect! I was halfway into the native installation inside vagrant but ran into some complications that I haven't managed to solve yet. But just running the database container continuously seems like a great solution.
Commit 55607d2 now makes it possible to run the script without Vagrant, using only Docker. This seems to be much more stable and even faster than before (after initial container images are downloaded).
With Docker and Docker Compose installed on your machine, just run
cd critrolesync.github.io
docker compose build
docker compose run python /bin/bash
python playground2.py
The original method using Vagrant still works:
cd critrolesync.github.io
vagrant up
vagrant ssh
cd /vagrant/src
sudo python3 playground2.py
I think that if I understood Docker container networking slightly better, the new code would have been quite a bit simpler (e.g., all of this could be removed, and Database could handle the container creation just as it does with Vagrant; perhaps Docker Compose could be eliminated altogether). However, after banging my head against the keyboard for several hours, I'm settling for something that works, even if it's a little more complicated. (I felt like I was probably just one or two parameter settings away from it working. Grrr, frustrating, haha!)
I think that if I understood Docker container networking slightly better, the new code would have been quite a bit simpler [...] (I felt like I was probably just one or two parameter settings away from it working. Grrr, frustrating, haha!)
Turns out my settings were all correct! I just needed to introduce a delay with sleep to give the spawned container enough time to connect to the network. After fixing that (bae02cf), I was able to make the simplifications I wanted (0ee31ad).
I added an entrypoint command to the Dockerfile (caef05b), and I changed the name of the main container, so the commands for launching via Docker only (without Vagrant) have changed a bit:
cd critrolesync.github.io
docker compose build
docker compose run autosync
I also decided to pin Python dependencies to versions known today to give correct results; we don't want Matplotlib changing its colormaps or whatever and suddenly all our archived fingerprints are no longer accurate! I tested that everything works with the latest versions of packages in Python 3.9 today, and pinned to those versions (e701727).
Hi @bauersimon,
I renamed playground2.py to __main__.py and moved it into the autosync subpackage. This means it can now be invoked using python -m critrolesync.autosync. I updated the Dockerfile to use this command, so nothing really changes if you are running it through Docker.
I merged all of the changes so far into the master branch (#18). Of course, I still think there is more to do. In particular, I want to be able to pass command line arguments to the script so that one does not need to edit the file to configure which episodes it runs on or with what settings (e.g., whether or not to re-fingerprint). Doing that would allow a GitHub Actions workflow to be built which accepts these parameters and runs synchronization in the cloud. There is certainly more refactoring that could be done too. I'm going to leave this issue open since I don't consider it completely resolved yet.
By the way, I built a GitHub Actions workflow, archive-podcast-feeds.yml, which checks for podcast feed changes every hour and opens a pull request automatically when it finds any. It would be wonderful if we could bring a similar level of automation to the execution of autosync. I foresee some difficulties there, since a certain level of human intervention may always be needed (e.g., initial determination of the YouTube timestamps, documentation of YouTube URLs), but I am hopeful we can do more.
Awesome progress! I'm happy that you got this far! Glad I could be of some help. I'm quite limited on time still so I'm not sure when I manage to get some more work done on this.
It's DONE! I've finally finished cataloging all of the C1 episodes, and autosync was a huge help. Ready for Campaign 3!
Since we last communicated, I added command line argument parsing (b9a8179), so now it will be possible to run docker compose run autosync C3E1 (after manually cataloging the YouTube timestamps, of course) to determine the podcast timestamps. It accepts flags for re-downloading audio sources, re-slicing, and re-fingerprinting.
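For anyone reading along, an interface of that shape could be built with argparse roughly like this (a sketch only; the flag names below are illustrative, not necessarily the ones in b9a8179):

import argparse

# Illustrative sketch only; the real CLI lives in the critrolesync.autosync package.
parser = argparse.ArgumentParser(
    prog='critrolesync.autosync',
    description='Determine podcast timestamps for the given episodes')
parser.add_argument('episodes', nargs='+', help='episode IDs, e.g. C3E1')
parser.add_argument('--download', action='store_true',
                    help='re-download audio sources even if they already exist')
parser.add_argument('--slice', action='store_true',
                    help='re-slice audio samples even if they already exist')
parser.add_argument('--fingerprint', action='store_true',
                    help='re-fingerprint YouTube slices even if fingerprints already exist')
args = parser.parse_args()
print(args.episodes, args.download, args.slice, args.fingerprint)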
A sophisticated alternative to human labor (#1) for syncing new and old episodes would be to use an automated algorithm executed weekly:
I could possibly build something like this to run in the cloud, particularly in my programming language of choice (Python), but I haven't found a library yet for performing the audio matching (step 3). If you know of one, please let me know!
This approach would hopefully allow all future episodes to be synced automatically, without a human needing to manually catalog timestamps.
This may also be a viable solution to #6.