aklinker1 opened 2 years ago
`yt-dlp` cannot be executed in the browser, so if we wanted to use it directly, we would need a Node.js microservice dedicated to getting subtitles ⛔

So I looked into how `yt-dlp` works to see if we can do the same in JS, but with only what we need (subtitles from the sites Anime Skip supports). A few observations:
- `FFmpeg` can extract subtitles from video files, and there is also a really cool WASM port of FFmpeg. BUT, AFAIK we would need to download the entire video file in the browser in order to extract the subtitles, which I am not sure is possible, let alone practical...
- `opensubtitles` has a HUGE database of subtitles in varying languages, with a public API and a couple of wrappers on npm:

  > Your consumer can query the API on its own, and download 5 subtitles per requesting IP per 24 hours
It would be a bit surprising to me if we could not get subtitles directly from the platforms. Going direct has the obvious downside that we need to maintain and support each platform individually, but the upside that we do not rely on a 3rd party API that may (or may not) have the subtitle file we need. So I looked a bit further into how we might get access to the subtitles directly, starting with CR Beta.
The only other anime streaming service I pay for is Funimation, so I moved on to see if I could find subtitles in its network requests and...

This is really great news - I suspect that unless the anime is hardsubbed, we might be able to sniff the network requests and find what we are looking for! Regardless, I moved on to a fallback option, that being opensubtitles.
Search results can be returned as XML by appending `/xml` to the search URL, so we could parse them with something like fast-xml-parser. They request that you contact them before using this feature to write your own software; I am confident that at most they would just ask that we have a VIP membership, which is 10 euro / year. I have no problem paying that myself if their service fits our needs perfectly.
Interesting. I had a feeling getting subtitles wasn't gonna be easy. Microservices aren't out of the question, but it seems like they would be a pain to maintain, just like a JS implementation per site. As of now, the third party APIs seem like the way to go, is that right?

What are your thoughts at this point? Still think this is a path worth going down, or is it still too early? I'll defer judgment to you since you've done all the research.
> Microservices aren't out of the question

I'd say we can consider committing to that once we have confirmed that we can derive accurate timestamps from the subtitles...

> third party APIs seem like the way to go, is that right?

Yeah, I intend on investigating opensubtitles more while also trying to find other possibilities.

> What are your thoughts at this point? Still think this is a path worth going down, or is it still too early?

I am still hopeful that we can make a great feature out of this without too much effort, but I'll keep you updated!

Note: I just updated the research comment and am really excited to get your input on it!
> Research update 1
Hmm, good to know that they're available at some endpoint via a network request. In chrome extensions, there are 2 ways to read network requests:
The first is the web request APIs (https://developer.chrome.com/docs/extensions/reference/webRequest/). However, these have been severely limited by Manifest V3's restrictions; specifically, they can't read response bodies. That might not be a problem: if we see a request that ends in `.vtt`, we can make the request ourselves and get the captions.
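A minimal sketch of what that first approach might look like - `isSubtitleRequest` is a hypothetical helper, and its URL patterns are guesses based on the subtitle URLs we've seen so far, not a complete list:

```javascript
// Hypothetical matcher for subtitle requests. The extensions and the
// CR-style `/evs/...*.txt` pattern are assumptions.
function isSubtitleRequest(url) {
  return /\.(vtt|srt|ass)(\?|$)/.test(url) || /\/evs\/.+\.txt\?/.test(url);
}

// Background-script wiring (needs the "webRequest" permission plus host
// permissions for each streaming site). MV3 can't read the response body
// here, so we re-fetch the URL ourselves.
if (typeof chrome !== "undefined" && chrome.webRequest) {
  chrome.webRequest.onCompleted.addListener(
    (details) => {
      if (isSubtitleRequest(details.url)) {
        fetch(details.url)
          .then((res) => res.text())
          .then((vtt) => console.log("captions:", vtt.slice(0, 80)));
      }
    },
    { urls: ["<all_urls>"] }
  );
}
```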
The second is using a `script` block to inject JS into the page's context and overwrite the `fetch` and XHR calls. With that we can read the response for every request. That approach doesn't require additional permissions: if we're already injecting code into the pages that make the requests, we don't need to request more host permissions. The downside is that this is difficult to do and can be blocked by pages' CSP headers, not to mention the privacy issues with us accessing all the responses for a page's network requests.
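A rough sketch of what that page-context patch could look like - a guess at the shape, not a hardened implementation; `installFetchSniffer` and its callback are hypothetical names, and a real version would need to wrap XHR as well:

```javascript
// Patch `fetch` in the page's context so responses can be inspected.
// `installFetchSniffer` is a hypothetical name; the `.vtt/.srt/.ass`
// pattern is an assumption about what subtitle URLs look like.
function installFetchSniffer(onSubtitleResponse) {
  const originalFetch = globalThis.fetch;
  globalThis.fetch = async function (...args) {
    const response = await originalFetch.apply(this, args);
    const url = args[0] instanceof Request ? args[0].url : String(args[0]);
    if (/\.(vtt|srt|ass)(\?|$)/.test(url)) {
      // Clone so the page can still consume the original body.
      onSubtitleResponse(url, response.clone());
    }
    return response;
  };
}
```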
Since the first option requires additional permissions, the second approach would be preferred. That said, I'd like to avoid intercepting network requests at all - it will be difficult to do, hard to maintain, and it will not last long as sites adopt better CSP practices.
I'm not going to maintain that or accept PRs for that, so I'd recommend you stop going down that route. That doesn't mean this still won't work. Is the URL that loads the captions consistent? If so, can we just look that URL up and make the request ourselves?
> opensubtitles.org

I'm curious how many anime they have translations for. Would it be possible to take a sample of episodes that are available on Anime Skip and see if the data we store on episodes works with their API and returns actual results? That way we can get an idea of what percentage of episodes this approach would work for.

> They request that you contact them before using this feature to write your own software

If it's viable, I'll front the costs and reach out to them. They might be a useful API for my future plans for Anime Skip as well... 😏
> The web request APIs.

Is the biggest drawback to this the web request permissions? If so, could we use optional permissions and only ask for the additional permissions if the user enables `Experimental: Predict Intro Timestamps`?

I ask because this seems to be the easiest way to get the subtitles for every anime that has subtitles available. Tools like `yt-dlp` have to build the URLs themselves, while we, as the browser extension, can see them directly. If I am missing some complexity, please call me out on it, but it seems like we could use one regex per supported platform that looks for the subtitle request, then make the request ourselves. We won't have access to a lot of headers via the web request API, but we only need the URL; it looks like the authorization-related IDs are passed as query params, e.g. this Crunchyroll Beta URL:
https://v.vrv.co/evs/46e6d2eb853a0ce66af36372c5d1b1a5/assets/55eptj8r3io0rbh_114153.txt?Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cCo6Ly92LnZydi5jby9ldnMvNDZlNmQyZWI4NTNhMGNlNjZhZjM2MzcyYzVkMWIxYTUvYXNzZXRzLzU1ZXB0ajhyM2lvMHJiaF8xMTQxNTMudHh0IiwiQ29uZGl0aW9uIjp7IkRhdGVMZXNzVGhhbiI6eyJBV1M6RXBvY2hUaW1lIjoxNjQxOTg1NDUwfX19XX0_&Signature=qY6m45FImtZBK8xqFJHPmP28CUEkkxtpXnS8g6SzDhkgIAnvAwy3kpaIyJ5NPNgOHomadisu2DBpqWV2AlcUqcf~VbjH8Z4kMLbmQ7nLKlCWHOVurZu5Wx8Mm9vTWLohfbC7JH-p0r-NBmp2VxC-erIWet4PDIbUZv2De9v~fLcJUhVUnhFomhxqVe20fPxZIsjeDpP69xFU9GRCeMJey2LhevdTHj7vB~l2fk9pcg9As17FMJExcbbFg6s5jcEnkBe082ZJKd~hIo2M4um8ShriHURTY1o20uD6fHhoezegks~7gZ7YtIJaefhS2tjGNbRhXbEmVRkAMuUHVQ6qfw__&Key-Pair-Id=APKAJMWSQ5S7ZB3MF5VA
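If the query params really do carry all the auth (`Policy`, `Signature`, and `Key-Pair-Id` in the URL above), the standard URL API is enough to check for them before replaying a request ourselves - `hasSignedQueryAuth` is just a hypothetical helper for that check:

```javascript
// Check whether a sniffed URL carries its auth material as query params
// (so replaying it needs no special headers). The param names come from
// the Crunchyroll Beta URL above; other platforms will differ.
function hasSignedQueryAuth(rawUrl) {
  const params = new URL(rawUrl).searchParams;
  return (
    params.has("Policy") && params.has("Signature") && params.has("Key-Pair-Id")
  );
}
```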
> Using a script block

I totally see your points as to why this is less than ideal, thanks for outlining all this!

> I'd recommend you stop going down that route

Just to clarify, is "that route" = any solution that involves listening to network requests? I still have some hope for option 1, but other than that I am perfectly happy to ditch the network request ideas entirely!

> Can we just look that URL up and make the request ourselves?

Yes, they are consistent, but building them is tricky... as you can see from the CR Beta example above. Funimation is a bit easier than CR, but there are still a couple of IDs that I don't know how to get. Even if we figure it all out and are able to build the URLs, they could change in the future, and that would be super annoying. In a nutshell, we probably could with a fair amount of effort, but with option 1 we also get the upside that we could grab subtitles for any episode of anime that has them.
> see if the data we store on episodes will work with their API and return actual results?

If by "data we store" you mean stuff like the anime name and season & episode numbers, I totally get what you mean, and my initial response is yes, we can get results! We can make a request like the one below:
https://www.opensubtitles.org/en/search/sublanguageid-eng/season-1/episode-2/moviename-attack+on+titan/xml
Then parse that response, pick a subtitle file, and download it.
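Building that search URL from the data we already store could be as simple as the sketch below - a guess at the mapping; `buildSearchUrl` and the input shape are made up for illustration, and the path segments just mirror the example URL above:

```javascript
// Hypothetical builder for the opensubtitles.org XML search URL from the
// data Anime Skip already stores (show name, season, episode). Not an
// official client; the URL structure is copied from the example above.
function buildSearchUrl({ name, season, episode, language = "eng" }) {
  const movieName = name.trim().toLowerCase().replace(/\s+/g, "+");
  return (
    "https://www.opensubtitles.org/en/search" +
    `/sublanguageid-${language}` +
    `/season-${season}` +
    `/episode-${episode}` +
    `/moviename-${movieName}` +
    "/xml"
  );
}
```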
> what percent of episodes this approach would work for?

Interesting, I see how this is useful information, and I will try to get some figures for you!
> 😏

😮
Reducing the number of clicks to contribute is important, and if you're not watching a show that has timestamps yet, it would be nice if the extension could just find and skip the intro for you.
One place we could potentially find the intro would be the text tracks/subtitles - when there's over 90s of silence, the intro could be hiding there. To increase accuracy, we'll have to use other context clues to narrow that prediction down (previous intro duration, 3rd party services, looking for the show title to mark the end, etc.).
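That silence heuristic could be sketched roughly like this - `findSilentGaps`, the cue shape, and the 90s threshold are all assumptions from the paragraph above, not a finished design:

```javascript
// Given parsed subtitle cues ({ start, end } in seconds), return the gaps
// between cues long enough to hide an intro. 90s default threshold is an
// assumption from the discussion above.
function findSilentGaps(cues, minGapSeconds = 90) {
  const sorted = [...cues].sort((a, b) => a.start - b.start);
  const gaps = [];
  for (let i = 1; i < sorted.length; i++) {
    const gap = sorted[i].start - sorted[i - 1].end;
    if (gap >= minGapSeconds) {
      gaps.push({ start: sorted[i - 1].end, end: sorted[i].start });
    }
  }
  return gaps;
}
```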
For now though, let's just add a setting titled `Experimental: Predict Intro Timestamps`. When it's toggled on, show the possible time ranges for the intro somehow, whatever is easiest (notification, text overlay, etc.).

Once we've confirmed the potential for finding intros, and we have an idea of its accuracy and how many false positives it produces, we can move on to improving the accuracy. If it doesn't seem promising, we'll scrap it.