Komet / MediaElch

Media Manager for Kodi
https://mediaelch.github.io/mediaelch-doc/about.html
GNU Lesser General Public License v3.0

[Feature Request] Crunchyroll as a scraper #1541

Open tamodolo opened 1 year ago

tamodolo commented 1 year ago

The main idea is to use the CR API to get info for series. I don't know how much of this is possible, but I know that yt-dlp can read some data from it, so that is probably a good place to start.

bugwelle commented 1 year ago

Hi,

does Crunchyroll have an official API? I didn't find one and as far as I've read, it requires a Premium Account (though I haven't found that, either).

Regards, Andre

tamodolo commented 1 year ago

Hi!

CR has an official API (the one they use for their console and Android apps). It is undocumented, and I think the best way to implement this is to look at how yt-dlp works. I think yt-dlp probably parses the HTML, as it was partially broken when CR updated to the new site. The only app I know of that truly uses the console API is the Kodi one, as it didn't break with the web page update: https://github.com/MrKrabat/plugin.video.crunchyroll

About the premium account: CR is a streaming service like Netflix, so premium is only necessary to watch content. The API gives limited access when a normal account logs in. I don't know how limited, but just being able to see the metadata would be enough.

The main reason for supporting this, even if a premium account is needed, is that CR has localized metadata for many shows. I was looking at the JSON that yt-dlp dumps, and it has links to thumbnails in many resolutions, descriptions, and season numbers (though I think the season numbers aren't that accurate; CR recently changed how they handle this and I need to take another look). The problem is that yt-dlp's output is incomplete. I noticed that when I tried to write a JSON-to-.nfo parser: series info is completely absent.

For people dumping CR, yt-dlp can record the episode's unique ID in the filename. This can be used to pull metadata accurately. For everyone else, a name search is possible (the Kodi plugin can do that).
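To illustrate the filename idea, here is a minimal Python sketch. It assumes the files were saved with yt-dlp's default output template, `%(title)s [%(id)s].%(ext)s`, so the ID is the last bracketed token before the extension. The function name `episode_id_from_filename` and the example filename are made up for illustration, and a custom output template would need a different pattern.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: yt-dlp's default output template "%(title)s [%(id)s].%(ext)s"
# was used, so the ID is the last "[...]" group before the file extension.
_ID_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]$")


def episode_id_from_filename(filename: str) -> Optional[str]:
    """Return the bracketed ID at the end of the file stem, or None."""
    stem = Path(filename).stem.strip()
    match = _ID_PATTERN.search(stem)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical filename; the ID is a placeholder, not a real episode.
    print(episode_id_from_filename("Some Show E1 [GABC12345].mp4"))
```

Files without a bracketed ID return `None`, in which case a scraper would have to fall back to a name search.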

bugwelle commented 1 year ago

Awesome! Thanks for the links and pointers. :-)

I will have a look at it, but can't give you a time frame. It depends on how easy it is to use and test.

For testing purposes (to be notified if something breaks), could you provide me with 2-3 TV shows and links to their web pages? I need their IDs and a page where I can look up the correct details. If you know of some differences between certain shows (e.g. different titles in different languages), IDs/links of those TV shows would be very much appreciated.

The most time-consuming task when adding a scraper is finding proper TV shows for testing and setting up those tests. Writing the actual scraper is rather easy if the page has a proper API. :-)

Kind regards, Andre


It also seems that the Kodi plugin requires a Premium account:

WARNING: You MUST be a PREMIUM member to use this plugin!

So I'll have to see how to test it. :)

tamodolo commented 1 year ago

No problem. I'll try to explain what I currently understand about how CR handles things at the web page level, as I don't know how the Kodi plugin works. (I'm Brazilian, so the screenshots will be in Portuguese...)

[image]

Look at the URL. The part I marked is the unique ID. In this example it is a series ID.

This is the entire URL: https://www.crunchyroll.com/pt-br/series/G24H1N3MP/mushoku-tensei-jobless-reincarnation

For practical use you can break it down as follows:

Main page: https://www.crunchyroll.com

Page language: /pt-br (I think you can grab the description in a given language using this)

Content type: /series (I think they don't actually use anything else. I'll try to find a movie example, but I am fairly certain that this doesn't change at all)

Unique ID: /G24H1N3MP (this can be a series, a playlist or a single episode)

About the unique ID: for episodes, the subtitled version has all the languages, while dubs have a different ID for each language. I don't think this is important for metadata only. Probably, by setting the desired language code, one could dump the description in any language, but I don't know if this is correct as I can't test it. Also, you can set yt-dlp to write the episode ID in the filename and use that to look up the episode metadata with 100% precision. The other method is the Kodi default, which could return errors, as I don't know whether CR respects the correct season numbers.

Anything after the ID isn't important. The URL works with just the parts described above, like this: Mushoku Tensei: https://www.crunchyroll.com/pt-br/series/G24H1N3MP
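The URL layout described above can be sketched as a small Python parser. This is only an illustration based on the example URLs in this thread. The names `CrunchyrollUrl` and `parse_crunchyroll_url` are made up, the set of content types (`series`, `watch`) is taken from the links posted here and may be incomplete, and handling of URLs without a language segment is an assumption.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

# Content types seen in the example links of this thread; there may be others.
KNOWN_TYPES = {"series", "watch"}


@dataclass
class CrunchyrollUrl:
    language: Optional[str]  # e.g. "pt-br"; None if the URL has no language part
    content_type: str        # "series" or "watch"
    content_id: str          # e.g. "G24H1N3MP"


def parse_crunchyroll_url(url: str) -> Optional[CrunchyrollUrl]:
    """Split a Crunchyroll web URL into language, content type and ID."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    language = None
    # Assumption: a leading segment that is not a content type is the language.
    if parts and parts[0] not in KNOWN_TYPES:
        language = parts.pop(0)
    if len(parts) < 2 or parts[0] not in KNOWN_TYPES:
        return None
    # Anything after the ID (the title slug) is ignored.
    return CrunchyrollUrl(language, parts[0], parts[1])


if __name__ == "__main__":
    print(parse_crunchyroll_url(
        "https://www.crunchyroll.com/pt-br/series/G24H1N3MP/mushoku-tensei-jobless-reincarnation"
    ))
```

Because the slug is ignored, the short and the long form of a link yield the same ID.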

For movies, I don't think they have a main page; the link goes directly to the movie itself. Its ID can probably be treated as an episode unique ID.

For series with more than one season, CR does this:

[image]

Some time ago they showed all available dubs in this list; now they show them as a selection inside the video player. The structure of the IDs didn't change at all, as selecting any dub takes you to the dedicated page for that ID.

Series to test:

A normal series: Mushoku Tensei: https://www.crunchyroll.com/pt-br/series/G24H1N3MP/mushoku-tensei-jobless-reincarnation

A series with a movie: Demon Slayer: https://www.crunchyroll.com/pt-br/series/GY5P48XEY/demon-slayer-kimetsu-no-yaiba

A series with a separate entry for dubs:
My Hero Academia sub: https://www.crunchyroll.com/pt-br/series/G6NQ5DWZ6/my-hero-academia
My Hero Academia dubs: https://www.crunchyroll.com/pt-br/series/GYNV9DP2R/my-hero-academia-dubs

A movie only: Origin: Spirits of the Past: https://www.crunchyroll.com/pt-br/watch/GYE52493R/origin-spirits-of-the-past

A series available for free accounts: Horimiya: https://www.crunchyroll.com/pt-br/series/G9VHN9P43/horimiya

bugwelle commented 1 year ago

Hi,

I had a look at the scraper on the weekend and there is one thing that's bothering me: It seems that the Kodi plugin uses a hard-coded session ID: https://github.com/MrKrabat/plugin.video.crunchyroll/blob/main/resources/lib/api.py#L44

If so, the sentence in the README (regarding that you need to be a premium user) seems like a disclaimer "just in case". Personally, I don't want to use API tokens from other projects.

To me, it's a blocker. Unless premium users can generate their own API key (which I haven't checked, though), I think I won't implement a Crunchyroll scraper.

On the other hand, I found a library that suggests that login via username/password is doable: https://github.com/crunchy-labs/crunchyroll-rs/blob/1055fca9864e5aaf7517e09e600d1e35a3d03c96/src/crunchyroll.rs#L558

If that's possible, a Crunchyroll scraper is doable, but needs more work, because MediaElch does not yet have the internal structure for username/password combinations. We currently get all tokens at start-up. As far as I can see, the token expires rather fast, so we need a better refresh mechanism first. We also need a secure storage mechanism for usernames/passwords. Plaintext is a no-go (though it is currently used for some tokens).
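To make the refresh idea concrete, here is a rough Python sketch of a token that is fetched lazily and renewed shortly before it expires. It is not MediaElch code and does not talk to Crunchyroll: the class `RefreshingToken`, the `fetch` callback and the 30-second margin are all assumptions chosen for illustration. How the token is actually obtained (and how credentials are stored) is deliberately left out.

```python
import time
from typing import Callable, Optional, Tuple


class RefreshingToken:
    """Caches a short-lived token and refetches it shortly before expiry."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],  # returns (token, lifetime in seconds)
        margin_seconds: float = 30.0,            # assumed safety margin
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a valid token, calling fetch() only when needed."""
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token, lifetime = self._fetch()
            self._expires_at = now + lifetime
        return self._token


if __name__ == "__main__":
    # Placeholder fetch function; a real one would perform the login request.
    token = RefreshingToken(lambda: ("example-token", 300.0))
    print(token.get())
```

Compared to fetching all tokens at start-up, each request would call `get()` and only trigger a new login when the cached token is about to expire.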

Regards, Andre

tamodolo commented 1 year ago

Hi!

Makes sense. As a user of the Kodi plugin, the info it gets is actually my info. Even the queued series are mine. This token may be something hard-coded by CR themselves; I once had a script that decoded encrypted subtitles, so this could be the key used to decrypt things. It could also be why this plugin doesn't break when everything else that tries to decode CR, like yt-dlp, does. I wonder what that token is for...

That aside, CR is now temporarily banning IPs if you access it too many times in a short period (they ban you for 30 minutes or so after 40 requests). My script to list series no longer works, so I rewrote it to just grab info from the series page (I finished this yesterday). I was a bit surprised that I could see info without going deeper. So the possibility of grabbing info without logging in exists. I tested this with yt-dlp without providing any username and was still able to retrieve information. Try this command:

"D:\Anime Tools\yt-dlp\yt-dlp.exe" --flat-playlist --print-to-file %(season)s;%(episode_number)s;%(webpage_url)s "list.cvs" https://www.crunchyroll.com/pt-br/series/GKEH2G428/bofuri-i-dont-want-to-get-hurt-so-ill-max-out-my-defense --paths "C:\animetemp1"

Change the directories to fit your needs. I still can't get the series description with this method, as this isn't the focus of yt-dlp.
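The file written by the command above has one line per episode in the form `season;episode_number;webpage_url`. As a sketch, it could be read back in Python like this. The names `EpisodeEntry` and `parse_listing` are made up, and the sample line in the usage part is a placeholder, not real output. Splitting from the right is a precaution in case a season name contains a semicolon.

```python
from typing import List, NamedTuple, Optional


class EpisodeEntry(NamedTuple):
    season: str
    episode_number: Optional[int]  # None if the field is not a plain number
    url: str


def parse_listing(text: str) -> List[EpisodeEntry]:
    """Parse lines of the form 'season;episode_number;webpage_url'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so a ';' inside the season name is preserved.
        fields = line.rsplit(";", 2)
        if len(fields) != 3:
            continue  # skip malformed lines
        season, number, url = fields
        episode = int(number) if number.isdigit() else None
        entries.append(EpisodeEntry(season, episode, url))
    return entries


if __name__ == "__main__":
    # Placeholder line for illustration only.
    sample = "Some Season;1;https://www.crunchyroll.com/pt-br/watch/GXXXXXXXX/example\n"
    for entry in parse_listing(sample):
        print(entry)
```

The URL of each entry contains the episode ID, so it can be used for an exact lookup instead of a name search.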

I'll keep you informed if I find any other unknown behavior.