TheEnigmaBlade / holo

Episode discussion bot for /r/anime.
https://reddit.com/r/anime
MIT License

[WIP] Daisuki scrapper #13

NewLunarFire closed 8 years ago

NewLunarFire commented 8 years ago

Hi. So I wrote a preliminary scraper. I couldn't test it because I couldn't find a suitable database. I've verified the code I wrote through the Python console.

There might still be bugs. I'll weed them out if I can get holo running on my computer, but I would appreciate it if someone with an already working setup could test this.

TheEnigmaBlade commented 8 years ago

Thanks for submitting a pull request! Just to cover a few points, including your question on #9:

The date stored by every Episode is the publishing date. Some sites, like Crunchyroll and Funimation, add episodes to their systems early to create "coming soon" messages with timers counting down to the episode airing. Other sites publish them to their API when the episode is released. If a publish date is missing from an API, simply fill it in with a new UTC-aligned datetime instance for the current time (datetime.utcnow()).
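A minimal sketch of that fallback, for illustration only. The helper name `resolve_publish_date` and the ISO-style date format are assumptions, not part of Holo or any service's API. On recent Python versions `datetime.utcnow()` is deprecated, so the sketch builds the equivalent naive UTC value from `datetime.now(timezone.utc)`.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve_publish_date(raw_date: Optional[str]) -> datetime:
    """Hypothetical helper: parse an API publish date, or fall back to now (UTC)."""
    if raw_date:
        # Assumed format for illustration: "2016-07-01T15:30:00", already in UTC.
        return datetime.strptime(raw_date, "%Y-%m-%dT%H:%M:%S")
    # No publish date in the API response: use the current time as a naive
    # UTC datetime, the same value datetime.utcnow() would return.
    return datetime.now(timezone.utc).replace(tzinfo=None)
```

A service handler could call this once per episode record, so every Episode ends up with a usable date whether or not the API supplied one.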

With regard to the code you submitted, I would like to avoid HTML scraping as much as possible. Support for parsing HTML only exists within Holo for sites without an API or with very poor or incomplete APIs (as in the case of MAL). Daisuki has an API in use with their apps, the base links for which are in a comment at the top of daisuki.py. I haven't looked at the availability of information in too much depth, but my impression was that enough was available.

NewLunarFire commented 8 years ago

I just looked into the Daisuki API links. All of the information from the web is available except for episode titles, but I don't think they're really necessary. I'll update the scraper to use the API instead.

TheEnigmaBlade commented 8 years ago

Just so you're aware, I refactored get_latest_episode into the superclass to remove duplicate code. Derived services now only need to implement a get_all_episodes function.
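Roughly, the shape of that refactor could look like the sketch below. Only `get_latest_episode` and `get_all_episodes` are names from this thread. The class names, the `Episode` fields, the `stream` parameter, and the rule for picking the "latest" episode (highest number) are assumptions for illustration and may not match Holo's actual code.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Episode:
    # Hypothetical minimal episode record.
    number: int
    date: datetime  # publish date


class ServiceHandler(ABC):
    """Hypothetical base class; derived services implement get_all_episodes only."""

    @abstractmethod
    def get_all_episodes(self, stream) -> List[Episode]:
        ...

    def get_latest_episode(self, stream) -> Optional[Episode]:
        # Shared logic lives here so derived services do not duplicate it.
        # Assumption: "latest" means the highest episode number.
        episodes = self.get_all_episodes(stream)
        return max(episodes, key=lambda e: e.number, default=None)


class ExampleService(ServiceHandler):
    """Stand-in service returning fixed data instead of calling a real API."""

    def get_all_episodes(self, stream) -> List[Episode]:
        return [
            Episode(1, datetime(2016, 7, 1)),
            Episode(2, datetime(2016, 7, 8)),
        ]
```

With this layout, a new service such as the Daisuki handler would subclass the base class and provide one method; `get_latest_episode` comes for free.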