XMLTV / xmltv

Utilities to obtain, generate, and post-process TV listings data in XMLTV format
GNU General Public License v2.0

Improving speeds by adding parallelism #138

Closed: fugkco closed this issue 3 years ago

fugkco commented 3 years ago

Hello all,

I've been using tv_grab_uk_tvguide and it is great! Thank you all for your hard work.

I'm curious: I've noticed that tvguide (and I'm assuming other components too) runs the scrapers sequentially, which causes the guide to take a long time to download. As an anecdotal example, it took about 10 minutes on a 4c/8t CPU (an Intel i7-7700HQ, a relatively beefy CPU), but it takes roughly the same time in CI environments with significantly fewer resources, so compute clearly isn't the bottleneck.

As mentioned, from what I'm seeing in the output the grabber downloads pages one at a time, and I wonder whether it would be possible to add some parallelism so that several pages can be downloading at once, making the overall guide load faster. I don't know a lot of Perl, but having looked at the source, there seems to be little to no support for this as it stands, so I can imagine adding this feature would require significant effort.

Anyway, I'd like to request this as a feature if at all possible. I think it would be a great addition.

Thanks

honir commented 3 years ago

Speed when crawling/scraping websites is an issue that is frequently misunderstood.

I've seen claims made by software providers that their scraper is "really fast" or the "fastest you can get". They are totally missing the point. Scraping websites is generally prohibited by copyright (unless the terms specifically allow it), so the last thing you want to do is put your hand up and say, "look at me, look at me, I'm filching your website data". That will just get your IP address, and possibly everyone else's, blocked.

Even when the Ts&Cs permit downloading you should still not retrieve web pages as fast as your computer will do it. That's just un-neighbourly (and could resemble a DoS attack).

If you run a website, look at the logs for when it is crawled by one of the major bots: you will see they limit page retrieval to roughly one request every 3 or 4 seconds. Obviously Googlebot, Bing, etc. could go a lot faster, but they choose not to. Ask yourself why.
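For illustration only, a deliberately throttled fetch loop in Perl might look like the sketch below. The URLs, user-agent string and 3-second delay are placeholders, not anything taken from the grabber itself.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Placeholder URLs; a real grabber builds these from its channel config.
my @urls = (
    'https://example.com/listings/day1',
    'https://example.com/listings/day2',
);

my $ua = LWP::UserAgent->new( agent => 'polite-grabber/0.1' );

foreach my $url (@urls) {
    my $response = $ua->get($url);
    warn "failed to fetch $url: " . $response->status_line . "\n"
        unless $response->is_success;
    sleep 3;    # roughly the pacing the big crawlers use
}
```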

A Google search for "website scraping best practices" will give some clues.

Also consider why you want it to be faster. For the data retrieved here (i.e. for xmltv) it does not matter whether it takes 5 minutes or 5 hours: schedule the run overnight and the data will be ready for you in the morning.

The TVGuide programme schedules rarely change more than once a day, so fetch the data overnight and store it.

If you want faster retrieval, then I suggest you subscribe to a paid-for TV schedule service which offers direct data download (no scraping).

Slow and steady is the best approach to use. The idea is not to thrash the target website: that will likely just get you blocked altogether.

fugkco commented 3 years ago

Totally understood; I've done my fair share of site scraping in a previous role. I'm not looking to max out the concurrency, just to speed up the process: my thinking is to fetch, say, 2 pages at a time instead of one. Much of the slowness is waiting on network I/O, so two workers seems reasonable and should roughly double the speed without putting much extra load on the website.
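To make the idea concrete, here is a rough sketch of two workers with a per-request pause, assuming the page URLs could be listed up front. Parallel::ForkManager and the example URLs are purely illustrative here, not how the grabber currently works.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;

# Hypothetical page list; the real grabber builds its URLs internally.
my @urls = map { "https://example.com/listings/page$_" } 1 .. 10;

my $pm = Parallel::ForkManager->new(2);   # at most two fetches in flight
my $ua = LWP::UserAgent->new;

foreach my $url (@urls) {
    $pm->start and next;                  # parent moves on; child fetches
    my $response = $ua->get($url);
    warn "failed to fetch $url: " . $response->status_line . "\n"
        unless $response->is_success;
    sleep 3;                              # keep a polite per-request delay
    $pm->finish;                          # child exits
}
$pm->wait_all_children;
```

Since the workers are separate processes, the fetched pages would still need to be written to disk or handed back to the parent (e.g. via Parallel::ForkManager's run_on_finish callback) before the grabber could parse them.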

Running it overnight isn't the issue. I've tried doing so on a Raspberry Pi (via TVHeadend), but it seems to fail due to the sheer number of items, so I resorted to running it on GitHub Actions; however, scraping all the channels takes over 6 hours, which causes the workflow to time out.

Anyway, I do understand your concerns, and if it's not something you're comfortable implementing, then no worries! I'll leave it to you to decide on closing this issue. Thanks for taking the time!

honir commented 3 years ago

If you double the speed (e.g. by reducing the delay or adding threads) then the traffic starts to look less like a human browsing the website and more like a bot, which increases your chance of getting your IP blocked. Load on the target website isn't the issue: showing your hand as a scraper bot is the thing you want to avoid.

Likewise, downloading data for a couple of hundred channels just screams "bot!". This particular grabber wasn't intended for such industrial-scale use. (And it almost certainly breaks TVGuide's Ts&Cs.)

If you want to get hundreds of channels, and very quickly, then I can recommend the two grabbers [1] which access the Schedules Direct data service. A Schedules Direct subscription (for personal use) costs under £20 p.a. and is very reliable. There are a few quirks with the data (e.g. one or two missing UK channels) but the quality is generally good.

[1] tv_grab_zz_sdjson_sqlite or tv_grab_zz_sdjson