Hamuko / cum

comic updater, mangafied
Apache License 2.0

Feature Request: Polling Waiter #78

Open aethrys opened 4 years ago

aethrys commented 4 years ago

I've been meaning to reply to issue #73 for a while, but I thought I'd just go ahead and open a feature request instead.

It doesn't look like cum uses a waiter, or at least I haven't crawled over the scraper functions enough to find one. If a waiter were used between page scrapes, I'm sure many of the connectivity concerns would be kept at bay (it's never perfect when it ends up being a cat-and-mouse game). A waiter is typically used to keep from constantly pegging your external resources, so they don't end up effectively DoS'ed. The idea is to wait a random interval between checks, both to give yourself a better chance to "get in" when multiple reads are happening at once (so you're not predictable) and to look a bit more like an actual person.

The environment I'm most familiar with is AWS (other cloud providers do it too), where they expect one if you're using their APIs to poll for JSON, etc. With them, if you're "bursty" too often you'll be rate-limited, and if you're consistently polling you'll be IP-banned. It's just something you expect to implement when polling over the web... Cloudflare is no different here.
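For reference, the pattern those providers usually recommend is retrying with exponential backoff plus jitter rather than a fixed wait. A minimal sketch of the idea (the fetch_json helper, the URL handling, and the 429-only check are just illustrative, nothing from cum):

    # Sketch: exponential backoff with jitter for a rate-limited API.
    # fetch_json and max_attempts are illustrative placeholders, not cum code.
    import random
    import time

    import requests

    def fetch_json(url, max_attempts=5):
        delay = 1.0
        for attempt in range(max_attempts):
            r = requests.get(url)
            if r.status_code != 429:
                # not rate-limited: fail loudly on other errors, else return the payload
                r.raise_for_status()
                return r.json()
            # rate-limited: wait exponentially longer each time, with jitter so
            # concurrent clients don't retry in lockstep
            time.sleep(delay + random.uniform(0, delay))
            delay *= 2
        raise RuntimeError('still rate-limited after {} attempts'.format(max_attempts))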

In Mangadex's case, I've had pretty good success by just adding a sleep call at random intervals to the download() method in mangadex.py. There are Python libraries with fancier features, but a waiter really is just waiting a random amount of time...


    def download(self):
        if getattr(self, 'r', None):
            r = self.r
        else:
            r = self.reader_get(1)

        chapter_hash = self.json['hash']
        pages = self.json['page_array']
        files = [None] * len(pages)
        # This can be a mirror server or data path. Example:
        # var server = 'https://s2.mangadex.org/'
        # var server = '/data/'
        mirror = self.json['server']
        server = urljoin('https://mangadex.org', mirror)
        futures = []
        last_image = None
        with self.progress_bar(pages) as bar:
            for i, page in enumerate(pages):
                if guess_type(page)[0]:
                    image = server + chapter_hash + '/' + page
                else:
                    print('Unknown image type for url {}'.format(page))
                    raise ValueError
                ### simple waiter (needs `from random import randrange` and
                ### `from time import sleep` at the top of mangadex.py):
                ### every other page, pause 1-5 seconds before the next request
                if i % 2 == 0:
                    sleep(randrange(1, 6))
                ###
                r = requests.get(image, stream=True)
                if r.status_code == 404:
                    r.close()
                    raise ValueError
                fut = download_pool.submit(self.page_download_task, i, r)
                fut.add_done_callback(partial(self.page_download_finish,
                                              bar, files))
                futures.append(fut)
                last_image = image
            concurrent.futures.wait(futures)
            self.create_zip(files)

It was a quick five-minute test, but it's held up for a few months now on my server. Every other page it waits one to five seconds before requesting the next. Before that, I'd have issues with multi-chapter batches or groups of more than 15-ish pages. So far, I've been fine with scrapes involving 40-80 chapters at around 20-30 pages each. Anything above those values can still give me trouble, but the cron job usually catches it on the next run. I haven't messed with it much, but increasing the upper end of the sleep range has been more reliable... it just takes longer to complete, of course.

Waiting between page or chapter scrapes would be a nice addition, but it's not a perfect solution. I've also been briefly looking into picking up at the exact page where Cloudflare cuts me off and an exception is raised. If the scraper could track which page it was on when an error happens, wait a few seconds (maybe over a minute), and then pick up where it left off, I think it would be fairly reliable.
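Roughly what I have in mind, just as a sketch: retry the same page after a wait instead of failing the whole chapter. The helper name, retry count, and delays below are made up for illustration and aren't wired into cum's download loop:

    # Sketch: retry/resume a single page download after an error, with a wait
    # that grows on each attempt. get_page, retries, and the delays are
    # illustrative placeholders, not part of cum.
    import random
    import time

    import requests

    def get_page(url, retries=3):
        for attempt in range(retries):
            try:
                r = requests.get(url, stream=True)
                if r.status_code == 200:
                    return r
                r.close()
            except requests.RequestException:
                pass
            if attempt < retries - 1:
                # wait before retrying the same page: a few seconds at first,
                # up to a minute or more before the final attempt
                time.sleep(random.randrange(5, 20) * (attempt + 1) ** 2)
        raise ValueError('gave up on {} after {} attempts'.format(url, retries))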

With all of this running regularly in the background, I've got a couple of other quick hack jobs going. One is splitting my madokami and mangadex downloads into separate processes, since madokami requests have stayed active more reliably than mangadex ones (and I want to limit bottlenecks). The other is simply waiting a random interval between downloads, one series at a time: 2-7 seconds for series without any updates, 40-80 seconds for series with actual scrapes going on, as sketched below. With all of that, I've been pretty happy with how much and how often I can pull from mangadex now.
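The series-level waits are just a thin wrapper around the per-series update loop, roughly like this (update_series and series_list are placeholders for whatever actually drives cum here, not its real API):

    # Sketch of the series-level waits: short random pauses between series with
    # nothing new, much longer ones after a series that actually downloaded
    # chapters. update_series and series_list are placeholders, not cum's API.
    from random import randrange
    from time import sleep

    def update_series(series):
        # Placeholder: run the update/download for one series and return the
        # list of chapters that were actually downloaded (empty if none).
        return []

    def update_all(series_list):
        for series in series_list:
            downloaded = update_series(series)
            if downloaded:
                # a real scrape happened: wait 40-80 seconds before the next series
                sleep(randrange(40, 81))
            else:
                # nothing new: just a short 2-7 second pause
                sleep(randrange(2, 8))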

So, just some ideas for anyone to look into or try.