Open tobimori opened 11 months ago
Oh, thanks for letting me know it's broken. I haven't used it in a while, so I hadn't noticed. I'll try to fix it when I have time. I guess it's still going to work, but maybe a bit slower. The last time I tried to scrape it, I found a trick on Linux that made it easy to scrape; maybe that still works.
Any updates on this?
Also interested in an update to this!
Is an update planned?
Apparently Anisearch recently changed their internal search engine structure, presumably because of heavy scraping. In addition, numerous Playwright requests from the same IP now result in a temporary ban of that IP for about a week, so it is now crucial to set a proper user_agent header to avoid getting banned. It seems advisable to randomize the requests across several different user agent strings. Sooner or later it may also be necessary to configure several proxies, both for further randomization and to spoof the originating IP of the scraping requests.
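A minimal sketch of the user agent and proxy randomization described above, assuming Playwright's Python sync API. The helper names `build_context_options` and `fetch_html` are hypothetical, and the user agent strings and proxy addresses are illustrative placeholders, not values taken from the script. Depending on the Playwright version and platform, a per-context proxy on Chromium may also require a proxy to be set at launch.

```python
import random
from typing import Optional, Sequence

# Illustrative user agent strings (assumption: any realistic, current strings will do).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]


def build_context_options(
    user_agents: Sequence[str],
    proxies: Sequence[str] = (),
    rng: Optional[random.Random] = None,
) -> dict:
    """Pick a random user agent (and proxy, if any) for one browser context."""
    chooser = rng or random
    options = {"user_agent": chooser.choice(list(user_agents))}
    if proxies:
        # Playwright expects the proxy as a dict with a "server" key.
        options["proxy"] = {"server": chooser.choice(list(proxies))}
    return options


def fetch_html(url: str, proxies: Sequence[str] = ()) -> str:
    """Load one page in a fresh context with randomized settings."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context(**build_context_options(USER_AGENTS, proxies))
        page = context.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
        return html
```

Creating a new context per request keeps each user agent and proxy choice isolated, at the cost of some startup time. Randomization alone may not prevent a ban; spacing requests out is probably still needed.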
In addition, the original URL used in the script is no longer valid, and a valid referer from the same domain is now required as well. So far I haven't managed to find the new quicksearch URL pattern, so adding extra_http_headers and a valid referer to the page.goto() call alone doesn't suffice.
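For reference, this is roughly what the header and referer part looks like with Playwright's Python sync API. It is only a sketch: the helper names `same_domain_referer` and `open_with_referer` are hypothetical, the `Accept-Language` value is an arbitrary example, and the target URL has to be passed in by the caller because the new quicksearch URL pattern is still unknown.

```python
from urllib.parse import urlsplit


def same_domain_referer(url: str) -> str:
    """Return the site root of `url`, to be used as a same-domain referer."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {url!r}")
    return f"{parts.scheme}://{parts.netloc}/"


def open_with_referer(context, url: str):
    """Open `url` in a Playwright browser context with extra headers and a referer.

    `context` is assumed to be a playwright.sync_api.BrowserContext.
    """
    page = context.new_page()
    # Example extra header only; which headers the site checks is not known.
    page.set_extra_http_headers({"Accept-Language": "en-US,en;q=0.9"})
    # The referer passed to goto() takes precedence over one set via extra headers.
    response = page.goto(url, referer=same_domain_referer(url))
    return page, response
```

Checking `response.status` on the returned response is a quick way to see whether a candidate URL is accepted once the headers and referer are in place.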