andreekeberg / instagram-scraper

Instagram scraper, with support for users and tags
https://packagist.org/packages/andreekeberg/instagram-scraper
MIT License

The scraper doesn't work on a hosted site. #1

Open PYovchevski opened 4 years ago

PYovchevski commented 4 years ago

Hello, it seems that the scraper doesn't work on hosted servers. On localhost it works, but when I upload my website to the hosting provider it stops working.

andreekeberg commented 4 years ago

Hi @PYovchevski, that's unfortunate to hear! Have you had any issues using other scrapers on the same server? It might be that Instagram is blocking the IP of the server, which is hard to work around without using a proxy.

I'm aware of the issue of scrapers being blocked by Instagram, and have actually only had success using this one when other scrapers had been blocked, so I haven't needed to update the way the requests are made.

However, since this is an issue in general, and now for you using this package, I will speed up the process of releasing a new version, where the requests are done using cURL with some extra headers like User-Agent, to try to work around the blocking.

I'm leaving this issue open and will comment here when the new version is released, and hopefully it will solve your problem. I also recommend watching this repository to get instantly notified about new versions.

BramImhof commented 4 years ago

Hi @andreekeberg, adding the User-Agent won't solve this problem. I've tried the same with other scrapers. I think the problem is related to the IP address or hostname, but unfortunately I am not sure (yet).

andreekeberg commented 4 years ago

Hi @BramImhof! Yeah, I've been trying some different approaches here, as I've started having some issues on one of the servers I use it on now as well.

Will release a new version soon that uses cURL and some basic UA spoofing, since it's better than not sending any headers at all, but I'm aware that the root problem isn't solved by this.

I have an idea for scraping the site without getting blocked: a separate server behind a firewall does all the scraping as a scheduled task and continuously sends the results to an API endpoint on the actual web server. That should stop Instagram from figuring out that the scraping server isn't a regular browser (since all local requests seem to work).

However, this isn't really something that could be built into this package, seeing that it's more a matter of server configuration. Thought I'd mention it though, in case you would find it interesting!
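
To make the idea a bit more concrete, here is a rough sketch of the push side of that setup. It is written in Python rather than PHP purely for illustration, and everything in it is an assumption: the endpoint URL, the bearer token, the `{"posts": [...]}` payload shape, and the `build_push_request` and `run_job` names are hypothetical and not part of this package.

```python
import json
import urllib.request

# Hypothetical endpoint on the public web server; not part of the package.
API_ENDPOINT = "https://example.com/api/instagram-cache"


def build_push_request(endpoint, posts, token):
    """Build a POST request that carries the scraped posts as JSON."""
    body = json.dumps({"posts": posts}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed shared secret so the endpoint can reject other callers.
            "Authorization": "Bearer " + token,
        },
    )


def run_job(scrape, send, endpoint=API_ENDPOINT, token="changeme"):
    """One scheduled run: scrape on this machine, then push the result."""
    posts = scrape()
    return send(build_push_request(endpoint, posts, token))


if __name__ == "__main__":
    # Stand-in scraper returning fixed data; a real job would call the scraper here.
    fake_scrape = lambda: [{"id": "1", "caption": "example"}]
    # Printing instead of sending keeps the sketch free of network access.
    run_job(fake_scrape, lambda req: print(req.get_method(), req.full_url, req.data))
```

In a real deployment `send` would be something like `urllib.request.urlopen`, and the script would be run by a scheduler such as cron on the scraping machine, so the public web server only ever reads the cached results and never contacts Instagram itself.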

Will definitely post an update with my results using this technique.

andreekeberg commented 4 years ago

Version 1.1.0 is now released, which sends a randomized User-Agent header with every request, and while it does not fix the mentioned problem, it's still better than not sending any headers at all.
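
For anyone curious what sending a randomized User-Agent looks like in general, here is a minimal sketch. It is in Python for illustration and is not the package's actual PHP/cURL code; the User-Agent strings, the extra `Accept-Language` header, and the `build_request` name are assumptions made for the example.

```python
import random
import urllib.request

# Illustrative browser User-Agent strings (assumed; not the package's own list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def build_request(url, rng=random):
    """Return a GET request with a User-Agent picked at random per call."""
    headers = {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    req = build_request("https://www.instagram.com/explore/tags/example/")
    # Only the chosen header is printed; nothing is sent over the network.
    print(req.get_header("User-agent"))
```

As noted in this thread, this only changes how the request presents itself. It does nothing about the server's IP address, which seems to be what Instagram is actually blocking.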