ScriptSmith / instamancer

Scrape Instagram's API with Puppeteer
http://adamsm.com/instamancer
MIT License

[BUG] After scraping around 800 hashtags Instamancer reloads the browser #35

Closed Daniel-Griffiths closed 4 years ago

Daniel-Griffiths commented 5 years ago

Describe the bug

When scraping hashtags, Instamancer has recently started failing after roughly 800 posts (this is fairly consistent). At that point it restarts the browser and tries again from scratch.

It seems to be related to this line of code: https://github.com/ScriptSmith/instamancer/blob/07e664ea6b144f6d304c4c2cc2f7e957f53fa4f7/src/api/instagram.ts#L419

Specifically, it is the this.start() call that causes the browser to reload.

Looking at the network logs in Chrome, I can see that one of the GraphQL requests returns an error around the 800-post mark. Every request after that one seems to work fine.

To Reproduce

Search for any hashtag, and make sure the limit is higher than 800.

Setup (please complete the following information):

I will add more info here as I debug the issue further.

ScriptSmith commented 5 years ago

In my initial attempts to reproduce this, I am able to gather 1000 posts from a hashtag.

The restarting process you describe is what I call grafting. It allows instamancer to perform long scraping jobs by restarting the browser in order to limit resource usage. You can read about it on the website:

Because using a browser consumes lots of memory in large scraping jobs, Instamancer employs a new scraping technique called grafting. It intercepts and saves the URL and headers of each request, and then after a certain number of interactions with the page it will restart the browser and navigate back to the same page. Once the page initiates the first request to the API, its URL and headers are swapped on-the-fly with the most recently saved ones. The scraping continues without incident because the response from the API is in the correct form despite being for the incorrect data.

and in the FAQ:

What happens if I disable grafting?

Chrome / Chromium will eventually decide that it doesn't want the page to consume any more resources and future requests to the API will be aborted. This usually happens between 5k-10k posts regardless of the memory available on the system. There doesn't seem to be any combination of Chrome flags to avoid this.

This bug could be caused by something going wrong when instamancer attempts a graft, that is, when it swaps request parameters on the fly after the browser has been restarted.
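To make the failure mode concrete, here is a minimal, dependency-free sketch of the grafting idea described above. It is not instamancer's actual code: `SavedRequest`, `GraftState`, `markRestarted` and `intercept` are hypothetical names, and the URLs are placeholders. It only models the bookkeeping (save the latest request, then swap it into the first request after a restart), not the browser itself.

```typescript
// Simplified model of grafting. Every name here is hypothetical and
// is not taken from instamancer's source.

interface SavedRequest {
    url: string;
    headers: Record<string, string>;
}

class GraftState {
    private lastSeen: SavedRequest | null = null;
    private pendingGraft = false;

    // Call once the browser has been restarted and the page reloaded.
    markRestarted(): void {
        this.pendingGraft = this.lastSeen !== null;
    }

    // Decide what is actually sent for an outgoing API request.
    // The first request after a restart is replaced by the saved one;
    // every other request passes through unchanged and is saved.
    intercept(request: SavedRequest): SavedRequest {
        if (this.pendingGraft && this.lastSeen !== null) {
            this.pendingGraft = false;
            return this.lastSeen;
        }
        this.lastSeen = { url: request.url, headers: { ...request.headers } };
        return request;
    }
}

const state = new GraftState();
state.intercept({ url: "https://example.invalid/api?after=cursor1", headers: { "x-demo": "1" } });
state.intercept({ url: "https://example.invalid/api?after=cursor2", headers: { "x-demo": "2" } });

// Browser restarts; the fresh page asks for the first page of results again.
state.markRestarted();
const grafted = state.intercept({ url: "https://example.invalid/api?after=", headers: { "x-demo": "fresh" } });
console.log(grafted.url); // https://example.invalid/api?after=cursor2
```

If the swap in `intercept` did not happen for some reason, the fresh page's own request would go out instead, and scraping would start again from the first page of results.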

So, a few questions:

Daniel-Griffiths commented 5 years ago

Hi @ScriptSmith

Thanks for the detailed response! I didn't even notice the FAQ document; that will be super handy.

From what I recall, when grafting was triggered and the browser restarted, it started scraping the hashtags from the very beginning, which would put it into an infinite loop.

I will confirm this after I finish work and try to get a reproducible example. I will also answer the two questions you posted.

Daniel-Griffiths commented 5 years ago

Example failed requests with grafting disabled:

endpoint: https://www.instagram.com/graphql/query/?query_hash=174a5243287c5f3a7de741089750ab3b&variables=%7B%22tag_name%22%3A%22rebelgal%22%2C%22first%22%3A12%2C%22after%22%3A%22QVFCZndxMUV2QXlQalMyTVJ5ZUFqUDVraGRhc20wTmJfNkthMlZYa3kwSGZUODJid3JRWHp6VmQ2VUIxRTRNRWRzU0kzVlVCT0o2VER3SWVmWWl2Z3RHdg%3D%3D%22%7D

image

These failures happen roughly every 800 requests, and this is with grafting disabled.
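For anyone debugging the same thing, the `variables` parameter of the endpoint above can be decoded to see exactly which page of results the failing request asked for. This is only a sketch: `decodeVariables` and the `TagQueryVariables` shape are my own names, inferred from this one URL and not from any documented API.

```typescript
// Shape inferred from the single URL in this issue; an assumption, not a spec.
interface TagQueryVariables {
    tag_name: string;
    first: number;
    after: string;
}

// Pull the percent-encoded `variables` parameter out of the query string
// and parse the JSON it contains.
function decodeVariables(rawUrl: string): TagQueryVariables {
    const match = /[?&]variables=([^&]*)/.exec(rawUrl);
    if (match === null) {
        throw new Error("URL has no variables parameter");
    }
    return JSON.parse(decodeURIComponent(match[1])) as TagQueryVariables;
}

const endpoint =
    "https://www.instagram.com/graphql/query/?query_hash=174a5243287c5f3a7de741089750ab3b&variables=%7B%22tag_name%22%3A%22rebelgal%22%2C%22first%22%3A12%2C%22after%22%3A%22QVFCZndxMUV2QXlQalMyTVJ5ZUFqUDVraGRhc20wTmJfNkthMlZYa3kwSGZUODJid3JRWHp6VmQ2VUIxRTRNRWRzU0kzVlVCT0o2VER3SWVmWWl2Z3RHdg%3D%3D%22%7D";

const variables = decodeVariables(endpoint);
console.log(variables.tag_name); // rebelgal
console.log(variables.first);    // 12
```

The `after` value is the pagination cursor, so comparing it across failing requests shows whether they all fail at the same point in the feed.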

ScriptSmith commented 5 years ago

I think that error is caused by Chrome cancelling requests due to resource limitations. Try cloning this repo and changing the value of jumpMod in src/api/instagram.ts to 50. That should cause grafting to be initiated sooner and more often.
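As a rough illustration of why a smaller value should help, here is a simplified model. It assumes jumpMod acts as a modulus on a counter of page interactions, which is a simplification rather than a copy of the real logic. `shouldGraft` and `graftPoints` are made-up names and do not exist in the repo.

```typescript
// Hypothetical model: graft whenever the interaction count is a
// positive multiple of jumpMod.
function shouldGraft(interactions: number, jumpMod: number): boolean {
    if (jumpMod <= 0 || Math.floor(jumpMod) !== jumpMod) {
        throw new RangeError("jumpMod must be a positive integer");
    }
    return interactions > 0 && interactions % jumpMod === 0;
}

// List the interaction counts at which a graft would be triggered.
function graftPoints(totalInteractions: number, jumpMod: number): number[] {
    const points: number[] = [];
    for (let i = 1; i <= totalInteractions; i++) {
        if (shouldGraft(i, jumpMod)) {
            points.push(i);
        }
    }
    return points;
}

console.log(graftPoints(200, 50)); // [ 50, 100, 150, 200 ]
```

Under that assumption, a value of 50 means the browser is restarted well before the page has accumulated enough state for Chrome to start cancelling requests.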

ScriptSmith commented 4 years ago

Did you get a chance to try out the fix?

Daniel-Griffiths commented 4 years ago

Sorry @ScriptSmith, I have not had a chance to try it. I will close this issue for now and reopen it if I can get any further info.