JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0
4.39k stars, 702 forks

IndexError of Instagram #520

Open QihanWangCo opened 2 years ago

QihanWangCo commented 2 years ago

Hi, I want to use snscrape to collect Instagram data. My code is:

import snscrape.modules.instagram as sninstagram
import pandas as pd

query = 'google'  # hashtag to search for; change as needed
ins_s = []
limit = 10
for ins in sninstagram.InstagramHashtagScraper(query).get_items():
    print(vars(ins))  # print the first post's fields, then stop
    break

And I got this error:

jsonData = r.text.split('<script type="text/javascript">window._sharedData = ')[1].split(';</script>')[0] # May throw an IndexError if Instagram changes something again; we just let that bubble.
IndexError: list index out of range

How can I fix it?

mettsal commented 2 years ago

Same issue here. I've had working code since 05/07/22 with basically the same structure as yours, and it ran fine. Until today, that is: now it breaks at the same line (I believe), during the .get_items() call in the for loop.

Also adding another part of the error that may be related to the issue: "_logger.warning(f'Page does not exist')".


[...]/Python310/lib/site-packages/snscrape/modules/instagram.py

    106 def get_items(self):
--> 107     r = self._initial_page()
    108     if r.status_code == 404:
    109         _logger.warning(f'Page does not exist')

JustAnotherArchivist commented 2 years ago

As the comment there suggests, this is due to changes on Instagram's side. They recently overhauled their site a bit. The scraper needs to be adapted to those changes.
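For anyone wondering why a site change surfaces as an IndexError rather than a clearer message, here is a minimal sketch of the failure mode. It assumes the scraper pulls embedded JSON out of the page by splitting on a marker string; the marker strings are modeled on the `text/javascript` script tag discussed in this thread, and the two HTML snippets are made-up stand-ins, not real Instagram responses.

```python
# Assumed marker strings (illustrative); the HTML below is invented for the demo.
START = '<script type="text/javascript">window._sharedData = '
END = ';</script>'


def extract_shared_data(html: str) -> str:
    """Return the text between START and END by splitting on both markers."""
    # If START is absent, split() returns a one-element list and [1] fails.
    return html.split(START)[1].split(END)[0]


old_page = '<html>' + START + '{"entry_data": {}}' + END + '</html>'
new_page = '<html><body>no embedded data</body></html>'

print(extract_shared_data(old_page))  # {"entry_data": {}}

try:
    extract_shared_data(new_page)
except IndexError as exc:
    print(f'IndexError: {exc}')  # IndexError: list index out of range
```

When the marker is missing, `str.split` returns the whole page as a single list element, so indexing `[1]` fails. That matches the symptom reported above and suggests the page no longer embeds the data in that form.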

QihanWangCo commented 2 years ago

> As the comment there suggests, this is due to changes on Instagram's side. They recently overhauled their site a bit. The scraper needs to be adapted to those changes.

Thanks for your answer! Really looking forward to the adaptation!

kallewesterling commented 2 years ago

Any updates on this yet? Curious if we can help somehow!

TheTechRobo commented 1 year ago

> Any updates on this yet? Curious if we can help somehow!

If you're a programmer, you could send a fix via the "pull requests" feature (or just by suggesting a fix!).

kallewesterling commented 1 year ago

Yeah, I know how GitHub works — just wanted to know whether there is any active development happening elsewhere on this particular issue.

barisulgen commented 1 year ago

Is this a dead repo now?

JustAnotherArchivist commented 1 year ago

No, but there hasn't been anything worth saying.

This issue, along with any other Instagram or Facebook issues, is effectively blocked by their silly rate limits. They make development of the corresponding scrapers very annoying since rapid testing is very tricky. I haven't had time to look into possible workarounds to make that less unpleasant and less time-intensive. So for now, those scrapers are unfortunately poorly supported by me. I'll happily consider PRs though.
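Until the scraper is adapted, a calling script can at least fail gracefully instead of crashing. The sketch below is one possible approach, not part of snscrape: `take_items` is a hypothetical helper that collects up to `limit` items from any iterator and keeps whatever it gathered if an IndexError bubbles up. The two demo generators are stand-ins so the example runs without network access.

```python
from itertools import islice
from typing import Callable, Iterable, List, TypeVar

T = TypeVar('T')


def take_items(make_items: Callable[[], Iterable[T]], limit: int) -> List[T]:
    """Collect up to `limit` items; keep what was collected if scraping breaks."""
    collected: List[T] = []
    try:
        for item in islice(make_items(), limit):
            collected.append(item)
    except IndexError as exc:
        # The scraper lets IndexError bubble when the page format changes.
        print(f'Scraper failed, possibly due to a site change: {exc}')
    return collected


def broken_items():
    """Stand-in for a scraper that yields one post and then breaks."""
    yield 'post-1'
    raise IndexError('list index out of range')


print(take_items(broken_items, 10))              # ['post-1']
print(take_items(lambda: iter(range(100)), 3))   # [0, 1, 2]
```

With snscrape installed, the same helper could wrap the call from the original post, for example `take_items(lambda: sninstagram.InstagramHashtagScraper('google').get_items(), 10)`. This only hides the crash; it does not make the scraper return data again.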

purut18 commented 1 year ago

Hey @JustAnotherArchivist, I'm trying to solve this issue. Can you share what we're looking for in the source code returned?

Is it a JSON link or plain JSON? Currently, there is no script with the type "text/javascript" returned by Instagram.

It would be great if you could share what was being stored in "jsonData" before this error appeared. Thanks!

JustAnotherArchivist commented 1 year ago

@purut18 I don't recall the exact format etc., but it was basically some context information (profile, hashtag, location, etc.) and the first page of posts, I believe.

purut18 commented 1 year ago

Well... nothing like that is being returned in the source code of Instagram now. (If someone else can confirm this, please?)

I think Instagram changed it or moved to dynamic rendering to prevent scraping :/

0bmay commented 1 year ago

I am working on a fix for Instagram. So far, searching by user and by hashtag is working. Location will be soon™️

kallewesterling commented 1 year ago

In #1001?

0bmay commented 1 year ago

For logged-out users, location searches always return a single page of data, and there is a pretty strict rate limit on getting data from the platform. But data is returned, for now.

feusagittaire commented 1 year ago

@0bmay I keep getting "IndexError: list index out of range" when running "for post in sns.InstagramHashtagScraper(query).get_items()". How can I resolve this? ;/

TheTechRobo commented 1 year ago

@feusagittaire The pull request hasn't been merged to snscrape yet

feusagittaire commented 1 year ago

> For logged-out users, location searches always return a single page of data, and there is a pretty strict rate limit on getting data from the platform. But data is returned, for now.

Tysm for that! If I may ask, will it be implemented any time soon?

TheTechRobo commented 1 year ago

@feusagittaire Until the pull request is merged, you should be able to do a pip install -U git+https://github.com/0bmay/snscrape@insta_fix to install their copy of snscrape.

feusagittaire commented 1 year ago

tysm for the tip!!