ScriptSmith / instamancer

Scrape Instagram's API with Puppeteer
http://adamsm.com/instamancer
MIT License

[FEATURE] User Profile Text Information Search #26

Closed: bwyyoung closed this issue 3 years ago

bwyyoung commented 5 years ago

https://www.instagram.com/spyvonne_chloe/

With Instamancer, I can't find a way to retrieve the text in a user's profile, such as the bio in the example above: "Yvonne & Chloe FTWM . Cancer survivor since 2007. Www.Supreme-Parents.com"

Is there a way to retrieve this text using instamancer's search?

ScriptSmith commented 5 years ago

It's not really something this project is intended for, but I'll consider adding it.

In the meantime, you can do this with a plugin, written here in ES2018 TypeScript:

import { IPlugin, IPluginContext, createApi } from "instamancer";

// Shape of the part of window._sharedData that holds the profile
type PageData = { entry_data: { ProfilePage: [{ graphql: { user: {} } }] } }

class UserData<PostType> implements IPlugin<PostType> {
    constructionEvent(this: IPluginContext<UserData<PostType>, PostType>) {
        // Keep a reference to the original start() so it still runs first
        const oldStart = this.state.start

        this.state.start = async () => {
            await oldStart.bind(this.state)()
            // Read the data Instagram embeds in the loaded page
            const data: PageData = await this.state.page.evaluate(() => {
                //@ts-ignore
                return window["_sharedData"]
            })
            console.log(data.entry_data.ProfilePage[0].graphql.user);
            // Stop the scraper once the profile data has been logged
            await this.state.forceStop(true)
        }
    }
}

// Scrape the "user" endpoint for one profile with the plugin attached
const user = createApi("user", "spyvonne_chloe", {
    plugins: [
        new UserData(),
    ],
})

user.start()

bwyyoung commented 5 years ago

I got this error: (node:47481) UnhandledPromiseRejectionWarning: TypeError: Cannot read property '0' of undefined

bwyyoung commented 5 years ago

I found the issue. Instagram requires a login after several calls of the plugin you wrote. It works initially, and I am able to retrieve the graphql data. However, after several calls it returns an HTML page asking the user to log in.

Is there any way to resolve this without using a plugin? I also tried the method below, but it eventually stops working if I stay on the same IP address: https://learnscraping.com/scraping-instagram-profile-data-with-nodejs/

If I change IP, like through mobile phone hotspot, it works again.
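
The `Cannot read property '0' of undefined` error is consistent with `entry_data.ProfilePage` being missing when a login page is served instead of a profile. A minimal sketch of a guard that returns `null` in that case rather than throwing; `extractProfileUser` is a hypothetical helper (not part of instamancer), and the `biography` field and `LoginAndSignupPage` key are assumptions about Instagram's page data:

```typescript
// Hypothetical helper, not part of instamancer: safely pull the profile
// object out of whatever window._sharedData contained.
type ProfileUser = { biography?: string; [key: string]: unknown };

function extractProfileUser(sharedData: unknown): ProfileUser | null {
    const pages = (sharedData as any)?.entry_data?.ProfilePage;
    // A login page has no ProfilePage entry, so report "no profile"
    if (!Array.isArray(pages) || pages.length === 0) {
        return null;
    }
    const user = pages[0]?.graphql?.user;
    return user && typeof user === "object" ? (user as ProfileUser) : null;
}

// Example inputs (both shapes are assumptions for illustration)
const profilePage = {
    entry_data: { ProfilePage: [{ graphql: { user: { biography: "example bio" } } }] },
};
const loginPage = { entry_data: { LoginAndSignupPage: [{}] } };

console.log(extractProfileUser(profilePage)?.biography);
console.log(extractProfileUser(loginPage));
```

Inside the plugin, the result of `page.evaluate` could be passed through a guard like this, so a `null` can be treated as "blocked, try again later" instead of crashing.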

bwyyoung commented 5 years ago

Is there a way to integrate user profile scraping into Instamancer? If not, do you have a rough idea of how to do it with Puppeteer, or how to adapt your code to add this function?

ScriptSmith commented 5 years ago

Well, you're probably being rate limited because you're asking too much of Instagram. Make sure you're not doing anything else with Instagram in the background, and try sleeping 5 seconds between scraping each profile.
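
A rough sketch of what sleeping between profiles could look like. `scrapeSequentially` and the `scrapeOne` callback are hypothetical names, not instamancer functions; `scrapeOne` stands in for whatever scrapes a single profile, and the 5000 ms default mirrors the 5 seconds suggested above:

```typescript
// Resolve after the given number of milliseconds
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: scrape profiles one at a time, pausing between them.
// `wait` is injectable so the delay can be replaced in tests.
async function scrapeSequentially<T>(
    usernames: string[],
    scrapeOne: (username: string) => Promise<T>,
    delayMs = 5000,
    wait: (ms: number) => Promise<void> = sleep,
): Promise<T[]> {
    const results: T[] = [];
    for (let i = 0; i < usernames.length; i++) {
        results.push(await scrapeOne(usernames[i]));
        // No need to pause after the last profile
        if (i < usernames.length - 1) {
            await wait(delayMs);
        }
    }
    return results;
}
```

Running the profiles strictly one after another also keeps only one browser session active at a time, which fits the advice not to do anything else with Instagram in the background.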

bwyyoung commented 5 years ago

Already tried sleeping. The same problem happens when using the plugin.

However, instamancer itself works just fine. I am still able to see the whole JSON that was output from instamancer, but the plugin method does not work.

If Instagram were rate limiting me, shouldn't instamancer fail as well? Is there a way around this problem by doing things the way instamancer does?

ScriptSmith commented 5 years ago

I'm unable to reproduce rate-limiting when sleeping between users. Does the following example work for you?

https://gist.github.com/ScriptSmith/b437b33c4f2005eb197f63c3a28f9dab

bwyyoung commented 5 years ago

Thank you for the example. It worked initially, and I tested it from a new IP address. However, after about 1 hour of sleep-based requests for user info, Instagram blocks further requests and asks the user to log in.

It seems that this method only works temporarily and isn't very reliable.

ScriptSmith commented 5 years ago

Well, I don't think what you're after is really possible without either logging in, or sleeping longer and 'hibernating' for a while when Instagram rate limits you. You might have some luck with the other tools listed at the bottom of the README, but I doubt they'd be any better.
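
One way to sketch the 'hibernating' idea is a retry loop whose pause doubles after each consecutive block. Everything here is an assumption for illustration: `hibernateDelayMs` and `withHibernation` are hypothetical helpers, the one-minute base and one-hour cap are arbitrary, and `task` is expected to resolve to `null` when a login page was served:

```typescript
// Hypothetical helper: pause length after the n-th consecutive block,
// doubling each time and capped at maxMs.
function hibernateDelayMs(attempt: number, baseMs = 60_000, maxMs = 3_600_000): number {
    return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical helper: retry `task` until it yields a non-null result,
// hibernating between attempts. `wait` is injectable for testing.
async function withHibernation<T>(
    task: () => Promise<T | null>,
    maxAttempts = 5,
    wait: (ms: number) => Promise<void> = (ms) =>
        new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T | null> {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        const result = await task();
        if (result !== null) {
            return result;
        }
        // Skip the pause after the final failed attempt
        if (attempt < maxAttempts - 1) {
            await wait(hibernateDelayMs(attempt));
        }
    }
    return null;
}
```

Whether any particular schedule avoids the login wall is untested here; the thread only establishes that a fixed short sleep stopped working after about an hour.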

bwyyoung commented 5 years ago

I see. Is it not possible to obtain profile information with the way Instamancer works? This is because despite being rate limited, I can still use Instamancer's technique to obtain hashtag info.

ScriptSmith commented 5 years ago

The plugin implements how I'll add profile scraping to instamancer. There's no 'api' as such to retrieve profile information like there is for posts in a hashtag, so it has to be read from the page or memory. You can read about how instamancer works here.

The only difference is that the plugin uses a new session for each profile. Using the same session (like instamancer post postid1,postid2...) to gather multiple profiles may work better to mitigate rate limiting, or it may have the opposite effect.

bwyyoung commented 5 years ago

I understand. Thank you for clarifying that. Do you think it would be possible if instamancer accepted a page token or user API token from Facebook as an option? https://developers.facebook.com/docs/instagram-api/reference/user

Here, business user information can be obtained with a GET request to the Facebook API. The only thing we need is a token passed as part of the GET request in order to retrieve the JSON information. This way, we wouldn't be rate limited or blocked by Instagram.
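
For illustration, a sketch of how such a GET URL might be assembled. `buildIgUserUrl` is a hypothetical helper; the unversioned `graph.facebook.com` host, the `fields` and `access_token` query parameters, and the field names `username`, `biography`, and `website` are assumptions that should be checked against the reference linked above:

```typescript
// Hypothetical helper: build the request URL for a business user's profile.
// Field names and URL layout are assumptions; verify them in the API reference.
function buildIgUserUrl(
    igUserId: string,
    accessToken: string,
    fields: string[] = ["username", "biography", "website"],
): string {
    const id = encodeURIComponent(igUserId);
    const fieldList = encodeURIComponent(fields.join(","));
    const token = encodeURIComponent(accessToken);
    return `https://graph.facebook.com/${id}?fields=${fieldList}&access_token=${token}`;
}

// Placeholder ID and token, not real credentials
const exampleUrl = buildIgUserUrl("17841400000000000", "PLACEHOLDER_TOKEN");
console.log(exampleUrl);
```

The resulting URL could be fetched with any HTTP client. The ID is the account's Instagram user ID as issued by the API, not the public username.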

ScriptSmith commented 5 years ago

Instamancer is a web scraper, so I wouldn't consider implementing something that directly interacts with a regular API; there are better tools for that.

There are many tools available that work with Facebook's Graph API. I'd recommend using one of those instead.

bwyyoung commented 5 years ago

Ok Sir. Thank you for your help.