ScriptSmith / instamancer

Scrape Instagram's API with Puppeteer
http://adamsm.com/instamancer
MIT License
398 stars, 61 forks

[BUG] Scraping is not working anymore because Instagram requires authorization #45

Closed floss-order closed 3 years ago

floss-order commented 4 years ago

Describe the bug Scraping is not working anymore. The cause is Instagram itself: you now have to log into your account in order to use it. To make sure I'm right, I turned off headless mode and started instamancer. As expected, it shows the login page and instamancer is not able to do its work.

To Reproduce Steps to reproduce the behavior.

  1. Scrape something

Setup (please complete the following information):

Additional context I tried to fix this by creating a function that runs puppeteer and logs me in, but the browser wasn't saving my data. According to the puppeteer docs, you have to specify the userDataDir property, which contains the path to your browser's user data. What I'm struggling with is how to change this property in instamancer.
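For reference, a minimal sketch of how a profile directory is passed to Puppeteer through its `userDataDir` launch option. Nothing in this thread shows instamancer exposing that option, so the `buildLaunchOptions` and `launchWithProfile` helpers and the `./ig-profile` path are hypothetical names used only for illustration.

```javascript
const path = require("path");

// Hypothetical helper: builds Puppeteer launch options that keep cookies
// and local storage in a directory on disk between runs.
function buildLaunchOptions(profileDir, headless = true) {
    return {
        headless,
        // userDataDir is a puppeteer.launch() option; the path is resolved
        // so the same profile is used regardless of the working directory.
        userDataDir: path.resolve(profileDir),
    };
}

// Puppeteer is required lazily so the helper above works without it installed.
async function launchWithProfile(profileDir) {
    const puppeteer = require("puppeteer");
    // Headful on purpose, so a manual login can be completed once by hand.
    return puppeteer.launch(buildLaunchOptions(profileDir, false));
}
```

Calling `launchWithProfile("./ig-profile")` once, logging in by hand, and reusing the same directory on later runs should keep the session, provided Instagram does not invalidate it.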

IORoot commented 4 years ago

Just my little contribution here. I've managed to jerry-rig a poor implementation of logging in with credentials if puppeteer is redirected to the login page.

Within the constructPage method in the instagram.js file, I added a check for the login page that attempts to enter the username/password if the page is found. Seems to be working for me now. I'm not adding a pull request because this is very hacked together and I don't really know JS very well. It also has a hard-coded user/pass in it, which is bad practice.

However, I'm sure someone else can make a much better implementation of this.

Just remember to replace YOUR_ACCOUNT_USERNAME_GOES_HERE and YOUR_ACCOUNT_PASSWORD_GOES_HERE with your real credentials.

    /**
     * Create the browser and page, then visit the url
     */
    async constructPage() {
        // Browser args
        const args = [];
        /* istanbul ignore if */
        if (process.env.NO_SANDBOX) {
            args.push("--no-sandbox");
            args.push("--disable-setuid-sandbox");
        }
        if (this.proxyURL !== undefined) {
            args.push("--proxy-server=" + this.proxyURL);
        }
        // Browser launch options
        const options = {
            args,
            headless: this.headless,
        };
        if (this.executablePath !== undefined) {
            options.executablePath = this.executablePath;
        }
        // Launch browser
        if (this.browserInstance) {
            await this.progress(Progress.LAUNCHING);
            this.browser = this.browserInstance;
            this.browserDisconnected = !this.browser.isConnected();
            this.browser.on("disconnected", () => (this.browserDisconnected = true));
        }
        else if (!this.sameBrowser || (this.sameBrowser && !this.started)) {
            await this.progress(Progress.LAUNCHING);
            this.browser = await puppeteer_1.launch(options);
            this.browserDisconnected = false;
            this.browser.on("disconnected", () => (this.browserDisconnected = true));
        }
        // New page
        this.page = await this.browser.newPage();
        await this.progress(Progress.OPENING);

        // Attempt to visit URL
        try {

            await this.page.goto(this.url);

            // ┌─────────────────────────────────────────────────────────────────────────┐ 
            // │                                                                         │░
            // │                                                                         │░
            // │                      CHECK FOR LOGIN PAGE HERE                          │░
            // │                                                                         │░
            // │                                                                         │░
            // └─────────────────────────────────────────────────────────────────────────┘░
            // ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
            // Best-effort login. If the username field does not appear within
            // 2 seconds, waitForSelector throws and the catch below treats that
            // as "no login redirect". Note that the catch also swallows any
            // failure in the later login steps.
            try {

                this.logger.error("Checking whether we were redirected to the login page.");
                await this.page.waitForSelector('input[name="username"]', { timeout: 2000 });
                this.logger.error("Login page found, attempting to use credentials.");
                await this.page.type('input[name="username"]', 'YOUR_ACCOUNT_USERNAME_GOES_HERE');
                await this.page.type('input[name="password"]', 'YOUR_ACCOUNT_PASSWORD_GOES_HERE');
                await this.page.click('button[type="submit"]');
                // Give the post-login dialogs time to render
                await this.page.waitFor(2000);

                // "Save login details" dialog button
                await this.page.waitForSelector('button[type="button"]');
                await this.page.click('button[type="button"]');

                // "Turn on notifications" dialog button
                await this.page.waitForSelector('button[tabindex="0"]');
                await this.page.click('button[tabindex="0"]');

                // Go back to the originally requested URL, not the login page.
                await this.page.goto(this.url);

            } catch (error) {
                this.logger.error("No login screen found.");
            }

            // Check page loads
            /* istanbul ignore next */
            const pageLoaded = await this.page.evaluate(() => {
                const headings = document.querySelectorAll("h2");
                for (const heading of Array.from(headings)) {
                    if (heading.innerHTML ===
                        "Sorry, this page isn't available.") {
                        return false;
                    }
                }
                return true;
            });

            if (!pageLoaded) {
                await this.handleConstructionError("Page loaded with no content", 10);
                return false;
            }
            // Run defaultPagePlugins
            for (const f of this.defaultPageFunctions) {
                await this.page.evaluate(f);
            }
            // Fix issue with disabled scrolling
            /* istanbul ignore next */
            await this.page.evaluate(() => {
                setInterval(() => {
                    try {
                        document.body.style.overflow = "";
                    }
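                    catch (error) {
                        // This callback runs in the page context, where
                        // this.logger does not exist, so log to the page console.
                        console.error("Failed to update style", error);
                    }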
                }, 10000);
            });

        }
        catch (e) {
            await this.handleConstructionError(e, 60);
            return false;
        }
        return true;
    }
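Since the hard-coded credentials are called out as bad practice, here is a hedged sketch of the same login check with the credentials read from environment variables. The variable names `IG_USERNAME` and `IG_PASSWORD` and both helper names are hypothetical; instamancer itself does not read them. The selectors are the ones used in the snippet above and may stop matching whenever Instagram changes its markup.

```javascript
// Hypothetical variable names; nothing in instamancer reads these.
function loadCredentials(env = process.env) {
    const username = env.IG_USERNAME;
    const password = env.IG_PASSWORD;
    if (!username || !password) {
        throw new Error("Set IG_USERNAME and IG_PASSWORD before scraping");
    }
    return { username, password };
}

// Fills in the login form if it appears within `timeout` milliseconds.
// Returns true if a login was attempted, false if no form was found.
async function loginIfPrompted(page, credentials, timeout = 2000) {
    try {
        await page.waitForSelector('input[name="username"]', { timeout });
    } catch (error) {
        return false; // no login form within the timeout
    }
    await page.type('input[name="username"]', credentials.username);
    await page.type('input[name="password"]', credentials.password);
    await page.click('button[type="submit"]');
    return true;
}
```

Unlike the single try/catch above, only the initial selector wait is treated as "no login page"; failures while typing or submitting propagate to the caller.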

navxio commented 4 years ago

Tried this myself, same issue. Generated a log file though:

instamancer.log

Looks like the preliminary OAuth request fails for some reason when coming from a Linux host...

Edit: Works just fine on my Mac, but fails on an Ubuntu VPS

IORoot commented 4 years ago

So, I've been playing around with trying to get a proof of concept working to batch process 50 user requests in a row on my DigitalOcean server and I think I've just managed to crack it. There are a bunch of steps I took, and once I've put it all together I'll submit a pull request. However, here are the things that I think you need to solve:

  1. Puppeteer will open a new browser for every request. Essentially, I think Instagram was seeing that many new requests were happening and a lot of opening/closing of the browser. This was triggering their spam detection. On my server, I could get through nine requests before the rate limit.

To mitigate this, I managed to get the code to loop across all of the requests while keeping the first instance of the browser open. This means that Instagram thinks it's a single session where the user is just visiting multiple accounts.

  2. Login detection. If a login screen is detected, it needs to be handled with some credentials. So I supplied a creds.json file that is read, and its contents entered, if the login page is detected.

  3. The new login location (location of your server) will be detected by Instagram. You need to manually confirm through email that this is a new location and that it's you. (2FA)

  4. I've swapped puppeteer out for puppeteer-extra and am using the puppeteer-extra-stealth-plugin to help deter any bot-detection.

That's it. At the moment, the code is in a mess, but I think that this might point folks in the right direction. I've successfully just scraped 50 individual accounts from the server.
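The "one browser, many accounts" idea from the first point can be sketched roughly as follows. This is not the code from the fork; `scrapeAccounts` and the `scrapeOne(page, username)` callback are hypothetical names, and only `browser.newPage()` and `page.close()` are Puppeteer calls.

```javascript
// Runs scrapeOne(page, username) for each account in sequence, reusing a
// single browser so the requests look like one long browsing session.
async function scrapeAccounts(browser, usernames, scrapeOne) {
    const results = new Map();
    for (const username of usernames) {
        const page = await browser.newPage();
        try {
            results.set(username, await scrapeOne(page, username));
        } catch (error) {
            // Record the failure and carry on with the next account
            results.set(username, { error: error.message });
        } finally {
            // Close the tab, but leave the browser (and its cookies) open
            await page.close();
        }
    }
    return results;
}
```

The browser is launched and closed by the caller, so its lifetime spans the whole batch rather than a single request.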

ScriptSmith commented 4 years ago

Hi all, thanks for your contributions. Unfortunately I haven't had much time to spend on instamancer recently, but I'm a little more free now. Hopefully I can provide some more insight.

I think Instagram could be flagging multiple unauthenticated requests from the same address with different session cookies and other headers. However I am more confident that the explanation is simply that they now block unauthenticated requests from popular cloud platforms, as instamancer was working very reliably on these platforms until recently, and now it doesn't work even with brand new connections.

Ultimately there is a tradeoff between the pattern of multiple short sessions from the same source, and a single long session from the same source. In the past, having multiple sessions proved to be advantageous, but perhaps this is no longer the case.

The instamancer module (not cli) has an optional argument called browserInstance which you can use to persist a single puppeteer browser between scraping jobs. The sameBrowser argument can also be used to stop instamancer initiating grafting with a separate browser.

I'm not sure if you have been using those two features in your private fork @IORoot, but if so I think they can be used to test whether it is more advantageous to persist a single session. If so, we can add more options to keep instagram cookies, persist profile data with userDataDir etc.
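A rough sketch of how those two options might be combined when using instamancer as a module. The `browserInstance` and `sameBrowser` option names come from the comment above; the `createApi(endpoint, id, options)` entry point, the `"user"` endpoint, the `total` option and `api.generator()` are assumptions based on the project's README and should be checked against the installed version. `createApi` is passed in as a parameter so that the sketch does not depend on a particular import style.

```javascript
// ASSUMPTION: createApi(endpoint, id, options) returns an object whose
// generator() method is an async iterator of scraped items.
async function scrapeUsersWithSharedBrowser(createApi, browser, usernames) {
    const collected = [];
    for (const username of usernames) {
        const api = createApi("user", username, {
            browserInstance: browser, // reuse the caller's Puppeteer browser
            sameBrowser: true,        // keep grafting in that same browser
            total: 1,                 // assumed option: stop after one item
        });
        for await (const item of api.generator()) {
            collected.push(item);
        }
    }
    return collected;
}
```

With a real setup, `createApi` would come from the instamancer module and `browser` from a single `puppeteer.launch()` call made before the loop.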

I don't know how useful puppeteer-extra-stealth-plugin is as I don't see any evidence of instagram looking for puppeteer. I attribute this mostly to the fact that puppeteer is not the most popular instagram scraping method.

One other thing to note is that I likely won't be including any instagram 'login' or other sophisticated user interaction mechanisms in instamancer. You can write plugins to interact with the instagram webpages yourselves, or use plugins written by others.

If people are interested in using plugins to have more intricate interactions with the webpage, then I can also look at making improvements to their usability. They're pretty easy to use if you're using instamancer as a node module, but it's quite hard to use them with the CLI.

IORoot commented 4 years ago

Hey @ScriptSmith, thanks for the comments and the heads-up on the optional arguments. I didn't see them, and they would have made life much easier! Oh well. I ended up creating a new command called users that behaves very similarly to the posts command but lets you submit a CSV of multiple accounts. It then loops through each one, keeping the browser open for all of them.

Completely understand the motivation to not do the login part, and really, that was probably the easiest part to do within the constructPage() method. It's wise to make that a plugin anyway, since I imagine the complexity of it will become more difficult in the future.

Once everything was running, I disabled the stealth-plugin and it made no difference, so I don't think that's needed right now.

My code isn't perfect by any means since I think I broke some of your functionality, which I need to fix (I'm new to TS - it's taking me time), but it seems to mostly work. The changes I made are all on my fork and it's happily running on the server.

lum1nat0r commented 4 years ago

@IORoot So you are telling me that your code is currently running on your server? How come I can't get it running :( I have it running inside an Ubuntu container, cloned your repo and installed all the dependencies, but I still get the same error message that @navxio showed in his logs. Do you have any suggestions for what I could try to get it running? :)

IORoot commented 4 years ago

Yep, it's still running and working well. There are a LOT of gotchas with Instagram that you need to work your way through. Off the top of my head the main ones are:

  1. If it's running on a server, let's say with an IP address of 1.1.1.1 then Instagram will see that as a new IP address connecting to its service. With the login functionality I added, that account will get an email notification to say "Hey, we just saw a new connection from this browser/machine/IP 1.1.1.1 - is that you?". Which you'll need to confirm to say that's you.

  2. If your server IP is 1.1.1.1, sometimes Instagram will flag this as "suspicious behaviour" and send a 6-digit code to your email account to then add into the browser, right there and then. This is a problem because Instamancer can't deal with this. So, the way I fixed it was to install a proxy server on the machine (TinyProxy) and then use my laptop 2.2.2.2 to tunnel through the server 1.1.1.1, so I can have the same IP address as the server and then manually deal with the 6-digit code confirmation on my laptop. Once I've confirmed the "suspicious behaviour" as me, Instagram then sees the IP 1.1.1.1 as an OK IP address and won't flag it up again.

  3. I've added a "screenshot" function into instamancer that takes an image and places it into /tmp/instamancer/ at each step of the process so I can see where it's getting stuck. This definitely helps to debug what Instagram's current problem is.

  4. I've allowed the --proxyURL flag on the command line so I can proxy through any other servers I need to when debugging.

  5. I've added --user and --pass flags to drive the login steps instead of supplying a creds.json file. Makes life easier.
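The screenshot-per-step idea from point 3 could look something like this. It is a sketch rather than the fork's actual code: `createStepRecorder` is a hypothetical helper, the default directory is the one mentioned above, and `page.screenshot({ path })` is the only Puppeteer call involved.

```javascript
const fs = require("fs");
const path = require("path");

// Returns a function that saves a numbered screenshot for each named step,
// e.g. /tmp/instamancer/01-after-goto.png, 02-after-login.png, ...
function createStepRecorder(page, dir = "/tmp/instamancer") {
    fs.mkdirSync(dir, { recursive: true });
    let step = 0;
    return async function record(label) {
        step += 1;
        const name = `${String(step).padStart(2, "0")}-${label}.png`;
        const file = path.join(dir, name);
        await page.screenshot({ path: file });
        return file;
    };
}
```

Calling `await record("after-goto")` right after each navigation or click leaves an ordered trail of images showing where the run stalled.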

I have noticed that Instagram can see that "Headless Chrome" and "Linux" are being used. If it starts objecting to that, I may return to the stealth puppeteer project.
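If the "Headless Chrome" marker ever does become a problem, one lightweight option short of the stealth plugin is to override the user agent. The sketch below assumes the headless user agent string contains the token `HeadlessChrome`; `maskHeadlessUserAgent` is a hypothetical helper built on Puppeteer's `browser.userAgent()` and `page.setUserAgent()`. It does not hide the "Linux" part or any other fingerprinting signal.

```javascript
// Replaces the "HeadlessChrome" token in the browser's default user agent
// with "Chrome" and applies the result to the given page.
async function maskHeadlessUserAgent(browser, page) {
    const original = await browser.userAgent();
    const masked = original.replace("HeadlessChrome", "Chrome");
    await page.setUserAgent(masked);
    return masked;
}
```

This would need to run on each new page before the first `page.goto()` call.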

diegofullstackjs commented 3 years ago

Where can I find the creds.json file?

ScriptSmith commented 3 years ago

Instagram is now much more aggressively enforcing login.

See the notice in the README and #58