floss-order closed this issue 3 years ago
Just my little contribution here. I've managed to jerry-rig a poor implementation of logging in with credentials if puppeteer is redirected to the login page.
Within the constructPage method in the instagram.js file, I added a check for the login page that attempts to insert the username/password if the login form is found. Seems to be working for me now.
I'm not adding a pull request because this is very hacked together and I don't really know JS very well. It also has a hard-coded user/pass in it, which is bad practice.
However, I'm sure someone else can make a much better implementation of this.
Just remember to replace YOUR_ACCOUNT_USERNAME_GOES_HERE and YOUR_ACCOUNT_PASSWORD_GOES_HERE with your real account details.
/**
 * Create the browser and page, then visit the url
 */
async constructPage() {
    // Browser args
    const args = [];
    /* istanbul ignore if */
    if (process.env.NO_SANDBOX) {
        args.push("--no-sandbox");
        args.push("--disable-setuid-sandbox");
    }
    if (this.proxyURL !== undefined) {
        args.push("--proxy-server=" + this.proxyURL);
    }
    // Browser launch options
    const options = {
        args,
        headless: this.headless,
    };
    if (this.executablePath !== undefined) {
        options.executablePath = this.executablePath;
    }
    // Launch browser
    if (this.browserInstance) {
        await this.progress(Progress.LAUNCHING);
        this.browser = this.browserInstance;
        this.browserDisconnected = !this.browser.isConnected();
        this.browser.on("disconnected", () => (this.browserDisconnected = true));
    }
    else if (!this.sameBrowser || (this.sameBrowser && !this.started)) {
        await this.progress(Progress.LAUNCHING);
        this.browser = await puppeteer_1.launch(options);
        this.browserDisconnected = false;
        this.browser.on("disconnected", () => (this.browserDisconnected = true));
    }
    // New page
    this.page = await this.browser.newPage();
    await this.progress(Progress.OPENING);
    // Attempt to visit URL
    try {
        await this.page.goto(this.url);
        // ==================== CHECK FOR LOGIN PAGE HERE ====================
        try {
            this.logger.error("Checking whether we have been redirected to the login page.");
            // Times out (and jumps to the catch below) if there is no login form
            await this.page.waitForSelector('input[name="username"]', { timeout: 2000 });
            this.logger.error("Login page found, attempting to use credentials.");
            await this.page.type('input[name="username"]', 'YOUR_ACCOUNT_USERNAME_GOES_HERE');
            await this.page.type('input[name="password"]', 'YOUR_ACCOUNT_PASSWORD_GOES_HERE');
            await this.page.click('button[type="submit"]');
            // Give the login request time to complete
            await this.page.waitFor(2000);
            // Save Details button
            await this.page.waitForSelector('button[type="button"]');
            await this.page.click('button[type="button"]');
            // Notifications button
            await this.page.waitForSelector('button[tabindex="0"]');
            await this.page.click('button[tabindex="0"]');
            // Go to the originally requested URL, not the login page
            await this.page.goto(this.url);
        } catch (error) {
            // Reached when there is no login form, but also when any login step above fails
            this.logger.error("No login screen found, or login failed.", { error });
        }
        // ====================================================================
        // Check page loads
        /* istanbul ignore next */
        const pageLoaded = await this.page.evaluate(() => {
            const headings = document.querySelectorAll("h2");
            for (const heading of Array.from(headings)) {
                if (heading.innerHTML ===
                    "Sorry, this page isn't available.") {
                    return false;
                }
            }
            return true;
        });
        if (!pageLoaded) {
            await this.handleConstructionError("Page loaded with no content", 10);
            return false;
        }
        // Run defaultPagePlugins
        for (const f of this.defaultPageFunctions) {
            await this.page.evaluate(f);
        }
        // Fix issue with disabled scrolling
        /* istanbul ignore next */
        await this.page.evaluate(() => {
            setInterval(() => {
                try {
                    document.body.style.overflow = "";
                }
                catch (error) {
                    // Runs in the page context, where this.logger does not exist
                    console.error("Failed to update style", error);
                }
            }, 10000);
        });
    }
    catch (e) {
        await this.handleConstructionError(e, 60);
        return false;
    }
    return true;
}
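The login check could also be pulled out into a small helper that takes the credentials as an argument, so nothing is hard-coded. This is only a rough sketch: loginIfRedirected and the shape of the credentials object are made-up names, not part of instamancer, and the selectors are the same ones used in the snippet above, which Instagram may change at any time.

```javascript
// Hypothetical helper (not part of instamancer): logs in only if the
// login form shows up. `page` is a Puppeteer Page, or any object with
// the same waitForSelector/type/click methods.
async function loginIfRedirected(page, credentials, timeout = 2000) {
  const usernameInput = 'input[name="username"]';
  try {
    // If the form does not appear within `timeout` ms, assume no redirect.
    await page.waitForSelector(usernameInput, { timeout });
  } catch (error) {
    return false;
  }
  // From here on errors propagate to the caller, so a failed login is
  // not mistaken for "no login page".
  await page.type(usernameInput, credentials.username);
  await page.type('input[name="password"]', credentials.password);
  await page.click('button[type="submit"]');
  return true;
}
```

The "Save Details" and notification dialogs are left out here; they would still need the extra clicks from the original snippet.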
Tried this myself, same issue. Generated a log file though: instamancer.log
Looks like the preliminary OAuth request fails for some reason when coming from a Linux host...
Edit: Works just fine on my Mac, but fails on an Ubuntu VPS.
So, I've been playing around with trying to get a proof of concept working to batch process 50 user requests in a row on my DigitalOcean server, and I think I've just managed to crack it. There were a bunch of steps, and once I've put it all together I'll submit a pull request. In the meantime, here are the things that I think you need to solve:
To mitigate against this, I got the code to loop across all of the requests while keeping the first instance of the browser open. That way Instagram thinks it's a single session where the user is just visiting multiple accounts.
Login detection. If a login screen is detected, it needs to be handled with some credentials, so I supplied a creds.json file whose contents are read and entered when the login page is detected.
The new login location (location of your server) will be detected by Instagram. You need to manually accept through email that this is a new location and that it's you. (2FA)
I've swapped puppeteer out for puppeteer-extra and am using puppeteer-extra-plugin-stealth to help evade any bot detection.
That's it. At the moment the code is a mess, but I think this might point folks in the right direction. I've just successfully scraped 50 individual accounts from the server.
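For the creds.json step, a loader along these lines would do. The file layout is an assumption on my part (a JSON object with "username" and "password" keys), since the actual format isn't shown in this thread, and loadCredentials is just an illustrative name.

```javascript
const fs = require("fs");

// Assumed (hypothetical) creds.json layout:
//   { "username": "...", "password": "..." }
function loadCredentials(filePath) {
  const parsed = JSON.parse(fs.readFileSync(filePath, "utf8"));
  for (const key of ["username", "password"]) {
    // Fail early with a clear message instead of typing "undefined"
    // into the login form later.
    if (typeof parsed[key] !== "string" || parsed[key] === "") {
      throw new Error(`creds file is missing "${key}"`);
    }
  }
  return { username: parsed.username, password: parsed.password };
}
```

Keeping the file out of version control (for example via .gitignore) avoids the hard-coded credentials problem mentioned earlier.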
Hi all, thanks for your contributions. Unfortunately I haven't had much time to spend on instamancer recently, but I'm a little more free now. Hopefully I can provide some more insight.
I think Instagram could be flagging multiple unauthenticated requests from the same address with different session cookies and other headers. However, I am more confident that the explanation is simply that they now block unauthenticated requests from popular cloud platforms: instamancer was working very reliably on these platforms until recently, and now it doesn't work even with brand new connections.
Ultimately there is a tradeoff between the pattern of multiple short sessions from the same source, and a single long session from the same source. In the past, having multiple sessions proved to be advantageous, but perhaps this is no longer the case.
The instamancer module (not the CLI) has an optional argument called browserInstance which you can use to persist a single puppeteer browser between scraping jobs. The sameBrowser argument can also be used to stop instamancer initiating grafting with a separate browser.
I'm not sure if you have been using those two features in your private fork @IORoot, but if so I think they can be used to test whether it is more advantageous to persist a single session. If so, we can add more options to keep instagram cookies, persist profile data with userDataDir, etc.
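For anyone wanting to test the single-session idea, here is a rough sketch of running several scraping jobs against one persisted browser. It assumes the module's factory is called as createApi(endpoint, id, options) and returns an object with an async generator() method; treat both as assumptions and check them against the README. The factory is passed in as a parameter so the sketch makes no claim about how instamancer is imported, and scrapeWithSharedBrowser is a made-up name.

```javascript
// Sketch only. `createApi` stands in for instamancer's factory function
// (assumed signature: createApi(endpoint, id, options)), and `browser`
// is an already-launched Puppeteer browser.
async function scrapeWithSharedBrowser(createApi, browser, usernames) {
  const results = new Map();
  for (const username of usernames) {
    // browserInstance is the option described above: every job reuses
    // the same browser instead of launching its own.
    const api = createApi("user", username, { browserInstance: browser });
    const posts = [];
    for await (const post of api.generator()) {
      posts.push(post);
    }
    results.set(username, posts);
  }
  return results;
}
```

Running the jobs one after another, rather than in parallel, keeps the traffic looking like one user browsing several profiles.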
I don't know how useful puppeteer-extra-plugin-stealth is, as I don't see any evidence of instagram looking for puppeteer. I attribute this mostly to the fact that puppeteer is not the most popular instagram scraping method.
One other thing to note is that I likely won't be including any instagram 'login' or other sophisticated user interaction mechanisms in instamancer. You can write plugins to interact with the instagram webpages yourselves, or use plugins written by others.
If people are interested in using plugins to have more intricate interactions with the webpage, then I can also look at making improvements to their usability. They're pretty easy to use if you're using instamancer as a node module, but it's quite hard to use them with the CLI.
Hey @ScriptSmith, thanks for the comments and heads-up on the optional arguments. I didn't see them actually, and they would have made life much easier! Oh well.
I ended up creating a new command called users that behaves very similarly to the posts command, but allows you to submit a CSV of multiple accounts.
It then loops through each one, keeping the browser open for all of them.
Completely understand the motivation to not do the login part, and really, that was probably the easiest part to do within the constructPage() method. It's wise to make that a plugin anyway, since I imagine it will only get more complex in the future.
Once everything was running, I disabled the stealth-plugin and it made no difference, so I don't think that's needed right now.
My code isn't perfect by any means since I think I broke some of your functionality, which I need to fix (I'm new to TS - it's taking me time), but it seems to mostly work. The changes I made are all on my fork and it's happily running on the server.
@IORoot So you are telling me that your code is currently running on your server? How come I can't get it running :( I mean, I have it running inside an Ubuntu container, cloned your repo and installed all the dependencies, but I still get the same error message that @navxio showed in his logs. Maybe you have some suggestion for what I could try to get it running? :)
Yep, it's still running and working well. There are a LOT of gotchas with Instagram that you need to work your way through. Off the top of my head the main ones are:
If it's running on a server, let's say with an IP address of 1.1.1.1, then Instagram will see that as a new IP address connecting to its service. With the login functionality I added, that account will get an email notification to say "Hey, we just saw a new connection from this browser/machine/IP 1.1.1.1 - is that you?", which you'll need to confirm.
If your server IP is 1.1.1.1, sometimes Instagram will flag this as "suspicious behaviour" and send a 6-digit code to your email account to then add into the browser, right there and then. This is a problem because Instamancer can't deal with this. So, the way I fixed it was to install a proxy server on the machine (TinyProxy) and then use my laptop 2.2.2.2 to tunnel through the server 1.1.1.1, so I can have the same IP address as the server and then manually deal with the 6-digit code confirmation on my laptop. Once I've confirmed the "suspicious behaviour" as me, Instagram then sees the IP 1.1.1.1 as an OK IP address and won't flag it up again.
I've added a "screenshot" function into instamancer that takes an image and places it into /tmp/instamancer/ at each step of the process so I can see where it's getting stuck. This definitely helps to debug what Instagram's current problem is.
I've allowed the --proxyURL flag on the command line so I can proxy through any other servers I need to when debugging.
I've added a --user and --pass flag now to allow the login steps to work instead of supplying a creds.json file. Makes life easier.
I have noticed that Instagram can see that "Headless Chrome" and "Linux" are being used. That may become an issue if it doesn't like them, in which case I may return to the stealth puppeteer project.
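The screenshot step mentioned in the list above can be sketched roughly like this. It isn't the code from the fork: screenshotStep and the numbered file naming are made up for illustration, and the only Puppeteer call used is page.screenshot({ path }).

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper: saves one numbered screenshot per step so the
// files sort in the order the steps ran.
async function screenshotStep(page, step, label, dir = "/tmp/instamancer") {
  // Create the output directory on first use.
  fs.mkdirSync(dir, { recursive: true });
  const file = path.join(dir, `${String(step).padStart(2, "0")}-${label}.png`);
  await page.screenshot({ path: file });
  return file;
}
```

Calling it after each navigation or click, for example with step 1 and label "after-goto", shows which screen Instagram served when a run gets stuck.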
Where do I find the creds.json file?
Instagram is now much more aggressively enforcing login.
See the notice in the README and #58
Describe the bug: Scraping is not working anymore. The issue is caused by Instagram itself: you have to log into your account in order to use it. To make sure I'm right, I turned off headless mode and started instamancer. As expected, it shows the login page and instamancer is not able to do its work.
Additional context: I tried to fix this by creating a function that runs puppeteer and authorizes me, but the browser wasn't saving my data. According to the puppeteer docs, you have to specify the userDataDir property, which contains the path to your browser's user data. The question I'm struggling with is how to change this property in instamancer.
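One possible route, sketched under assumptions: launch puppeteer yourself with userDataDir set, then hand that browser to instamancer through the browserInstance option mentioned earlier in this thread. The helper below only builds the launch options; buildLaunchOptions is a made-up name, while userDataDir, headless and args are standard puppeteer.launch options.

```javascript
// Builds an options object for puppeteer.launch(). userDataDir points
// Chrome at a profile directory, so cookies and the logged-in session
// are kept between runs.
function buildLaunchOptions({ userDataDir, headless = true, proxyURL } = {}) {
  const args = [];
  if (proxyURL !== undefined) {
    args.push("--proxy-server=" + proxyURL);
  }
  const options = { args, headless };
  if (userDataDir !== undefined) {
    options.userDataDir = userDataDir;
  }
  return options;
}
```

Something like puppeteer.launch(buildLaunchOptions({ userDataDir: "./ig-profile", headless: false })) would let you log in by hand once; later runs with the same directory should start with that session, assuming Instagram has not invalidated it in the meantime.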