PatNeedham / google-it

command line Google search and save to JSON
105 stars 35 forks

Our systems have detected unusual traffic from your computer network. #26

Open ghost opened 4 years ago

ghost commented 4 years ago

Hi,

I'm using google-it in my project. I'm running a lot of queries in a short period of time, and after a while I get this error:

Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. Why did this happen?

How can I get past it? Thanks.

PatNeedham commented 4 years ago

Hi @kamilkoc-yusufayas ,

Thank you for reporting this. Can you elaborate on how those searches were made, and in what small time duration?

I haven't run into this myself, but while digging into some of the other open issues, I added a new debugging feature along the way to accomplish 2 things:

1) Save the resulting HTML page to a local file.
2) Use that local file instead of making the actual network request for the live Google search page.

For example:

$ google-it --query="test" --htmlFileOutputPath=tempOutput2.html
$ google-it -f ./tempOutput2.html # -f shortcut for '--fromFile' option

That lets me reuse the same file over and over. Could you try the latest version published today (v1.2.2) and upload the resulting HTML page here? It might be possible to detect that 'unusual traffic' page directly in the code and then retry with different configuration options (like the userAgent field in the GET request).
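
As a rough sketch of what that detection could look like, using the htmlFileOutputPath option shown above; the block marker string and the retry handling are assumptions, not an existing google-it feature:

const googleIt = require('google-it');
const fs = require('fs');

// String that appears on Google's interstitial page (assumed marker).
const BLOCK_MARKER = 'Our systems have detected unusual traffic';

const searchWithBlockDetection = async (query) => {
    const htmlFileOutputPath = `/tmp/${query}.html`;
    const results = await googleIt({ query, htmlFileOutputPath });
    const html = fs.readFileSync(htmlFileOutputPath, 'utf8');
    if (html.includes(BLOCK_MARKER)) {
        // Hypothetical handling: back off here, or retry with a different
        // userAgent if/when google-it exposes such an option.
        throw new Error(`Google blocked the request for "${query}"`);
    }
    return results;
};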

ghost commented 4 years ago

Hi Pat,

I will try to produce the HTML output you asked for.

Looking at my searches, I was running a google-it query 15 times every 5 seconds, for about one minute. After that minute I started getting that response from google-it. Google detects me as a robot, blocks my searches for 5-6 hours, and puts up a captcha. If I solve the captcha, it lets me search again; until the captcha is solved, the block stays in place.

What is the solution? I found one myself: I increased the delay between search calls from 5 seconds to 30 seconds (sketched below). That way Google doesn't flag you as a robot. But this workaround comes with a drawback: the run time becomes 6x longer.
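
A minimal sketch of that throttling approach, assuming a plain sequential loop and the 30-second delay mentioned above:

const googleIt = require('google-it');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run the queries one at a time, waiting 30 seconds between calls so the
// traffic looks less like a bot.
const throttledSearches = async (queries, delayMs = 30000) => {
    const allResults = [];
    for (const query of queries) {
        allResults.push(await googleIt({ query }));
        await sleep(delayMs);
    }
    return allResults;
};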

Maybe google-it could somehow detect the block mechanism and get past the captcha by itself.

Thanks for taking the time to help me.

PatNeedham commented 4 years ago

I was playing around with this limitation a little bit yesterday, using AWS Lambda as the place to perform these searches. My lambda function looks like this:

const googleIt = require('google-it');
const util = require('util');
const fs = require('fs');
const AWS = require('aws-sdk');
AWS.config.update({region: 'us-east-1'});
const s3 = new AWS.S3({apiVersion: '2006-03-01'});
const upload = util.promisify(s3.upload.bind(s3));

const Bucket = '<my-bucket-name-here>';

const queries = [
// array with 20 random phrases
];

function sleep(duration) {
    return new Promise((resolve) => {
        setTimeout(() => {
            resolve();
        }, duration);
    });
}

const performSearchAndSaveResult = async (query, index) => {
    const filePath = `/tmp/${query}.html`;
    try {
        // Run the search and have google-it write the raw HTML page to /tmp as well.
        const results = await googleIt({ query, htmlFileOutputPath: filePath });
        // Give the HTML file time to finish writing (see the note below the code).
        await sleep(2000);
        const fileContent = fs.readFileSync(filePath, 'utf8');
        const params = { Bucket, Key: `bulk-experiment/${query}.html`, Body: fileContent };
        const data = await upload(params);
        return data;
    } catch (error) {
        // Swallow per-query failures so one blocked search doesn't fail the whole batch.
        return null;
    }
};

exports.handler = async (event) => {
    try {
        const promises = queries.map(performSearchAndSaveResult);
        const results = await Promise.all(promises);
        return results;
    } catch (error) {
        console.log(error);
        return { statusCode: 500, error };
    }
};

After running it, my S3 bucket gets populated with the resulting HTML pages. (Side note: I needed to include await sleep(2000); because without it, the fileContent value was empty. That happens because the googleIt function returns before the file has actually been written, which I'll address in a minor update later today so the time delay is no longer necessary.) I'll also try with a larger query array size (100+) to see the captcha you are getting.
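
In the meantime, one possible workaround sketch for that race: instead of a fixed sleep(2000), poll until the HTML file exists and is non-empty. The helper name and timings here are just illustrative:

const fs = require('fs');

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wait until filePath exists and has content, or give up after timeoutMs.
const waitForFile = async (filePath, intervalMs = 200, timeoutMs = 10000) => {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        if (fs.existsSync(filePath) && fs.statSync(filePath).size > 0) {
            return;
        }
        await sleep(intervalMs);
    }
    throw new Error(`Timed out waiting for ${filePath}`);
};

With something like that, the await sleep(2000); line in performSearchAndSaveResult could become await waitForFile(filePath); instead.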

Suppose Google shows that captcha after X queries are made from the same IP address within Y seconds. One alternative to increasing the time period is to keep the same lambda function as in the example above, but duplicate it across different AWS regions (us-east-1, us-east-2, eu-central-1, eu-west-1, etc.). That way, a "master lambda" can accept the large array of search queries as its input and assign each search (or subset of searches) to one of the duplicated lambdas, which would perform that search. And if we're talking about thousands of queries, to the point where it wouldn't be feasible to perform the searches all at once (even across all the different AWS regions), that would probably require the lambda function to schedule one-time CloudWatch Events cron jobs for each batch of queries. I'm definitely thinking way too much into this, but it's entertaining!
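
For illustration only, a minimal sketch of that fan-out idea; the worker function name, and the assumption that the search lambda is already deployed in each region, are hypothetical:

const AWS = require('aws-sdk');

const REGIONS = ['us-east-1', 'us-east-2', 'eu-central-1', 'eu-west-1'];
const WORKER_FUNCTION_NAME = 'google-it-search-worker'; // hypothetical name

// "Master lambda": spread the queries round-robin across the regional workers.
exports.handler = async ({ queries }) => {
    const invocations = queries.map((query, index) => {
        const region = REGIONS[index % REGIONS.length];
        const lambda = new AWS.Lambda({ region });
        return lambda.invoke({
            FunctionName: WORKER_FUNCTION_NAME,
            InvocationType: 'Event', // fire-and-forget; each worker saves its HTML to S3
            Payload: JSON.stringify({ query }),
        }).promise();
    });
    return Promise.all(invocations);
};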