benbusby / whoogle-search

A self-hosted, ad-free, privacy-respecting metasearch engine
https://pypi.org/project/whoogle-search/
MIT License
9.36k stars 925 forks

[FEATURE] anti-captcha support. #211

Open ghost opened 3 years ago

ghost commented 3 years ago

With stuff that can get blocked easily, anti-captcha support would be huge. Invidious has it implemented perfectly, and it allows public instances to be used without any major rate-limiting. Just an idea. Thanks!

benbusby commented 3 years ago

I do agree that would be a nice feature. I haven't encountered the captcha issue too much (thankfully) but I'd still like to eliminate it altogether.

My understanding of anti-captcha services at the moment is that the worthwhile ones aren't free, which unfortunately prevents it from being something I can universally implement in the code base. I suppose I could put the responsibility on the public instance maintainer to provide an API key for activating the service -- I'm not sure how Invidious and others handle it, but I assume it's something like that.

In any case, I'll look into it. Thanks for the recommendation!

ghost commented 3 years ago

Invidious implements it very nicely, I actually didn't know it was an option until talking with the developers. You'd just add a line in your config for your API key. I'm willing to pay if it means my instance can be used without worry of getting blocked, plus it's super cheap.
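A minimal sketch of that opt-in arrangement, assuming the key is read from an environment variable. The variable name `WHOOGLE_ANTICAPTCHA_KEY` and both helper functions are hypothetical; neither Whoogle nor Invidious is known to use these names.

```python
import os
from typing import Mapping, Optional

# Hypothetical setting name, chosen only for this illustration.
API_KEY_VAR = "WHOOGLE_ANTICAPTCHA_KEY"


def load_captcha_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the operator-supplied API key, or None when none is configured."""
    key = env.get(API_KEY_VAR, "").strip()
    return key or None


def captcha_solving_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """The paid service is only used when the instance maintainer provides a key."""
    return load_captcha_key(env) is not None
```

With this shape the feature stays off by default, so instances without a key behave exactly as before.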

unixfox commented 3 years ago

> I do agree that would be a nice feature. I haven't encountered the captcha issue too much (thankfully) but I'd still like to eliminate it altogether.
>
> My understanding of anti-captcha services at the moment is that the worthwhile ones aren't free, which unfortunately prevents it from being something I can universally implement in the code base. I suppose I could put the responsibility on the public instance maintainer to provide an API key for activating the service -- I'm not sure how Invidious and others handle it, but I assume it's something like that.
>
> In any case, I'll look into it. Thanks for the recommendation!

You can implement the anti-captcha API. It's neither universal nor a standard, but it's very common and easy to clone.

A lot of projects provide an anti-captcha API clone, like https://capmonster.cloud or mine, which I plan to release publicly as soon as I find it stable.

Implementing an anti-captcha solution in Whoogle is a great way to give public instance maintainers the tools to offer a reliable service that works even when Google is trying to rate limit the server.
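For illustration, here is a rough sketch of the create-then-poll flow that this style of API uses. The endpoint paths, the `NoCaptchaTaskProxyless` task type, and the JSON field names are written from memory of the anti-captcha.com documentation and should be treated as assumptions to verify against whichever provider or clone is used. The function names `post_json` and `solve_recaptcha` are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

# Assumed base URL; a compatible clone would expose the same paths on its own host.
API_BASE = "https://api.anti-captcha.com"


def post_json(url: str, payload: Dict) -> Dict:
    """Send a JSON POST request and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def solve_recaptcha(
    client_key: str,
    page_url: str,
    site_key: str,
    post: Callable[[str, Dict], Dict] = post_json,
    base: str = API_BASE,
    poll_seconds: float = 5.0,
    max_polls: int = 24,
) -> str:
    """Create a solving task, then poll until the service returns a token."""
    created = post(f"{base}/createTask", {
        "clientKey": client_key,
        "task": {
            "type": "NoCaptchaTaskProxyless",  # assumed task type name
            "websiteURL": page_url,
            "websiteKey": site_key,
        },
    })
    if created.get("errorId"):
        raise RuntimeError(f"createTask failed: {created}")
    task_id = created["taskId"]

    for _ in range(max_polls):
        result = post(f"{base}/getTaskResult",
                      {"clientKey": client_key, "taskId": task_id})
        if result.get("errorId"):
            raise RuntimeError(f"getTaskResult failed: {result}")
        if result.get("status") == "ready":
            return result["solution"]["gRecaptchaResponse"]
        time.sleep(poll_seconds)  # still processing, wait before the next poll
    raise TimeoutError("captcha was not solved in time")
```

The `post` and `base` parameters are injectable so the same code could point at a self-hosted clone, which is what makes the API easy to swap.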

Albonycal commented 3 years ago

Yea I'm also getting blocked by google... This would be cool.. Any updates? Thank you :D

maxdesalle commented 3 years ago

Also having this issue on a DigitalOcean droplet.

randomwalk3141592 commented 3 years ago

Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.

unixfox commented 3 years ago

> Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.

That's not possible. You can't load the same Google reCAPTCHA key from another domain that is not whitelisted.

Albonycal commented 3 years ago

Hmm.. I found a workaround for this on Heroku instances.. restarting the heroku dyno assigns the instance a new IP address.. So I never faced the captcha issue after this.. & afaik this doesn't cause any downtime

randomwalk3141592 commented 3 years ago

> > Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.
>
> That's not possible. You can't load the same Google reCAPTCHA key from another domain that is not whitelisted.

I'm not sure I understand. My Whoogle docker is querying Google. Google responds with a captcha. Whoogle just needs to let that captcha come through and display it in the browser so I can solve it. This shouldn't involve another domain.

randomwalk3141592 commented 3 years ago

> Hmm.. I found a workaround for this on Heroku instances.. restarting the heroku dyno assigns the instance a new IP address.. So I never faced the captcha issue after this.. & afaik this doesn't cause any downtime

I use a VPN and Whoogle docker queries Google through this VPN. Once in a while, Whoogle will get a captcha and I would have to reconnect the VPN connection so that I get a new public IP address. I know this solves the problem.

What is interesting is that when Whoogle chokes on the Google captcha, if I go to Google directly (also through the VPN, so my direct connection would come from the same public IP as Whoogle docker), Google does NOT show me a captcha.

It seems Google is somehow detecting that the Whoogle query is "weird" while my direct query to Google from my computer is not weird.

Albonycal commented 3 years ago

hmm.. Can it be different user agents? or fingerprint thing?

unixfox commented 3 years ago

> > > Rather than an anti-captcha feature, can you make it so that Whoogle will actually display Google's captcha so that it can be solved? Right now, the captcha is not displayed.
> >
> > That's not possible. You can't load the same Google reCAPTCHA key from another domain that is not whitelisted.
>
> I'm not sure I understand. My Whoogle docker is querying Google. Google responds with a captcha. Whoogle just needs to let that captcha come through and display it in the browser so I can solve it. This shouldn't involve another domain.

Whoogle is not a browser, it doesn't interpret JavaScript so it can't "show" you the CAPTCHA.

> hmm.. Can it be different user agents? or fingerprint thing?

No, Google rate-limits based on the IP address, and that's it.

unixfox commented 3 years ago

Just wanted to say that there is a way to bypass Google reCAPTCHA entirely, I explained how here: https://github.com/searxng/searxng/issues/159

@benbusby "just" needs to switch to this special endpoint, and we will have the CAPTCHA issue fixed.

JaneJeon commented 2 years ago

I honestly think all of this can be solved by using a better scraping method. I worked on scrapers to get past "gated" sites such as GSRPs (Google Search Result Pages), paywalls, etc. and the reliability of scraping (i.e. not getting cockblocked by a captcha because they detected that you were a "bot" - making a request not from their frontend) comes down to these factors:

  1. IP (holy shit people, this is the number 1 thing that gets you blocked by Google. USE THEM PROXIES!!)
  2. Rate limiting (how many requests per second/minute/hour are you sending to Google, per IP?)
  3. SSL fingerprinting (browsers make HTTPS requests in a different manner than just calling requests.get() does)
  4. Browser Fingerprinting (this is the big boy shit, and you almost never have to worry about it, except client-side rendered stuff, which is most definitely not GSRPs)

Now, 1 should be solved with proxies, 2 should be solved with careful rate limiting implementation (esp. w/ multiple proxies) within whoogle, 3 can be solved with careful HTTPS handshake implementation within whoogle, and 4 can be implemented using something like playwright plus browser stealth libraries that plug into playwright OR (given that this application is written in python and won't be able to use those stealth libraries that are typically written in js) use playwright to control a "stealthy" web browser instance, such as https://github.com/ulixee/secret-agent. Note that 4 is extreme overkill for most people's use cases (most bot solutions grade you on a sliding scale, so as long as you get 1, 2, and 3, your score will be still high enough to not require this bullshit)!!
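As a rough illustration of point 2 combined with multiple proxies, the sketch below hands out whichever proxy has been idle longest and enforces a minimum gap between requests on each one. The class name `PerProxyLimiter` and the interval value are hypothetical, and this is one simple approach rather than anything Whoogle is known to implement.

```python
import time
from typing import Callable, Dict, List


class PerProxyLimiter:
    """Enforce a minimum interval between requests sent through each proxy."""

    def __init__(
        self,
        proxies: List[str],
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        # Earliest time at which each proxy may be used again.
        self._next_ok: Dict[str, float] = {p: float("-inf") for p in proxies}

    def acquire(self) -> str:
        """Block until some proxy is allowed to send, then return it."""
        # Choose the proxy that becomes available soonest.
        proxy = min(self._next_ok, key=self._next_ok.get)
        wait = self._next_ok[proxy] - self._clock()
        if wait > 0:
            self._sleep(wait)
        self._next_ok[proxy] = self._clock() + self._min_interval
        return proxy
```

The returned value would then be passed along with each outgoing request, for example through the `proxies` argument of `requests.get()`. Adding more proxies raises total throughput while the per-IP request rate stays fixed.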

I literally never get blocked on Google this way (not from Whoogle, my private application), no matter how many requests I send. Whoogle should adopt at the very least 2 and 3, and really direct people towards using a proxy (instead of trying to remove the captcha - that is a losing solution). Honestly, that should suffice to close this issue once and for all.

JaneJeon commented 2 years ago

Actually, ignore all that bullshit I said above, @unixfox's method is 10000 times easier. We should do that.

unixfox commented 2 years ago

New way of fetching the Google results (search, videos, news, images and more) with an internal API of Google and with JSON results: https://github.com/searxng/searxng/issues/1642! This doesn't have any rate limit.