Extravi / araa-search

A privacy-respecting, ad-free, self-hosted Google metasearch engine with strong security that offers full API support and utilizes Qwant for images and DuckDuckGo for auto-complete.
https://araa.extravi.dev
GNU Affero General Public License v3.0
195 stars · 18 forks

reCAPTCHA proxy #103

Closed · Extravi closed this 6 months ago

Extravi commented 6 months ago

I'm working on a system that allows users to interact with reCAPTCHA. Whenever Araa gets rate-limited, it will load a web driver to proxy the captcha, allowing users to interact with it. If the user successfully completes the captcha, the web driver will capture the "GOOGLE_ABUSE_EXEMPTION=ID" cookie and send it in the request header to Google using makeHTMLRequest. Neither SearXNG nor other projects do this, so this will be the first.

[screenshots]
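Roughly, that flow could look like the sketch below (assuming Selenium and `requests`; the function names and `captcha_url` are illustrative, and the header-passing stands in for what makeHTMLRequest would do):

```python
import requests
from selenium import webdriver

def solve_and_capture(captcha_url):
    """Open the rate-limit page in a web driver; once the captcha is
    completed, grab the exemption cookie."""
    driver = webdriver.Firefox()
    try:
        driver.get(captcha_url)
        # ... the user's interaction with the captcha would be relayed here ...
        cookie = driver.get_cookie("GOOGLE_ABUSE_EXEMPTION")
        return cookie["value"] if cookie else None
    finally:
        driver.quit()

def search_with_exemption(query, exemption_id):
    # Attach the captured cookie to the scrape request, the way
    # makeHTMLRequest would attach its headers.
    headers = {"Cookie": f"GOOGLE_ABUSE_EXEMPTION={exemption_id}"}
    resp = requests.get("https://www.google.com/search",
                        params={"q": query}, headers=headers, timeout=10)
    return resp.text
```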

Extravi commented 6 months ago

I might have to do something like this and update the image in real time, because it's not simple to scrape.

[screenshot]

Extravi commented 6 months ago

[screenshot]

Extravi commented 6 months ago

It will first display that to the user, because each user needs their own session to prevent more than one user from doing the same captcha at once.

amogusussy commented 6 months ago

I think a good solution could be having a backup search engine. When the instance gets rate limited, it should just switch to another engine, like Qwant, then wait ~30 minutes and retry Google. Then repeat until Google stops rate limiting.
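As a sketch, that cooldown logic might look like this (the engine callables are placeholders, not Araa's real scrapers):

```python
import time

class RateLimitedError(Exception):
    """Raised by a scraper when Google serves its captcha/sorry page."""

RETRY_AFTER = 30 * 60      # ~30 minutes before Google is retried
_rate_limited_until = 0.0  # wall-clock time until which Google is skipped

def search(query, google_search, qwant_search):
    """Try Google unless it's cooling down; fall back to Qwant."""
    global _rate_limited_until
    if time.time() >= _rate_limited_until:
        try:
            return google_search(query)
        except RateLimitedError:
            _rate_limited_until = time.time() + RETRY_AFTER
    return qwant_search(query)
```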

Extravi commented 6 months ago

> I think a good solution could be having a backup search engine. When the instance gets rate limited, it should just switch to another engine, like Qwant, then wait ~30 minutes and retry Google. Then repeat until Google stops rate limiting.

I already wrote most of the code for the captcha proxy.

Extravi commented 6 months ago

Also, look at LibreY and its fallback system; it's not necessarily clean or well done.

Extravi commented 6 months ago

I will likely add support for other engines at some point, but the user should be able to use Google if they want, rate limited or not.

Extravi commented 6 months ago

That's why I'm working on the proxy.

Extravi commented 6 months ago

@amogusussy I found a better way to proxy the captcha, using sessions and an iframe, sending data to the iframe from the server, like the sitekey and s-data. [screenshot]
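A minimal sketch of that idea, assuming Flask sessions (the session keys are illustrative, the sitekey/s-data values would have been scraped from Google's sorry page, and `data-s` is the attribute that page's reCAPTCHA widget uses for the s-data):

```python
from flask import Flask, render_template_string, session

app = Flask(__name__)
app.secret_key = "change-me"  # per-user sessions need a secret key

# Served inside the iframe; sitekey and s-data come from the server side.
CAPTCHA_PAGE = """
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
<div class="g-recaptcha"
     data-sitekey="{{ sitekey }}"
     data-s="{{ s_data }}"></div>
"""

@app.route("/captcha")
def captcha():
    # Values previously scraped from Google's sorry page and stashed
    # in this user's session (illustrative keys).
    return render_template_string(CAPTCHA_PAGE,
                                  sitekey=session.get("sitekey", ""),
                                  s_data=session.get("s_data", ""))
```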

amogusussy commented 6 months ago

Have you tried that with a different device? If you send an iframe to the user's device, whatever site it's loading will just think it's a request from the user, so it won't be rate limited. This will probably only seem like it's working, since you're testing it on the device that's rate limited. A non-rate-limited user will just get sent the normal page.

Extravi commented 6 months ago

> Have you tried that with a different device? If you send an iframe to the user's device, whatever site it's loading will just think it's a request from the user, so it won't be rate limited. This will probably only seem like it's working, since you're testing it on the device that's rate limited. A non-rate-limited user will just get sent the normal page.

Yes, I'm aware of the normal-page thing. I've been testing for hours; it'll be fine once it's out/done.

Extravi commented 6 months ago

It's going to need an entire local proxy server for this to work. I found https://mitmproxy.org/, but do you know any better HTTP proxies?

Extravi commented 6 months ago

reCAPTCHA needs to be done using the server's IPs, so I need to proxy everything to the end user.
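In its simplest form, "proxying everything" could be a pass-through route like the sketch below (written with Flask and `requests` for illustration rather than mitmproxy; the path layout is made up):

```python
import requests
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/proxy/<path:path>")
def proxy(path):
    # Fetch the upstream resource from the *server's* IP, then relay
    # the body to the end user, so Google sees the instance, not them.
    upstream = requests.get(f"https://www.google.com/{path}",
                            params=request.args, timeout=10)
    return Response(upstream.content,
                    status=upstream.status_code,
                    content_type=upstream.headers.get("Content-Type",
                                                      "text/html"))
```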

amogusussy commented 6 months ago

I've found this list of alternatives for Linux, but I don't really know what makes a proxy better/worse.

Extravi commented 6 months ago

I think I should use a paid captcha solver service

Extravi commented 6 months ago

Because it's not necessarily practical, or even possible, to proxy it to my users.

Extravi commented 6 months ago

I'm going to drop this for now and add support for a different engine as a backup.

Extravi commented 6 months ago

Any ideas for what engine I should use for the backup?

Extravi commented 6 months ago

Also, I will be implementing the backup engine, so there is a template to build off of.

Extravi commented 6 months ago

I want everything to look like it belongs, unlike LibreY and its broken system.

Extravi commented 6 months ago

A captcha proxy costs less than the Google Search API.

amogusussy commented 6 months ago

Qwant has a free API. The only problem is that it doesn't show the Wikipedia results in the API, so you'll have to scrape those yourself. There's also DuckDuckGo, Startpage, Yahoo, and Brave. If you need help with scraping them, you can have a look at SearXNG's source code, since all the engines I mentioned are included in SearXNG.

I think we should standardize the results into one dict/JSON object, like what's been done with the torrent results. If we do that, it'll be 10x easier to add new engines, and maybe even give the user the ability to choose which engines they want to use.
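For example, the shared shape could be a small dataclass like this (field names are only a proposal):

```python
from dataclasses import asdict, dataclass

@dataclass
class Result:
    """One search result, normalized across engines."""
    title: str
    url: str
    description: str
    engine: str  # which backend produced it, e.g. "google" or "qwant"

def normalize_qwant(item):
    # Hypothetical mapping from one Qwant API item to the shared shape;
    # each engine would get its own normalizer like this.
    return Result(title=item.get("title", ""),
                  url=item.get("url", ""),
                  description=item.get("desc", ""),
                  engine="qwant")

# asdict(result) then yields the plain dict/JSON object the templates use.
```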

Extravi commented 6 months ago

Question: what do you think of anonymous data submission of search results as an opt-in feature? It would only collect the subdomain and domain of each result for the query, but it wouldn't record the query itself. So, for example, www.youtube.com or github.com, nothing after the /.
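Concretely, the submission would keep only the host part of each result URL, e.g.:

```python
from urllib.parse import urlsplit

def collectable_host(result_url):
    """Keep only subdomain + domain; everything after the / is dropped."""
    return urlsplit(result_url).hostname or ""

assert collectable_host("https://www.youtube.com/watch?v=abc") == "www.youtube.com"
assert collectable_host("https://github.com/Extravi/araa-search") == "github.com"
```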

Extravi commented 6 months ago

I would use that data to index and improve aspects of the search results and make results more visual

Extravi commented 6 months ago

Such as favicon indexing, etc. I might collect YouTube channel URLs too, so the part after the / for those, but that's so I can index all channels over 10k subscribers.

Extravi commented 6 months ago

So I can do things like this: [screenshots]

Extravi commented 6 months ago

The data collection code would be open source and anonymous.

Extravi commented 6 months ago

And if the user wants to opt out of the setting (turned on by default in settings), they can.

Extravi commented 6 months ago

I want to index some stuff that each engine in Araa can use, like Qwant, Google, etc.

Extravi commented 6 months ago

I want to make results look more visual and modern.

Extravi commented 6 months ago

I want it to be on par with closed-source meta search engines, and for that to work, some data collection may be required.

Extravi commented 6 months ago

It's only an idea, and it does not mean it will happen.

Extravi commented 6 months ago

It's something I want to do, but if I do decide to develop it, something might change, resulting in it getting dropped.

amogusussy commented 6 months ago

Something like that seems too far out of the reach of this project. I do think a feature like that could be good, though. If there's a link within the first 3 results that links to youtube.com/c/ (to check if it's likely that the user's searching for that channel), then you could scrape Social Blade for info about the channel. That could also expand further into other widgets, like for weather or sports results.
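That check could be as simple as scanning the top of the result list (assuming the normalized result shape discussed above):

```python
def likely_channel_query(results):
    """Return a channel URL if one of the first 3 results points at a
    YouTube channel page, else None; a hit would trigger the widget."""
    for result in results[:3]:
        url = result.get("url", "")
        if "youtube.com/c/" in url or "youtube.com/channel/" in url:
            return url
    return None
```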

Extravi commented 6 months ago

It's not necessarily far out of reach. It's common to come across the same websites in the search results for different queries, and many people will search/request the same websites from time to time, so after the first request it will index the favicon, etc., and pair it with that subdomain and domain.

Extravi commented 6 months ago

Medium and other article sites are quite common, for example, or even Stack Overflow for a coding-related query.

Extravi commented 6 months ago

Most people only really go to the top 1000 or so sites, and it will naturally index information for those sites (and many other sites) over time.

Extravi commented 6 months ago

It may seem far out of reach, but when you really think about it and about user habits, it isn't impossible.

Extravi commented 6 months ago

The indexer application would have to be a separate project from this repo, and this repo would only use the data the indexer produces from the data collected by this repository.

Extravi commented 6 months ago

> Something like that seems too far out of the reach of this project. I do think a feature like that could be good, though. If there's a link within the first 3 results that links to youtube.com/c/ (to check if it's likely that the user's searching for that channel), then you could scrape Social Blade for info about the channel. That could also expand further into other widgets, like for weather or sports results.

Also, due to speed, it will only show that data after it has been indexed by the other application.

Extravi commented 6 months ago

The indexer might be MIT-licensed or something, I'm not sure, but any data it produces likely won't be subject to the GPL.

Extravi commented 6 months ago

> Something like that seems too far out of the reach of this project. I do think a feature like that could be good, though. If there's a link within the first 3 results that links to youtube.com/c/ (to check if it's likely that the user's searching for that channel), then you could scrape Social Blade for info about the channel. That could also expand further into other widgets, like for weather or sports results.

Yes, I want to add things like weather, news, and sports; those are also topics I want to index.

Extravi commented 6 months ago

I wouldn't index text results, because they're more compact than news or other topics/subjects.

Extravi commented 6 months ago

I would only get data to associate with text results

Extravi commented 6 months ago

This is a local autocomplete demo with some data collection. It could improve a ton, and then there would be no need to rely on DuckDuckGo, making it faster with some optimization. https://github.com/Extravi/araa-search/assets/98912029/9227502a-7a65-43b1-8e90-4912c157a86a
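At its core, a local autocomplete like this is prefix matching over a sorted list of known queries; a toy sketch (not the demo's actual code):

```python
import bisect

class Autocomplete:
    """Sorted-list prefix lookup; bisect keeps it fast even when the
    suggestion list grows large."""

    def __init__(self, phrases):
        self.phrases = sorted(phrases)

    def suggest(self, prefix, limit=8):
        i = bisect.bisect_left(self.phrases, prefix)
        out = []
        while i < len(self.phrases) and self.phrases[i].startswith(prefix):
            out.append(self.phrases[i])
            i += 1
            if len(out) == limit:
                break
        return out

ac = Autocomplete(["linux", "linux mint", "lingering", "list"])
print(ac.suggest("lin"))  # ['lingering', 'linux', 'linux mint']
```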

Extravi commented 6 months ago

> This is a local autocomplete demo with some data collection. It could improve a ton, and then there would be no need to rely on DuckDuckGo, making it faster with some optimization. https://github.com/Extravi/araa-search/assets/98912029/9227502a-7a65-43b1-8e90-4912c157a86a

Read how many lines it has as of right now.

amogusussy commented 6 months ago

I'm more talking about how we'd need to use things like databases for all the favicons. If you want good speeds for it, you'll need to use a dedicated database, like SQLite, rather than using Python dicts. It probably could be done, but it might take a bit of time to do it right.
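A minimal sketch of that, with one SQLite table keyed by host (names are illustrative):

```python
import sqlite3

con = sqlite3.connect("favicons.db")
con.execute("CREATE TABLE IF NOT EXISTS favicons "
            "(host TEXT PRIMARY KEY, icon BLOB)")

def get_favicon(host):
    """Return the cached favicon bytes for a host, or None on a miss."""
    row = con.execute("SELECT icon FROM favicons WHERE host = ?",
                      (host,)).fetchone()
    return row[0] if row else None

def store_favicon(host, icon):
    """Insert or refresh the favicon for a host."""
    con.execute("INSERT OR REPLACE INTO favicons (host, icon) VALUES (?, ?)",
                (host, icon))
    con.commit()
```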

Do the search suggestions deal with misspelled words? If I go to DuckDuckGo and type 'liux', it gives a suggestion of 'linux', because it can guess what I was probably going for. Does this have anything similar yet?

Extravi commented 6 months ago

> I'm more talking about how we'd need to use things like databases for all the favicons. If you want good speeds for it, you'll need to use a dedicated database, like SQLite, rather than using Python dicts. It probably could be done, but it might take a bit of time to do it right.
>
> Do the search suggestions deal with misspelled words? If I go to DuckDuckGo and type 'liux', it gives a suggestion of 'linux', because it can guess what I was probably going for. Does this have anything similar yet?

Yes, it does check for spelling. [screenshot]
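One way a local list can handle typos is fuzzy matching with the standard library; a sketch (not necessarily what the demo does):

```python
import difflib

SUGGESTIONS = ["linux", "linux mint", "firefox", "flask"]

def fuzzy_suggest(typed, limit=5):
    """Prefer prefix matches, but fall back to close matches so a typo
    like 'liux' can still yield 'linux'."""
    exact = [s for s in SUGGESTIONS if s.startswith(typed)]
    if exact:
        return exact[:limit]
    return difflib.get_close_matches(typed, SUGGESTIONS, n=limit, cutoff=0.6)

print(fuzzy_suggest("liux"))  # ['linux']
```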

Extravi commented 6 months ago

[screenshot]

amogusussy commented 6 months ago

That looks good then. I think it should still keep DuckDuckGo by default, though, unless you make a way for it to actually guess what the user's going to type, besides using a list to look it up. 15% of Google's search queries are unique, so relying on a pre-generated list of possible queries, even with every previously searched query, would still leave you with a large chunk without a good result.