Extravi closed this issue 8 months ago.
I might have to do something like this and update the image in real time, because it's not simple to scrape.
It will first display that to the user, because each user needs their own session to prevent more than one user from doing the same captcha at once.
I think a good solution could be having a backup search engine. When the instance gets rate-limited, it should just switch to another engine, like Qwant, then wait ~30 minutes and retry Google, repeating until Google stops rate-limiting.
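The fallback-with-cooldown idea could be sketched roughly like this; the engine functions and exception are illustrative placeholders, not existing Araa code:

```python
import time

# Hypothetical engine interface: each search function takes a query and
# either returns results or raises RateLimitedError. Names are assumptions.
class RateLimitedError(Exception):
    pass

RETRY_COOLDOWN = 30 * 60  # seconds to wait before retrying Google

_google_blocked_until = 0.0  # timestamp until which Google is skipped

def search(query, google_search, qwant_search, now=time.time):
    """Try Google first; on a rate limit, fall back to Qwant and
    avoid Google again until the cooldown expires."""
    global _google_blocked_until
    if now() >= _google_blocked_until:
        try:
            return google_search(query)
        except RateLimitedError:
            # Start the ~30 minute cooldown, then fall through to the backup.
            _google_blocked_until = now() + RETRY_COOLDOWN
    return qwant_search(query)
```

The cooldown timestamp means every request during the 30-minute window goes straight to the backup engine instead of hammering Google again.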
I already wrote most of the code for the captcha proxy.
Also look at LibreY and its fallback system; it's not necessarily clean or well done.
I will likely add support for other engines at some point, but users should be able to use Google if they want, rate-limited or not.
That's why I'm working on the proxy.
@amogusussy I found a better way to proxy the captcha using sessions and an iframe, sending data like the sitekey and s-data from the server to the iframe.
Have you tried that from a different device? If you send an iframe to the user's device, whatever site it's loading will just see a request from the user, so it won't be rate-limited. This will probably only seem like it's working because you're testing it on the device that's rate-limited; a non-rate-limited user will just get sent the normal page.
Yes, I'm aware of the normal-page issue; I've been testing for hours. It'll be fine once it's out/done.
It's going to need an entire local proxy server for this to work. I found https://mitmproxy.org/, but do you know any better HTTP proxies?
reCAPTCHA needs to be completed using the server's IPs, so I need to proxy everything to the end user.
I've found this list of alternatives for Linux, but I don't really know what makes a proxy better/worse.
I think I should use a paid captcha-solver service, because it's not necessarily practical, or even possible, to proxy it to my users.
I'm going to drop this for now and add support for a different engine as a backup.
Any ideas for which engine I should use as the backup?
Also, I will be implementing the backup engine, so there is a template to build off of.
I want everything to look like it belongs, unlike LibreY and its broken system.
A captcha proxy costs less than the Google Search API.
Qwant has a free API. The only problem is that it doesn't include the Wikipedia results, so you'll have to scrape those yourself. There's also DuckDuckGo, Startpage, Yahoo, and Brave. If you need help with scraping them, you can have a look at SearXNG's source code, since all the engines I mentioned are included in SearXNG.
I think we should standardize the results into one dict/JSON object, like what's been done with the torrent results. If we do that, it'll be far easier to add new engines, and we could even give users the ability to choose which engines they want to use.
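A minimal sketch of what that standardized shape could look like; the field names and helper functions here are my assumptions, not an existing Araa API:

```python
# One normalized result shape that every engine adapter would return.
def normalize_result(title, url, description, engine):
    """Coerce an engine-specific hit into the shared schema."""
    return {
        "title": title.strip(),
        "url": url,
        "description": description.strip(),
        "engine": engine,  # which backend produced this hit
    }

def merge_results(*engine_batches):
    """Flatten per-engine batches into one list, dropping duplicate URLs
    so the same page fetched from two engines only appears once."""
    seen, merged = set(), []
    for batch in engine_batches:
        for hit in batch:
            if hit["url"] not in seen:
                seen.add(hit["url"])
                merged.append(hit)
    return merged
```

With every adapter emitting this one dict shape, adding a new engine is just writing one `normalize_result`-producing function, and user-selectable engines become a matter of choosing which batches to pass to `merge_results`.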
Question: what do you think of anonymous data submission of search results as an opt-in feature? It would only collect the subdomain and domain for each result for the query, but it wouldn't record the query itself, so e.g. www.youtube.com or github.com, nothing after the /.
I would use that data to index and improve aspects of the search results and make results more visual,
such as favicon indexing, etc. I might collect YouTube channel URLs too (so the path after the / for those), but that's so I can index all channels over 10k subscribers.
So I can do things like this and this.
The data-collection code would be open source and anonymous.
And if users want to opt out of the setting (turned on by default in settings), they can.
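The domain-only collection described above could be sketched like this; the YouTube carve-out for channel URLs is my guess at how it would work, not confirmed behavior:

```python
from urllib.parse import urlparse

def anonymize(url):
    """Keep only the host (subdomain + domain); everything after the
    first '/' is discarded, so queries and paths never leave the
    instance. A hypothetical carve-out keeps YouTube channel prefixes
    so channels can be indexed."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path
    if host.endswith("youtube.com") and (
        path.startswith("/c/")
        or path.startswith("/channel/")
        or path.startswith("/@")
    ):
        # Keep just the channel identifier, nothing further down the path.
        parts = path.strip("/").split("/")
        return host + "/" + "/".join(parts[:2])
    return host
```

Anything like video IDs, query strings, or article slugs is thrown away before submission, so only the site identity is ever recorded.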
I want to index some data in Araa that each engine, like Qwant and Google, can use.
I want to make results look more visual and modern.
I want it to be on par with closed-source meta search engines, and for that to work, some data collection may be required.
It's only an idea and doesn't mean it will happen.
It's something I want to do, but if I do decide to develop it, something might change and it could get dropped.
Something like that seems too far outside the reach of this project.
I do think a feature like that could be good, though. If there's a link within the first 3 results that points to youtube.com/c/
(to check whether it's likely the user is searching for that channel), then you could scrape Social Blade for info about the channel.
That could also expand further into other widgets, like weather or sports results.
It's not necessarily out of reach. It's common to come across the same websites in the results for different queries, and many people will search for the same websites from time to time, so after the first request it will index the favicon etc. and pair it with that subdomain and domain.
Sites like Medium and other article sites are quite common, as is Stack Overflow for coding-related queries.
Most people only really visit the top 1,000 or so sites, so it will naturally index information for those sites, and many others, over time.
It may seem far out of reach, but when you really think about user habits, it isn't impossible.
The indexer application would have to be a separate project from this repo; this repo would only use the data it produces from what's collected here.
Also, for speed, it will only show that data after the other application has indexed it.
The indexer might be MIT-licensed or something, I'm not sure, but any data it produces likely won't be subject to the GPL.
Yes, I want to add things like weather, news, and sports; those are also topics I want to index.
I wouldn't index text results, because they're more compact than news or other topics/subjects.
I would only get data to associate with text results.
This is a local autocomplete demo with some data collection. It could improve a ton, and then there would be no need to rely on DuckDuckGo, making it faster with some optimization. https://github.com/Extravi/araa-search/assets/98912029/9227502a-7a65-43b1-8e90-4912c157a86a
Read how many lines it has as of right now.
I'm more talking about how we'd need to use things like databases for all the favicons. If you want good speeds, you'll need a dedicated database, like SQLite, rather than Python dicts. It probably could be done, but it might take a bit of time to do it right.
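For a sense of scale, the favicon store being suggested could be as small as this; the schema and function names are illustrative, not existing Araa code:

```python
import sqlite3

def open_favicon_db(path=":memory:"):
    """Open (or create) the favicon database. Pass a file path in
    production; :memory: is just for demonstration."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS favicons ("
        " host TEXT PRIMARY KEY,"  # subdomain + domain, e.g. www.youtube.com
        " icon BLOB NOT NULL)"     # raw favicon bytes
    )
    return db

def put_favicon(db, host, icon_bytes):
    # INSERT OR REPLACE keeps the newest icon for a host.
    db.execute(
        "INSERT OR REPLACE INTO favicons (host, icon) VALUES (?, ?)",
        (host, icon_bytes),
    )
    db.commit()

def get_favicon(db, host):
    row = db.execute(
        "SELECT icon FROM favicons WHERE host = ?", (host,)
    ).fetchone()
    return row[0] if row else None
```

SQLite gives you indexed lookups and persistence across restarts for free, which a Python dict can't, and `sqlite3` ships with the standard library so it adds no dependency.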
Do the search suggestions deal with misspelled words? If I go to DuckDuckGo and type 'liux', it suggests 'linux', because it can guess what I was probably going for. Does this have anything similar yet?
Yes, it does check for spelling.
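A typo-tolerant local lookup along those lines could be sketched with the standard library's `difflib`; the suggestion list here is a placeholder, not the demo's actual data:

```python
from difflib import get_close_matches

# Illustrative suggestion list; in practice this would come from the
# locally collected query data mentioned above.
SUGGESTIONS = ["linux", "linux mint", "firefox", "python", "privacy"]

def suggest(prefix, limit=5):
    """Return prefix matches first; if there are none, fall back to
    close fuzzy matches so typos like 'liux' still find 'linux'."""
    prefix = prefix.lower().strip()
    hits = [s for s in SUGGESTIONS if s.startswith(prefix)]
    if not hits:
        hits = get_close_matches(prefix, SUGGESTIONS, n=limit, cutoff=0.6)
    return hits[:limit]
```

`get_close_matches` ranks candidates by similarity ratio, so the nearest spelling comes back first; the `cutoff` keeps unrelated words out of the suggestions.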
That looks good then. I think it should still use DuckDuckGo by default though, unless you make a way for it to actually guess what the user is going to type, beyond looking it up in a list. 15% of Google's search queries are unique, so relying on a pre-generated list of possible queries, even with every previously searched query, would still leave a large chunk without a good result.
I'm working on a system that allows users to interact with reCAPTCHA. Whenever Araa gets rate-limited, it will load a web driver to proxy the captcha, allowing users to interact with it. If the user successfully completes the captcha, the web driver will capture the "GOOGLE_ABUSE_EXEMPTION=ID" cookie and send it in the request header to Google using makeHTMLRequest. Neither SearXNG nor other projects do this, so this will be the first.
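The cookie-capture step of that flow might look roughly like this; it assumes a Selenium-style driver object with `get` and `get_cookie` methods, and is a sketch of the described approach, not the actual Araa code:

```python
import time

ABUSE_COOKIE = "GOOGLE_ABUSE_EXEMPTION"

def cookie_header(cookie_value):
    """Build the Cookie header value to pass along with makeHTMLRequest."""
    return f"{ABUSE_COOKIE}={cookie_value}"

def capture_exemption_cookie(driver, captcha_url, timeout=120, sleep=None, clock=None):
    """Open the captcha page in the web driver, then poll for the
    exemption cookie Google sets once the user solves the captcha
    through the proxy. Returns the header value, or None on timeout."""
    sleep = sleep or time.sleep
    clock = clock or time.time
    driver.get(captcha_url)
    deadline = clock() + timeout
    while clock() < deadline:
        cookie = driver.get_cookie(ABUSE_COOKIE)  # dict or None in Selenium
        if cookie:
            return cookie_header(cookie["value"])
        sleep(1)
    return None
```

Because the driver runs on the server, the cookie is tied to the server's IP, which is exactly what makes the exemption usable for subsequent scraping requests.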