Extravi / araa-search

A privacy-respecting, ad-free, self-hosted Google metasearch engine with strong security that offers full API support and uses Qwant for images and DuckDuckGo for autocomplete.
https://araa.extravi.dev
GNU Affero General Public License v3.0

Changes to how requests are made. #108

Closed: Extravi closed this issue 11 months ago

Extravi commented 11 months ago

Requests made using makeHTMLRequest should look something like this to the server, making them more reliable. When possible, it will send available cookies to Google if 2Captcha support is enabled. (screenshot)

Extravi commented 11 months ago

# Imports needed for this snippet; WHITELISTED_DOMAINS and user_agents
# are defined elsewhere in the project.
from urllib.parse import unquote
import json
import random

import requests
from bs4 import BeautifulSoup

def makeHTMLRequest(url: str):
    # block unwanted requests caused by an edited cookie
    domain = unquote(url).split('/')[2]
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    # get google cookies
    with open("./2captcha.json", "r") as file:
        data = json.load(file)
    GOOGLE_OGPC_COOKIE = data["GOOGLE_OGPC_COOKIE"]
    GOOGLE_NID_COOKIE = data["GOOGLE_NID_COOKIE"]
    GOOGLE_AEC_COOKIE = data["GOOGLE_AEC_COOKIE"]
    GOOGLE_1P_JAR_COOKIE = data["GOOGLE_1P_JAR_COOKIE"]
    GOOGLE_ABUSE_COOKIE = data["GOOGLE_ABUSE_COOKIE"]

    # Choose a user-agent at random
    user_agent = random.choice(user_agents)
    headers = {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }

    cookies = {
        "OGPC": f"{GOOGLE_OGPC_COOKIE}",
        "NID": f"{GOOGLE_NID_COOKIE}",
        "AEC": f"{GOOGLE_AEC_COOKIE}",
        "1P_JAR": f"{GOOGLE_1P_JAR_COOKIE}",
        "GOOGLE_ABUSE_EXEMPTION": f"{GOOGLE_ABUSE_COOKIE}"
    }

    # Force all requests to only use IPv4
    requests.packages.urllib3.util.connection.HAS_IPV6 = False

    # Grab HTML content with the specific cookie
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")
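
For anyone following along, the function above expects ./2captcha.json to hold the five cookie values under these keys. A minimal sketch that writes a placeholder file (the values are dummies, not real cookies):

import json

# Keys match what makeHTMLRequest reads; values here are placeholders.
placeholder = {
    "GOOGLE_OGPC_COOKIE": "<OGPC cookie value>",
    "GOOGLE_NID_COOKIE": "<NID cookie value>",
    "GOOGLE_AEC_COOKIE": "<AEC cookie value>",
    "GOOGLE_1P_JAR_COOKIE": "<1P_JAR cookie value>",
    "GOOGLE_ABUSE_COOKIE": "<GOOGLE_ABUSE_EXEMPTION cookie value>",
}

with open("./2captcha.json", "w") as file:
    json.dump(placeholder, file, indent=2)
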
Extravi commented 11 months ago

this might be a useful cookie to add (screenshot)

amogusussy commented 11 months ago

Here's a few changes I'd add:

from urllib.parse import urlparse
import json
import random

import requests
from bs4 import BeautifulSoup

# Force all requests to only use IPv4
requests.packages.urllib3.util.connection.HAS_IPV6 = False

def makeHTMLRequest(url: str, is_google=False):
    # block unwanted request from an edited cookie
    domain = urlparse(url).netloc
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    if is_google:
        # get google cookies
        with open("./2captcha.json", "r") as file:
            data = json.load(file)
        cookies = {
            "OGPC": data["GOOGLE_OGPC_COOKIE"],
            "NID": data["GOOGLE_NID_COOKIE"],
            "AEC": data["GOOGLE_AEC_COOKIE"],
            "1P_JAR": data["GOOGLE_1P_JAR_COOKIE"],
            "GOOGLE_ABUSE_EXEMPTION": data["GOOGLE_ABUSE_COOKIE"]
        }
    else:
        cookies = {}

    headers = {
        "User-Agent": random.choice(user_agents),  # Choose a user-agent at random
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }

    # Grab HTML content with the specific cookie
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")

This sets requests.packages.urllib3.util.connection.HAS_IPV6 before the function, because it only needs to be set once. It uses urlparse rather than splitting strings, and it only sends the cookies if the function is called as makeHTMLRequest(url, is_google=True), so other requests don't send unnecessary cookies and don't waste time parsing the file. It also removes a few one-time-use variables, because they don't need to be variables.
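
A quick usage sketch of the changed signature (the URLs are illustrative and assume both domains are in WHITELISTED_DOMAINS):

# Google request: reads ./2captcha.json and sends the cookies along
soup = makeHTMLRequest("https://www.google.com/search?q=test", is_google=True)

# Any other request: no cookies are sent and the file is never parsed
soup = makeHTMLRequest("https://www.qwant.com/?q=test")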

Extravi commented 11 months ago

yeah I'm still working on that request function

Extravi commented 11 months ago

there will be more changes in the next few days

amogusussy commented 11 months ago

Also, the Accept header accepts */*, so all the other MIME types don't need to be specified.
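
Concretely, the header in the snippets above could then shrink to a bare wildcard. A sketch of the two forms (whether to actually shorten it is discussed below):

# Both values end up accepting any MIME type; the longer form only adds
# a preference order via q-weights.
accept_long = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8"
accept_short = "*/*"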

Extravi commented 11 months ago

I'm running tests on various captcha-blocked VPN connections to see which headers and cookies will make the request more reliable

Extravi commented 11 months ago

> Also, the Accept header accepts */*, so all the other MIME types don't need to be specified.

I'm still going to specify it just in case, and I'll continue to run tests

Extravi commented 11 months ago

2captcha is very cheap, but it adds up over time, so I need to make it harder to detect and block so that it uses the API less

Extravi commented 11 months ago

I noticed that once the first reCAPTCHA pops up, it's going to pop up more often, so I need to find ways to make the request system seem like a real user

Extravi commented 11 months ago

I did notice that a headless Chrome browser doesn't really use that much memory, and there are undetected versions of it, so that could become a scraping option in the config at some point

Extravi commented 11 months ago

nvm that might not be practical

Extravi commented 11 months ago

odd, I can't seem to get the "_GRECAPTCHA" cookie

Extravi commented 11 months ago

oh, in Chrome-based browsers it's not stored under cookies, it's stored under local storage

amogusussy commented 11 months ago

Does 2captcha also work for self-hosted instances without the hoster having to pay?

Extravi commented 11 months ago

> oh, in Chrome-based browsers it's not stored under cookies, it's stored under local storage

odd, I can't get it atm (screenshot)

Extravi commented 11 months ago

> Does 2captcha also work for self-hosted instances without the hoster having to pay?

no, but if you want to help test I can send you some credits

Extravi commented 11 months ago

> Does 2captcha also work for self-hosted instances without the hoster having to pay?

(screenshot)

Extravi commented 11 months ago

if you make an account, email me the email you used and I can send some credit that could help

Extravi commented 11 months ago

some cookies used by Google are region-based, so in the UK you won't get everything I can get testing in NA

Extravi commented 11 months ago

but "_GRECAPTCHA" is in the EU, UK and NA

Extravi commented 11 months ago

I also do my tests using high-load free VPN servers; to make sure a reCAPTCHA is triggered, I send a request in a private window using "https://www.google.com/search?q=google" (screenshot)

amogusussy commented 11 months ago

I think there should be an option in the config file for whether you want to use a captcha solver, then. Maybe have something like #106 (the PR isn't that great, so I might redo it) for when the admin chooses not to use a captcha solver. Having to pay will probably turn most people away from self-hosting.

Extravi commented 11 months ago

it's already an option in the config

Extravi commented 11 months ago

it's turned off by default, but I have it on for testing

(screenshot)
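
A rough sketch of gating the solver on that config option; the flag name ENABLE_2CAPTCHA is hypothetical, and the real key in araa-search's config may differ:

def get_google_cookies(config: dict) -> dict:
    # Attach the 2captcha-solved cookies only when the (hypothetical)
    # ENABLE_2CAPTCHA flag is on; it defaults to off, as described above.
    if not config.get("ENABLE_2CAPTCHA", False):
        return {}
    return {
        "OGPC": config["GOOGLE_OGPC_COOKIE"],
        "NID": config["GOOGLE_NID_COOKIE"],
        "AEC": config["GOOGLE_AEC_COOKIE"],
        "1P_JAR": config["GOOGLE_1P_JAR_COOKIE"],
        "GOOGLE_ABUSE_EXEMPTION": config["GOOGLE_ABUSE_COOKIE"],
    }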

Extravi commented 11 months ago

I have done a total of 182 captchas in my tests and only used 0.54 cents (screenshot)

Extravi commented 11 months ago

most of this is from debugging the code; on an instance it will use the API far less

amogusussy commented 11 months ago

Do you think an alternative search engine would be good for people who choose not to have it, though? Some people might just not want to give their credit card info to them. Also, how accurate is it? Does it usually work on the first attempt at solving, or does it take several? I know a lot of sites are now using AI-generated captchas, which might cause issues if Google starts using a different captcha.

Extravi commented 11 months ago

It does it on its first attempt; out of the 182 sent, it got one wrong. For the server to do everything with it and the web driver, it totals 43.99 seconds. (screenshot)

Extravi commented 11 months ago

results will look something like this in the file (screenshot)

Extravi commented 11 months ago

> Do you think an alternative search engine would be good for people who choose not to have it, though? Some people might just not want to give their credit card info to them. Also, how accurate is it? Does it usually work on the first attempt at solving, or does it take several? I know a lot of sites are now using AI-generated captchas, which might cause issues if Google starts using a different captcha.

Google's reCAPTCHA also uses AI, btw

Extravi commented 11 months ago

> Here's a few changes I'd add: […]

It's been added


# Imports for this snippet; WHITELISTED_DOMAINS, user_agents and
# load_config are defined elsewhere in the project.
from urllib.parse import unquote
import random

import requests
from bs4 import BeautifulSoup

# Force all requests to only use IPv4
requests.packages.urllib3.util.connection.HAS_IPV6 = False

def makeHTMLRequest(url: str, is_google=False):
    # block unwanted request from an edited cookie
    domain = unquote(url).split('/')[2]
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    if is_google:
        # get google cookies
        data = load_config()
        cookies = {
            "OGPC": data["GOOGLE_OGPC_COOKIE"],
            "NID": data["GOOGLE_NID_COOKIE"],
            "AEC": data["GOOGLE_AEC_COOKIE"],
            "1P_JAR": data["GOOGLE_1P_JAR_COOKIE"],
            "GOOGLE_ABUSE_EXEMPTION": data["GOOGLE_ABUSE_COOKIE"]
        }
    else:
        cookies = {}

    headers = {
        "User-Agent": random.choice(user_agents),  # Choose a user-agent at random
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }

    # Grab HTML content with the specific cookie
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")
Extravi commented 11 months ago

> alternative search engine

an alternative search engine is not a bad idea, and I will be looking into that soon, but I want to finish how requests are made first

Extravi commented 11 months ago

I'm going to add support to proxy Google autocomplete as a setting, because it's faster than DuckDuckGo's (screenshot)
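
A rough sketch of what proxying Google's suggest endpoint can look like; suggestqueries.google.com with client=firefox is a commonly used way to get JSON suggestions, but the implementation that actually lands in Araa may differ:

import requests

def google_autocomplete(query: str) -> list[str]:
    # With client=firefox the endpoint returns JSON shaped like
    # ["query", ["suggestion 1", "suggestion 2", ...]]
    resp = requests.get(
        "https://suggestqueries.google.com/complete/search",
        params={"client": "firefox", "q": query},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()[1]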

Extravi commented 11 months ago


Extravi commented 11 months ago

Each domain will now have its own persistent session, so I won't need to establish a new https/tls connection for each domain, and I can take advantage of connection reuse. This should greatly improve speeds. Also, each session will be isolated and have its own cookies, etc., making everything more reliable.
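
A minimal sketch of the per-domain session idea (names are illustrative, not the actual araa-search code):

import requests
from urllib.parse import urlparse

# One requests.Session per domain: a Session keeps a connection pool, so
# repeat requests to the same host reuse the open TLS connection, and its
# cookies stay isolated from every other domain's session.
sessions: dict[str, requests.Session] = {}

def get_session(url: str) -> requests.Session:
    domain = urlparse(url).netloc
    if domain not in sessions:
        sessions[domain] = requests.Session()
    return sessions[domain]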

Extravi commented 11 months ago

Video demo of what's possible with persistent sessions and connection reuse. Persistent sessions have already been added to my instance, but I cannot take advantage of connection reuse unless I set up a persistent session for each domain, and that's something I am currently working on. https://github.com/Extravi/araa-search/assets/98912029/96a7d011-9efe-4e03-9120-578760f97b77

Extravi commented 11 months ago

A good example is autocomplete. Instead of opening a new TLS/SSL connection for every input or request, it can just resume its connection to that domain. This will greatly reduce delay and improve response time.
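
Reusing the get_session helper sketched above, and assuming DuckDuckGo's /ac/ suggestion endpoint, the saving looks like this:

session = get_session("https://duckduckgo.com/ac/")

# Only the first request pays the TCP + TLS handshake; every later
# keystroke reuses the pooled connection, so per-request latency drops.
for q in ("py", "pyt", "pyth"):
    session.get("https://duckduckgo.com/ac/", params={"q": q}, timeout=5)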

Extravi commented 11 months ago

(screenshot)

I will need to check each request, and they will each need their own persistent session; each session is in memory/RAM, so it's quite fast

Extravi commented 11 months ago

I'll add it tomorrow with some other stuff

Extravi commented 11 months ago

@amogusussy @TEMtheLEM (screenshot)

Extravi commented 11 months ago

This change should add more redundancy and make everything faster and more reliable.

Extravi commented 11 months ago

the first request will look something like this (screenshot), and any request after will look like this (screenshot)

Extravi commented 11 months ago

Now there is no need to open a new connection every time, saving on response time and making everything faster.

Extravi commented 11 months ago

At the first request: (screenshots). Any request after: (screenshots).

Extravi commented 11 months ago

If you have any ideas on how I can make the requests better or faster, let me know.