def makeHTMLRequest(url: str):
    # block unwanted request from an edited cookie
    domain = unquote(url).split('/')[2]
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    # get google cookies
    with open("./2captcha.json", "r") as file:
        data = json.load(file)
        GOOGLE_OGPC_COOKIE = data["GOOGLE_OGPC_COOKIE"]
        GOOGLE_NID_COOKIE = data["GOOGLE_NID_COOKIE"]
        GOOGLE_AEC_COOKIE = data["GOOGLE_AEC_COOKIE"]
        GOOGLE_1P_JAR_COOKIE = data["GOOGLE_1P_JAR_COOKIE"]
        GOOGLE_ABUSE_COOKIE = data["GOOGLE_ABUSE_COOKIE"]

    # Choose a user-agent at random
    user_agent = random.choice(user_agents)
    headers = {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }
    cookies = {
        "OGPC": f"{GOOGLE_OGPC_COOKIE}",
        "NID": f"{GOOGLE_NID_COOKIE}",
        "AEC": f"{GOOGLE_AEC_COOKIE}",
        "1P_JAR": f"{GOOGLE_1P_JAR_COOKIE}",
        "GOOGLE_ABUSE_EXEMPTION": f"{GOOGLE_ABUSE_COOKIE}"
    }

    # Force all requests to only use IPv4
    requests.packages.urllib3.util.connection.HAS_IPV6 = False

    # Grab HTML content with the specific cookie
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")
this might be a useful cookie to add
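For reference, here is a rough sketch of what the ./2captcha.json file read above might contain. This is only an assumption based on the keys makeHTMLRequest looks up; the values are placeholders, not real cookie values.

import json

# Assumed layout of ./2captcha.json, based only on the keys the function
# above reads; the values are placeholders, not real cookies.
placeholder_cookies = {
    "GOOGLE_OGPC_COOKIE": "<OGPC cookie value>",
    "GOOGLE_NID_COOKIE": "<NID cookie value>",
    "GOOGLE_AEC_COOKIE": "<AEC cookie value>",
    "GOOGLE_1P_JAR_COOKIE": "<1P_JAR cookie value>",
    "GOOGLE_ABUSE_COOKIE": "<GOOGLE_ABUSE_EXEMPTION cookie value>"
}

with open("./2captcha.json", "w") as file:
    json.dump(placeholder_cookies, file, indent=4)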
Here are a few changes I'd add:
from urllib.parse import urlparse

# Force all requests to only use IPv4
requests.packages.urllib3.util.connection.HAS_IPV6 = False

def makeHTMLRequest(url: str, is_google=False):
    # block unwanted request from an edited cookie
    domain = urlparse(url).netloc
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    if is_google:
        # get google cookies
        with open("./2captcha.json", "r") as file:
            data = json.load(file)
        cookies = {
            "OGPC": data["GOOGLE_OGPC_COOKIE"],
            "NID": data["GOOGLE_NID_COOKIE"],
            "AEC": data["GOOGLE_AEC_COOKIE"],
            "1P_JAR": data["GOOGLE_1P_JAR_COOKIE"],
            "GOOGLE_ABUSE_EXEMPTION": data["GOOGLE_ABUSE_COOKIE"]
        }
    else:
        cookies = {}

    headers = {
        "User-Agent": random.choice(user_agents),  # Choose a user-agent at random
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }

    # Grab HTML content with the specific cookie
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")
This sets requests.packages.urllib3.util.connection.HAS_IPV6 before the function, because it only needs to be set once. It uses urlparse rather than splitting strings. It only uses the cookies if the function is called as makeHTMLRequest(url, is_google=True), so other requests don't send unnecessary cookies and don't waste time parsing the file. It also removes a few one-time-use variables, because they don't need to be variables.
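To illustrate the flag (these call sites are hypothetical, not taken from the repo), a Google request would opt in to the cookies while everything else uses the default:

# Reads 2captcha.json and sends the saved Google cookies with the request
results = makeHTMLRequest("https://www.google.com/search?q=test", is_google=True)

# Sends no cookies and never touches 2captcha.json
# (the domain still has to be in WHITELISTED_DOMAINS)
page = makeHTMLRequest("https://duckduckgo.com/html/?q=test")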
Yeah, I'm still working on that request function.
There will be more changes in the next few days.
Also, the Accept header accepts */*, so all the other MIME types don't need to be specified.
I'm running tests on various captcha-blocked VPN connections to see which headers and cookies will make the request more reliable.
I'm still going to specify it just in case, and I'll continue to run tests.
2captcha is very cheap, but it adds up over time, so I need to make it harder to detect and block so it uses the API less.
Once the first reCAPTCHA pops up, I noticed it's going to pop up more often, so I need to find ways to make the request system seem like a real user.
I did notice that a headless Chrome browser doesn't really use that much memory, and there are undetected versions of it, so that could become a scraping option in the config at some point.
nvm that might not be practical
Odd, I can't seem to get the "_GRECAPTCHA" cookie.
Oh, in Chrome-based browsers it's not stored under cookies; it's stored under local storage.
Does 2captcha also work for self-hosted instances without the host having to pay?
Odd, I can't get the _GRECAPTCHA cookie at the moment.
No, but if you want to help test, I can send you some credits.
If you make an account, email me the email you used and I can send some credit; that could help.
Some cookies used by Google are region-based, so in the UK you won't get everything I can get testing in NA, but "_GRECAPTCHA" is in the EU, UK, and NA.
I also do my tests using high-load free VPN servers. To make sure the connection triggers a reCAPTCHA, I send a request in a private window using "https://www.google.com/search?q=google".
I think there should be an option in the config file for whether you want to use a captcha solver, then. Maybe have something like #106 (the PR isn't that great, so I might redo it) for when the admin chooses not to use a captcha solver. Having to pay will probably turn most people away from self-hosting.
It's already an option in the config.
It's turned off by default, but I have it on for testing.
I have done a total of 182 captchas in my tests and only used $0.54.
Most of this is from debugging the code; on an instance it will use the API far less.
Do you think an alternative search engine would be good for people who choose not to have it, though? Some people might just not want to give their credit card info to them. Also, how accurate is it? Does it usually work on the first attempt at solving, or does it take several? I know a lot of sites are now using AI-generated captchas, which might cause issues if Google starts using a different captcha.
It does it on its first attempt; out of the 182 sent, it only got one wrong. For the server to do everything with it and the web driver, it totals 43.99 seconds.
Results will look something like this in the file.
Google's reCAPTCHA also uses AI, btw.
The suggested changes have been added:
from urllib.parse import unquote, urlparse

# Force all requests to only use IPv4
requests.packages.urllib3.util.connection.HAS_IPV6 = False

def makeHTMLRequest(url: str, is_google=False):
    # block unwanted request from an edited cookie
    domain = unquote(url).split('/')[2]
    if domain not in WHITELISTED_DOMAINS:
        raise Exception(f"The domain '{domain}' is not whitelisted.")

    if is_google:
        # get google cookies
        data = load_config()
        cookies = {
            "OGPC": data["GOOGLE_OGPC_COOKIE"],
            "NID": data["GOOGLE_NID_COOKIE"],
            "AEC": data["GOOGLE_AEC_COOKIE"],
            "1P_JAR": data["GOOGLE_1P_JAR_COOKIE"],
            "GOOGLE_ABUSE_EXEMPTION": data["GOOGLE_ABUSE_COOKIE"]
        }
    else:
        cookies = {}

    headers = {
        "User-Agent": random.choice(user_agents),  # Choose a user-agent at random
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-US,en;q=0.5",
        "Dnt": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    }

    # Grab HTML content with the specific cookie
    html = requests.get(url, headers=headers, cookies=cookies)

    # Return the BeautifulSoup object
    return BeautifulSoup(html.text, "lxml")
An alternative search engine is not a bad idea, and I will be looking into that soon, but I want to finish how requests are made first.
I'm going to add support to proxy Google autocomplete as a setting, because it's faster than DuckDuckGo's.
Each domain will now have its own persistent session, so I won't need to establish a new https/tls connection for each domain, and I can take advantage of connection reuse. This should greatly improve speeds. Also, each session will be isolated and have its own cookies, etc., making everything more reliable.
Video demo of what's possible with persistent sessions and connection reuse. Persistent sessions have already been added to my instance, but I cannot take advantage of connection reuse unless I set up a persistent session for each domain, and that's something I am currently working on. https://github.com/Extravi/araa-search/assets/98912029/96a7d011-9efe-4e03-9120-578760f97b77
A good example is autocomplete. Instead of opening a new TLS/SSL connection for every input or request, it can just resume its connection to that domain. This will greatly reduce delay and improve response time.
I will need to check each request, and each one will need its own persistent session; each session lives in memory/RAM, so it's quite fast.
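A minimal sketch of that idea, assuming a module-level cache keyed by domain (the SESSIONS dict and get_session helper below are illustrative names, not the code that actually landed):

import requests

# One requests.Session per domain, kept in memory so the underlying
# TCP/TLS connection can be reused by later requests to the same host.
SESSIONS = {}

def get_session(domain: str) -> requests.Session:
    if domain not in SESSIONS:
        SESSIONS[domain] = requests.Session()
    return SESSIONS[domain]

In this sketch, makeHTMLRequest would call get_session(domain).get(url, headers=headers, cookies=cookies) instead of requests.get, so repeated requests to the same domain, such as autocomplete lookups, skip the extra TLS handshake.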
I'll add it tomorrow with some other stuff.
@amogusussy @TEMtheLEM
This change should add more redundancy and make everything faster and more reliable.
The first request will look something like this, and any request after will look like this.
Now there is no need to start a new connection for every request, saving on response time and making everything faster.
If you have any ideas on how I can make the requests even better or faster, let me know.
Requests made using makeHTMLRequest should look something like this to the server, making them more reliable; when possible, it will also send the available cookies to Google if 2Captcha support is enabled.