Closed choeffer closed 2 years ago
Interestingly, it sometimes works, but over the last few days it has crashed more often. Maybe they have changed their bot detection?
It seems that I can get rid of the error by getting a new IP address.
But I still get
[2021/04/20 23:21:29|config.py |INFO ]: Using config /home/choeffer/Dokumente/flathunter/config.yaml
Traceback (most recent call last):
File "/home/choeffer/Dokumente/flathunter/flathunt.py", line 89, in <module>
main()
File "/home/choeffer/Dokumente/flathunter/flathunt.py", line 86, in main
launch_flat_hunt(config)
File "/home/choeffer/Dokumente/flathunter/flathunt.py", line 46, in launch_flat_hunt
hunter.hunt_flats()
File "/home/choeffer/Dokumente/flathunter/flathunter/hunter.py", line 42, in hunt_flats
for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
File "/home/choeffer/Dokumente/flathunter/flathunter/hunter.py", line 21, in crawl_for_exposes
return chain(*[searcher.crawl(url, max_pages)
File "/home/choeffer/Dokumente/flathunter/flathunter/hunter.py", line 21, in <listcomp>
return chain(*[searcher.crawl(url, max_pages)
File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 136, in crawl
return self.get_results(url, max_pages)
File "/home/choeffer/Dokumente/flathunter/flathunter/crawl_immobilienscout.py", line 60, in get_results
soup = self.get_page(search_url, self.driver, page_no)
File "/home/choeffer/Dokumente/flathunter/flathunter/crawl_immobilienscout.py", line 120, in get_page
return self.get_soup_from_url(search_url.format(page_no), driver=driver, captcha_api_key=self.captcha_api_key, checkbox=self.checkbox, afterlogin_string=self.afterlogin_string)
File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 75, in get_soup_from_url
self.resolvecaptcha(driver, checkbox, afterlogin_string, captcha_api_key)
File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 151, in resolvecaptcha
iframe_present = self._check_if_iframe_visible(driver)
File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 207, in _check_if_iframe_visible
iframe = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
File "/home/choeffer/Dokumente/flathunter/venv/lib64/python3.9/site-packages/selenium/webdriver/support/wait.py", line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
when enabling 100% recognition at 2captcha.
I also often get the same error as above without using 100% recognition at 2captcha. Any ideas?
Regarding the 2captcha support, 100% recognition feature only works with Normal captcha.
Interestingly, sometimes I can still see new flats and it does not crash. But I have not found any pattern. I will investigate further.
I also found that this has been happening very regularly since 19.04.
Same here. Maybe it is an issue with their change to a new captcha system? For about a week now, different captchas than before have been appearing.
@pneismeis I have the same assumption. It might be that a captcha is detected, but the old pattern for reCAPTCHA v2 is used. Therefore, the program is waiting for a response it will never get.
File "/home/choeffer/Dokumente/flathunter/flathunter/crawl_immobilienscout.py", line 120, in get_page
return self.get_soup_from_url(search_url.format(page_no), driver=driver, captcha_api_key=self.captcha_api_key, checkbox=self.checkbox, afterlogin_string=self.afterlogin_string)
File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 75, in get_soup_from_url
self.resolvecaptcha(driver, checkbox, afterlogin_string, captcha_api_key)
File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 151, in resolvecaptcha
iframe_present = self._check_if_iframe_visible(driver)
File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 207, in _check_if_iframe_visible
iframe = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
I think we could solve it by matching the new captcha system which immoscout uses, maybe with a switch for recaptcha and the new captcha system.
I think https://2captcha.com/de/2captcha-api#solving_geetest needs to be implemented for immoscout, after investigating the content of immoscout's new captcha page. I was able to find the gt and challenge values there.
In https://github.com/flathunters/flathunter/blob/main/flathunter/abstract_crawler.py#L160-L182 there are some POST and GET requests which need to be modified, I guess. But I haven't yet understood the whole construct of flathunter around this function.
Hey @choeffer ,
Thanks for investigating that - it really helps a lot to have someone report the issue and do some investigating. I'll try and take a look at a fix in the coming days.
This part seems to contain the relevant code on the bot protection site:
<script src='https://www.google.com/recaptcha/api.js?hl=de'></script>
<script src="https://static.geetest.com/static/tools/gt.js"></script>
<script>
initGeetest({
gt: "0fdbade8a0fe41cba0ff758456d23dfa",
challenge: "5b64391babf2bc5a6b2d9a8340cd6399",
offline: false,
new_captcha: true,
lang: window.geetestLang || "en",
}, function (captchaObj) {
captchaObj.onSuccess(function () {
var obj = captchaObj.getValidate();
solvedCaptcha({
geetest_challenge: obj.geetest_challenge,
geetest_seccode: obj.geetest_seccode,
geetest_validate: obj.geetest_validate,
data: "3:X41YXeKEoY0Jt0g2trLvbg==:/iA+r889CvCKwh46gxWwkl1izbJlcVCnnU54hH/WLFm69/FkZEjLxcTiMnxho+Rf:YZ/wjiT5RG6qmrxPKDCpTrpoB+jZfHm259Ys8WNH71Q="
});
});
captchaObj.appendTo('#captcha-box');
});
</script>
<script>
function solvedCaptcha(payload) {
const timeoutMs = 10000;
protectionSubmitCaptcha("geetest", payload, timeoutMs, "3:ISKYPxyVqWelP+kqAzjRkg==:W3jUem1HommRbe3pRu6ZAlFZGCt5pbcLOTcmK7jsFzF9Pa+Wd+KxEpqATLDsObJm5H8SFp0FslvUQFssA0Jo/broaq7x/D42lyFauv5P+yQFjfk98ioAdzqUNu1kn/B+rAy3jOyWoCvzvn2lalTp09UMvb9PjwRKL+mUWLlft2nqX54cbQHxb762Awms0LqJ:d+muV3Xl0DmwjVBUgrIEuK64ZGO0gzWL6etLd2BugSA=")
.then(
function() {
window.location.reload(true);
},
function(error) {
console.log(error);
},
);
}
</script>
@codders I hope this snippet helps you as well. Thanks again for your effort of fixing and maintaining this useful tool!
@codders Just one thought. Maybe it might be possible and useful to leave the captchav2 code in place and add the geetest code with a switch, as they might use both in parallel and decide from time to time which they deliver. So the program could decide on the fly which method to use for the delivered captcha (if this is possible and easy to implement).
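The on-the-fly switch suggested above could be sketched roughly as follows. This is only a minimal illustration, not flathunter's actual code: the function name detect_captcha_kind is hypothetical, the initGeetest marker is taken from the page snippet earlier in this thread, and the g-recaptcha marker is an assumption about how the reCAPTCHA variant of the page identifies itself.

```python
import re


def detect_captcha_kind(page_source: str) -> str:
    """Return "geetest", "recaptcha", or "none" for a page's HTML source.

    Hypothetical helper: the marker strings are assumptions, not a documented API.
    """
    # GeeTest is checked first because the protection page quoted in this
    # thread loads Google's recaptcha api.js alongside gt.js.
    if re.search(r"initGeetest", page_source):
        return "geetest"
    if re.search(r"g-recaptcha", page_source):
        return "recaptcha"
    return "none"
```

A crawler could call this once per page load and dispatch to the matching solver, leaving the existing reCAPTCHA path untouched.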
https://github.com/2captcha/2captcha-python could help to integrate many solvers at once with a similar pattern without sending plain POST GET commands.
https://2captcha.com/de/p/geetest has some more information about geetest and a Python code example. Somehow the information on the 2captcha website is a bit cluttered.
And sometimes you only need to click a button to verify that you are not a robot. So they do not always roll out the geetest puzzle. I will try to provide this code snippet as well if it appears the next time.
I had a look at this today. It was possible for me to detect that GeeTest was there, and also to do that without disabling the recaptcha support. Unfortunately, I get ERROR_CAPTCHA_UNSOLVABLE back from the 2captcha API whenever I submit a GeeTest token and challenge.
I think this is related to the detail in the 2captcha API docs:
Important: you should get a new challenge value for each request to our API. Once captcha was loaded on the page the challenge value becomes invalid. You should inspect requests made to the website when page is loaded to identify a request that gets a new challenge value. Then you should make such request each time to get a valid challenge value.
When the Selenium browser gets the GeeTest token and challenge, it's already been displayed in the browser and a bunch of other Javascript has already run, which means that the challenge is already invalidated. The design of that page is tricky - the challenge isn't there when the page first loads, and it shows up later after Javascript has run.
So it's a pretty nasty (and not much fun) reverse engineering problem to get those details in a clean way, and I haven't had any success so far. I'm very open to other people taking a shot at it, but I don't feel like I'm in a place where I can dig deeper into it myself right now.
Sorry for that. In case you / anyone is interested, I've attached what I was trying to do here.
diff --git a/flathunter/abstract_crawler.py b/flathunter/abstract_crawler.py
index 35a4c84..56b0df3 100644
--- a/flathunter/abstract_crawler.py
+++ b/flathunter/abstract_crawler.py
@@ -71,7 +71,13 @@ class Crawler:
return self.get_soup_with_proxy(url)
if driver is not None:
driver.get(url)
- if re.search("g-recaptcha", driver.page_source):
+ sleep(4)
+ self.__log__.debug("Checking geetest: %s" % driver.execute_script(f'return window.GeeChallenge'))
+ if re.search("initGeetest", driver.page_source):
+ self.__log__.debug("Found geetest captcha - attempting to solve")
+ self.resolvegeetestcaptcha(driver, captcha_api_key)
+ elif re.search("g-recaptcha", driver.page_source):
+ self.__log__.debug("Found recaptcha captcha - attempting to solve")
self.resolvecaptcha(driver, checkbox, afterlogin_string, captcha_api_key)
return BeautifulSoup(driver.page_source, 'html.parser')
return BeautifulSoup(resp.content, 'html.parser')
@@ -147,6 +153,12 @@ class Crawler:
"""Loads additional detalis for an expose. Should be implemented in the subclass"""
return expose
+ def resolvegeetestcaptcha(self, driver, api_key: str):
+ gt = re.search('gt: \"([^"]+)\",', driver.page_source)
+ challenge = re.search('challenge: \"([^"]+)\",', driver.page_source)
+ if (gt is not None and challenge is not None):
+ self._solve_geetest(driver, api_key, gt.group(1), challenge.group(1))
+
def resolvecaptcha(self, driver, checkbox: bool, afterlogin_string: str = "", api_key: str = None):
iframe_present = self._check_if_iframe_visible(driver)
if checkbox is False and afterlogin_string == "" and iframe_present:
@@ -180,6 +192,28 @@ class Crawler:
driver.execute_script(f'solvedCaptcha("{recaptcha_answer}")')
self._check_if_iframe_not_visible(driver)
+ def _solve_geetest(self, driver, api_key: str, gt: str, challenge: str):
+ url = driver.current_url
+ self.__log__.debug(f"Attempting with gt: {gt} challenge: {challenge}")
+ session = requests.Session()
+ postrequest = (
+ f"http://2captcha.com/in.php?key={api_key}&method=geetest&gt={gt}&challenge={challenge}&pageurl={url}"
+ )
+ captcha_id = session.post(postrequest).text.split("|")[1]
+ geetest_answer = session.get(f"http://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}").text
+ while "CAPCHA_NOT_READY" in geetest_answer:
+ sleep(5)
+ self.__log__.debug("Captcha status: %s", geetest_answer)
+ geetest_answer = session.get(f"http://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}").text
+ self.__log__.debug("Captcha promise: %s", geetest_answer)
+# recaptcha_answer = recaptcha_answer.split("|")[1]
+# driver.execute_script(f'document.getElementById("g-recaptcha-response").innerHTML="{recaptcha_answer}";')
+ # TODO: Below function call can be different depending on the websites implementation. It is responsible for
+ # sending the the promise that we get from recaptcha_answer. For now, if it breaks, it is required to
+ # reverse engineer it by hand. Not sure if there is a way to automate it.
+# driver.execute_script(f'solvedCaptcha("{recaptcha_answer}")')
+# self._check_if_iframe_not_visible(driver)
+
def _clickcaptcha(self, driver, checkbox: bool):
driver.switch_to.frame(driver.find_element_by_tag_name("iframe"))
recaptcha_checkbox = driver.find_element_by_class_name("recaptcha-checkbox-checkmark")
https://youtu.be/oUKBX0lleUY?t=149 It seems there are three hidden fields which need to be filled out, and they are already present before the puzzle is loaded in a normal Firefox session. The puzzle only loads if you click the button. @codders does this info help you? From my understanding, the hidden fields need to be filled out before executing any further scripts.
You could also use https://2captcha.com/demo/geetest to verify if the python code is working properly and if the problem is specific to immoscout. They provide a geetest captcha on that page.
This happens after clicking the button.
And after moving the slider, it seems that the hidden fields are filled out and submitted. (But I could only barely see it, as it happened very fast.)
As I said, I've taken a look and it's complicated, for exactly the reasons you're describing. If someone wants to take a deeper look using the clues in this thread, they are very welcome. I'm not available to dive deeper into this right now.
Ah okay, I thought it might help you. Thanks again for having a look at the issue. Maybe someone else has a good idea how to fix it.
maybe you could scrape immosuchmaschine.de since this page is also scraping ImmobilienScout24 and a few other relatively unknown pages
@codders I tried to dig a bit further with the help of your diff, see above mentioned commit. I used the python debugger and was able to verify your result. Interestingly, this also happens on https://2captcha.com/de/demo/geetest where I would expect it to work.
#Both taken from the website
(Pdb) gt = '81388ea1fc187e0c335c0a8907ff2625'
(Pdb) challenge = 'e4d5929ab1505b0b6a081244d2041403'
(Pdb) url = 'https://2captcha.com/de/demo/geetest'
(Pdb) session = requests.Session()
(Pdb) postrequest = (f"http://2captcha.com/in.php?key={api_key}&method=geetest&gt={gt}&challenge={challenge}&pageurl={url}")
(Pdb) postrequest
'http://2captcha.com/in.php?key=XYZ&method=geetest&gt=81388ea1fc187e0c335c0a8907ff2625&challenge=e4d5929ab1505b0b6a081244d2041403&pageurl=https://2captcha.com/de/demo/geetest'
(Pdb) captcha_id = session.post(postrequest).text.split("|")[1]
(Pdb) session.get(f"http://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}").text
'ERROR_CAPTCHA_UNSOLVABLE'
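The session above relies on text.split("|")[1], which raises an IndexError whenever the service answers with a bare error code instead of OK|<value>. A slightly more defensive way to read these plain-text replies might look like the sketch below. The names parse_ok_response, poll_result and CaptchaServiceError are hypothetical, and fetch stands in for whatever function performs the actual res.php request; the attempt limit and delay are arbitrary assumptions.

```python
import time
from typing import Callable


class CaptchaServiceError(Exception):
    """Raised when a reply is not of the form OK|<value>."""


def parse_ok_response(text: str) -> str:
    """Return the value after "OK|", or raise with the raw reply as the message."""
    status, _, value = text.strip().partition("|")
    if status != "OK" or not value:
        raise CaptchaServiceError(text.strip())
    return value


def poll_result(fetch: Callable[[], str], max_attempts: int = 24, delay: float = 5.0) -> str:
    """Call fetch() until the reply is no longer CAPCHA_NOT_READY, with an upper bound."""
    for _ in range(max_attempts):
        text = fetch()
        if "CAPCHA_NOT_READY" not in text:
            return parse_ok_response(text)
        time.sleep(delay)
    raise CaptchaServiceError("timed out waiting for a result")
```

With this shape, a reply such as ERROR_CAPTCHA_UNSOLVABLE surfaces as a named exception instead of an IndexError, and the polling loop cannot spin forever.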
I also found out that the value of gt seems not to change at all (on immoscout).
The magic seems to happen there. I placed a debugger here
...
def _solve_geetest(self, driver, api_key: str, gt: str, challenge: str):
url = driver.current_url
import pdb; pdb.set_trace()
...
and deactivated --headless in config.yaml.
After looking at the gt.js code I found out that
(Pdb) driver.execute_script('return window.GeeGT')
'0fdbade8a0fe41cba0ff758456d23dfa'
(Pdb) driver.execute_script('return window.GeeChallenge')
'9f2f53e4d928b619e2cfa468cafbfab9'
are passed to window.initGeetest = function (userConfig, callback) { via userConfig.
So the real question is: where is the challenge generated? I am not familiar with JS, but to me it looks like the challenge is requested very early in the process and then passed to initGeetest. Does this help you get further, @codders?
There also seems to be a mechanism in place which requests a new challenge after reloading the website. Searching for e2ac69af207a8ea398e7f2526961d6f1 revealed these two GET requests.
And this challenge reset functionality is triggered every nine minutes by default. It uses the old challenge and receives a new challenge. So we should also be able to use this method to receive a new challenge and submit it to 2captcha, as long as the reset is not triggered internally by some JS code in the meantime, which would invalidate the challenge. But if we trigger it manually before the automatic reset and just store/use the response, it should be fine, I guess.
There are always two requests made: first one to immoscout, then a second one to geetest. Let's investigate the first pair.
The request to immoscout looks like the following:
https://www.immobilienscout24.de/assets/immo-1-17?d=www.immobilienscout24.de
Response:
{"token":"3:cR7PM0jDQBCnsLosc0eDFw==:+MdZiMFUVJ+jRhuDq4/U3C44/JGF1km4dDm5OznBxm3NhahMbPpuPFoFb93HK14LXf+xvqOsCvWBlgybpHcgeCiNtCnCkLFTjDK6MJTogaJBW770R+2fNAplVCq64AMj78xqewNuT24Uu1lT8m95dx1OuJdB8DGYWks4snVrSeNQg6xg4ugX0VjXXmkbcpH/rloPJmBzJd3Am7iueuAN1OlZqfbwNBOAbRQAlEESU6cz93BCosnUzn2wWVkJ66jO84upI9viSCtRkB+Dqyc99ibodXpRC30xUOejPc94V7chV0qTRitDoictNW1Y2MNI3S4B7boQFqT93HuCj27m0tS24LwUBD0GfMxzC+Tr0myAblvvQYp11syZQK9eBDNh0paRM32yuHaKatG/wjBJJyueQVI5MdSYT8kOqohgeyVjO5mxGxAhiNX58hWOfTV0:4saj+Y7Q/zsplO2EECWWCmHTF7QUrcCYRVHuzC9AkNM=","renewInSec":680,"cookieDomain":".immobilienscout24.de"}
It contains the following:
"token":"3:cR7PM0jDQBCnsLosc0eDFw==:+MdZiMFUVJ+jRhuDq4/U3C44/JGF1km4dDm5OznBxm3NhahMbPpuPFoFb93HK14LXf+xvqOsCvWBlgybpHcgeCiNtCnCkLFTjDK6MJTogaJBW770R+2fNAplVCq64AMj78xqewNuT24Uu1lT8m95dx1OuJdB8DGYWks4snVrSeNQg6xg4ugX0VjXXmkbcpH/rloPJmBzJd3Am7iueuAN1OlZqfbwNBOAbRQAlEESU6cz93BCosnUzn2wWVkJ66jO84upI9viSCtRkB+Dqyc99ibodXpRC30xUOejPc94V7chV0qTRitDoictNW1Y2MNI3S4B7boQFqT93HuCj27m0tS24LwUBD0GfMxzC+Tr0myAblvvQYp11syZQK9eBDNh0paRM32yuHaKatG/wjBJJyueQVI5MdSYT8kOqohgeyVjO5mxGxAhiNX58hWOfTV0:4saj+Y7Q/zsplO2EECWWCmHTF7QUrcCYRVHuzC9AkNM="
no clue what this value is used for; it changes with each response
"renewInSec":680
could indicate when a new reset request should be made, as seen above in https://github.com/flathunters/flathunter/issues/119#issuecomment-855050008
"cookieDomain":".immobilienscout24.de"
seems to be static and sent with each request
The first reset request to geetest looks like the following:
https://api.geetest.com/reset.php?gt=0fdbade8a0fe41cba0ff758456d23dfa&challenge=6d213da1d834769551af13cb808a9202&lang=de&w=BLDHwP6bcyA0Dbf2X3wvAAz6LZ(LrYedFaf74Ult6NgRbYbCdLQ3OKhj5OfLH4HA1ZP1DA(TtZhR6RGLlEnZpwM09e3VDO5drPM7o4hMCNbydw6fytKniZdckK7YZQY(P8RBA4d2uTPBCsQpA4vjHMK0qru5p6dQXk42GiDm)zhOuHa5HDFJGaPNydlm8zRejb8nmJcHh5)wmkYRXjbnBxc4vCRdBpFCI3WASRy(KGL7yCgWeE6uq4ozKoQkvAlCOjXTi3UM1iNJdIjT1057G7atogvlaCQNFbU(uAR7NPreqQBWLlYKQkyB0dszoEMw9t6SYxPHXbULQ80h(SSU9FU50TltT8YiFwhWtDj1SIU3rgMArSf3vjuwiD6r06CuEbK)JZBfgIGWA)N38WOwGQvaSWWPlQkhNLPFidSCU)OLsdI1mvRs8eSSrUWfVqi9v1yBnafzDQ8SmqadRRUGOBCZH3ydotxTnmche9apnwmXj)mgKlmQHXsYHrYSoU4rdnioDH0cwrAfy4hbRK9LR6(JYdfiJma2DuvI4IaAFjlDFvB5zu8SFdfsHt9u1W5CN(tdVEzep5YhmppoYCYbJwtv7pcggQ(uhBq43HqR0fh6S8BhBdDH2FttBFwjKzCrhj3qQsCAPAH(KCbIuyYEqAyC1eG)oI(MwO2AWSHfBrT5w1lqbPGDqAho(T76TPw4Df1piHUkLgQ)dGTrDQgmlrdN5uwp8fBv6iVz3FEf5d4kN5DzRXOwpEnYqUwmVtnAIAezQ36ROYcYFO0kpyxc9fkIknEs7A7bmNP6Rt4CyxVj4RZZwFRVF1yD)Zv2ogE51xPCQyimMcHVbyh(IrEl6LhFtOTCgBPlqGSzQuDx(BA7MAkP2Lf6y36t490oSk60W01rcPHvKCDKprbgRx8Ngw..691b23f5664dd8060df5bd0a4af6d2b56deb0f9ffe769ccd7d03ed19f9fccbff03445aee32a0a5fd0997d83f86df41b1d6d0c35e46081e1a0d9704c479e92591254984aa6a994ac23846ecb3b036fa541aa3c6eaf37c2b164bc4cca5e84b64b12ea54bcf095ae864a3cf3970166ad127f61bba2547bf4bca2e32506124f863cc&pt=0&client_type=web&callback=geetest_1622894289835
It contains the following:
gt=0fdbade8a0fe41cba0ff758456d23dfa
does not change, is known and can be retrieved by driver.execute_script(f'return window.GeeGT')
challenge=1ae62e97e6b53486aacb97f3ad4c6246
does change, is known and can be retrieved by driver.execute_script(f'return window.GeeChallenge'); interestingly, this value does not change at all after new reset requests are made. This indicates that the first challenge value is used and stored forever after initializing(?) window.initGeetest in gt.js
lang=de
should also be possible to retrieve and is used in the first get.php request. But I do not know the impact of that value
w=
no clue how this value is calculated. Seems to be a hash and changes with each new reset request.
client_type=web
no clue where this value comes from, but I assume this one is static
callback=geetest_1622894289835
seems to be the string geetest_ followed by a generated Epoch/Unix timestamp in milliseconds (maybe to calculate how long the challenge is valid?)
Response:
geetest_1622894289835({"status": "success", "data": {"s": "2d323943", "c": [12, 58, 98, 36, 43, 95, 62, 15, 12], "challenge": "f0c3a0886226e2fb735fbe833d177665"}})
It contains the following:
geetest_1622894289835
generated and sent by the reset request
"challenge": "f0c3a0886226e2fb735fbe833d177665"
received new value. Has to be stored/saved and cannot be retrieved by driver.execute_script(f'return window.GeeChallenge'). This value is also used by the next reset request to retrieve a new challenge
"s": "2d323943"
no clue right now
"c": [12, 58, 98, 36, 43, 95, 62, 15, 12]
no clue right now
The reset requests afterwards use the challenge value retrieved from the previous reset request. So it should be possible to request a new token by saving the old challenge value. As seen in https://github.com/flathunters/flathunter/issues/119#issuecomment-855050008, this is also done by the website itself after a timer expires.
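For anyone who wants to experiment with the reset response, the JSONP wrapper can be unpacked in a few lines of Python. This is only a sketch under the assumptions made in this comment: parse_jsonp and callback_timestamp are hypothetical helper names, and reading the digits in the callback name as a millisecond Unix timestamp is a guess from the observed value, not something confirmed by GeeTest documentation.

```python
import json
import re
from datetime import datetime, timezone

# Matches "<callback>(<json body>)" with an optional trailing semicolon.
JSONP_PATTERN = re.compile(
    r"^\s*(?P<callback>[A-Za-z_]\w*)\((?P<body>.*)\)\s*;?\s*$", re.DOTALL
)


def parse_jsonp(text: str):
    """Split a JSONP reply into its callback name and decoded JSON payload."""
    match = JSONP_PATTERN.match(text)
    if match is None:
        raise ValueError("not a JSONP response")
    return match.group("callback"), json.loads(match.group("body"))


def callback_timestamp(callback: str) -> datetime:
    """Interpret the digits after the last underscore as epoch milliseconds (assumption)."""
    millis = int(callback.rsplit("_", 1)[1])
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


# The reset response quoted above, used as sample input.
response = (
    'geetest_1622894289835({"status": "success", "data": {"s": "2d323943", '
    '"c": [12, 58, 98, 36, 43, 95, 62, 15, 12], '
    '"challenge": "f0c3a0886226e2fb735fbe833d177665"}})'
)
callback, payload = parse_jsonp(response)
new_challenge = payload["data"]["challenge"]
requested_at = callback_timestamp(callback)
```

Here new_challenge holds the value that the next reset request would have to send, which is the piece that cannot be read back from window.GeeChallenge.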
Conclusion
We need to find out how w=XYZ (from the reset request) is calculated. We also need to find out how both requests (and the earlier response from immoscout with "token":"XYZ") interact with each other, or whether they are independent.
Any news on this problem? Or at least a workaround so it's not crashing?
I have a workaround that avoids the crash and gets posts roughly every 5th run: in flathunter/abstract_crawler.py, around line 204, find the method below and add the last 2 lines.
def _check_if_iframe_visible(self, driver: selenium.webdriver.Chrome):
    try:
        iframe = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
            (By.CSS_SELECTOR, "iframe[src^='https://www.google.com/recaptcha/api2/anchor?']")))
        return iframe
    except NoSuchElementException:
        print("No iframe found, therefore no chaptcha verification necessary")
    except selenium.common.exceptions.TimeoutException:
        print("Timeout on recaptcha")
@BananaMinion thanks for digging further. At least this proves that, after some tries/time, they still deliver Google reCAPTCHA and not only geetest.
So a workaround/hack could be to reload the page until a Google reCAPTCHA is loaded(?) Then there would be no need to wait for the next round of visiting immoscout, just hoping to get a reCAPTCHA that time.
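The reload idea could be expressed as a small bounded retry loop. The sketch below is purely illustrative: load_until is a hypothetical helper, load_page would wrap something like driver.get(url) followed by reading driver.page_source, and accept is any check that decides whether the delivered page is one the existing code can handle. The attempt limit is an arbitrary assumption.

```python
from typing import Callable, Optional


def load_until(
    load_page: Callable[[], str],
    accept: Callable[[str], bool],
    max_attempts: int = 5,
) -> Optional[str]:
    """Reload until accept(source) is true; return None after max_attempts tries."""
    for _ in range(max_attempts):
        source = load_page()
        if accept(source):
            return source
    return None
```

Returning None instead of raising lets the caller skip this crawl round quietly, which matches the "try again next time" behaviour described above.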
I'm not sure if they switch to Google reCAPTCHA or just don't show any captcha. I could print a debug message to see whether the captcha is loaded or not.
I am getting the same.
I'm not sure if they switch to Google reCAPTCHA or just don't show any captcha.
I don't get any recaptcha anymore, only GeeCaptcha.
If I understand correctly, flathunter just has to be updated to support geetest, right? https://2captcha.com/2captcha-api#solving_geetest
Yeah, looks like it. Could someone with more Python skills add this? I only know PHP :D
Well, I'm a step closer but need help now. To bypass GeeTest you need to get the challenge and the gt token. Immobilienscout generates the challenge with an AJAX call which generates some random function. If anyone has a clue how to get the challenge from this call https://static.geetest.com/static/js/fullpage.9.0.7.js I can implement the GeeTest bypass.
I'm experiencing the same here with the Google ReCaptcha error:
(By.CSS_SELECTOR, "iframe[src^='https://www.google.com/recaptcha/api2/anchor?']")))
File "~/.local/share/virtualenvs/flathunter-HYqahW9g/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until raise TimeoutException(message, screen, stacktrace)
Full verbose output:
➜ flathunter git:(main) pipenv run python flathunt.py
[2021/11/18 13:41:08|config.py |INFO ]: Using config ~/flathunter/config.yaml
[2021/11/18 13:41:09|flathunt.py |DEBUG ]: Settings from config: <flathunter.config.Config object at 0x10a8c9b50>
[2021/11/18 13:41:09|crawl_immobilienscout.py|DEBUG ]: Got search URL https://www.immobilienscout24.de/Suche/shape/(...)
Traceback (most recent call last):
File "flathunt.py", line 95, in <module>
main()
File "flathunt.py", line 92, in main
launch_flat_hunt(config)
File "flathunt.py", line 47, in launch_flat_hunt
hunter.hunt_flats()
File "~/flathunter/flathunter/hunter.py", line 42, in hunt_flats
for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
File "~/flathunter/flathunter/hunter.py", line 22, in crawl_for_exposes
for searcher in self.config.searchers()
File "~/flathunter/flathunter/hunter.py", line 23, in <listcomp>
for url in self.config.get('urls', list())])
File "~/flathunter/flathunter/abstract_crawler.py", line 136, in crawl
return self.get_results(url, max_pages)
File "~/flathunter/flathunter/crawl_immobilienscout.py", line 60, in get_results
soup = self.get_page(search_url, self.driver, page_no)
File "~/flathunter/flathunter/crawl_immobilienscout.py", line 120, in get_page
return self.get_soup_from_url(search_url.format(page_no), driver=driver, captcha_api_key=self.captcha_api_key, checkbox=self.checkbox, afterlogin_string=self.afterlogin_string)
File "~/flathunter/flathunter/abstract_crawler.py", line 75, in get_soup_from_url
self.resolvecaptcha(driver, checkbox, afterlogin_string, captcha_api_key)
File "~/flathunter/flathunter/abstract_crawler.py", line 151, in resolvecaptcha
iframe_present = self._check_if_iframe_visible(driver)
File "~/flathunter/flathunter/abstract_crawler.py", line 208, in _check_if_iframe_visible
(By.CSS_SELECTOR, "iframe[src^='https://www.google.com/recaptcha/api2/anchor?']")))
File "~/.local/share/virtualenvs/flathunter-HYqahW9g/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Yeah, that will not be fixed soon. What I did was rewrite the code a little and start a normal browser instead of a headless one. In this browser I installed the plugin from 2captcha. That works for me.
Which browser are you using? @BananaMinion Could you share your changes?
https://github.com/flathunters/flathunter/issues/134#issuecomment-973226074 I will as soon as I have time.
Issue could be fixed with
driver.execute_cdp_cmd('Network.setBlockedURLs', {"urls": ["https://api.geetest.com/get.*"]})
driver.execute_cdp_cmd('Network.enable', {})
to prevent Chrome from retrieving the captcha before 2captcha is able to do so.
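Wrapped in a helper, the two commands above might look like the sketch below. The helper name block_urls is hypothetical, and RecordingDriver is only a stand-in so the snippet runs without a browser; with Selenium, the driver would need to be Chromium-based, since execute_cdp_cmd talks to the Chrome DevTools Protocol. The order of the two calls is kept exactly as posted above.

```python
from typing import Iterable


def block_urls(driver, patterns: Iterable[str]) -> None:
    """Ask a Chromium-based driver to block requests whose URL matches the patterns."""
    driver.execute_cdp_cmd("Network.setBlockedURLs", {"urls": list(patterns)})
    driver.execute_cdp_cmd("Network.enable", {})


class RecordingDriver:
    """Stand-in for a real driver (assumption for this demo): records CDP calls."""

    def __init__(self):
        self.calls = []

    def execute_cdp_cmd(self, cmd, params):
        self.calls.append((cmd, params))
        return {}


driver = RecordingDriver()
block_urls(driver, ["https://api.geetest.com/get.*"])
```

Keeping the pattern list as a parameter makes it easy to adjust if the blocked endpoint changes.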
Does yours work? I didn't know these commands :) But I'm no Python programmer :D
Yeah, I tested it: the captcha gets solved and it finds new flats again :) Me neither, I found that via https://stackoverflow.com/a/67850301
Ah nice, I found a solution with a non-headless browser, but I think yours is better.
This code is merged now. @choeffer @dnberlin do you want to see if this works for you?
Works awesome! We can close this issue I think.
Works perfectly here too! Thanks :)
Guys, I have a similar problem with a script that I wrote. Can you take a look? I'm desperate for a solution.
I got several errors like this, even when enabling "100% Recognition" at 2captcha. Any ideas?