flathunters / flathunter

A bot to help people with their rental real-estate search. 🏠🤖
GNU Affero General Public License v3.0
853 stars 183 forks source link

No flats coming through on Immoscout #577

Open zemaitistrys opened 6 months ago

zemaitistrys commented 6 months ago

Since May 2, 2024, no ads have been coming through on my bot, which is hosted on Google Cloud. The cloud project is set up exactly as in the tutorial and had been working fine for over a month. I have logs enabled on the cloud app, and every time the application runs I get the error "IS24 bot detection has identified our script as a bot - we've been blocked". Instead of attempting the captcha with 2captcha as it used to, it just closes the application.

Has anyone else encountered this, or does anyone know how to fix it?

fmmix commented 6 months ago

Same thing here. Sorry don't have any solution yet.

I also get this error: Got response from 2captcha/res: ERROR_CAPTCHA_UNSOLVABLE

I guess this is the main issue, and then the page just times out or something and gets detected as a bot.

mxfilerelatedcache commented 6 months ago

I think I have the same problem: Immoscout stopped working 2 days ago. I don't think it has to do with 2captcha, though, and I can't really wrap my head around what the error message I'm getting is telling us.

I'm using docker-compose on a Linux server with 2captcha enabled. This is the log from after starting the container:

Attaching to app-1
app-1  | [2024/05/08 05:48:25|config.py               |INFO    ]: Using config path /usr/src/app/config.yaml
app-1  | [2024/05/08 05:48:25|chrome_wrapper.py       |INFO    ]: Initializing Chrome WebDriver for crawler...
app-1  | [2024/05/08 05:48:25|patcher.py              |INFO    ]: patching driver executable /root/.local/share/undetected_chromedriver/undetected_chromedriver
app-1  | [2024/05/08 05:48:26|__init__.py             |INFO    ]: setting properties for headless
app-1  | [2024/05/08 05:48:27|immobilienscout.py      |WARNING ]: Unable to find IS24 variable in window
app-1  | [2024/05/08 05:48:27|immobilienscout.py      |ERROR   ]: IS24 bot detection has identified our script as a bot - we've been blocked
app-1  | [2024/05/08 05:48:28|chrome_wrapper.py       |INFO    ]: Initializing Chrome WebDriver for crawler...
app-1  | [2024/05/08 05:48:28|patcher.py              |INFO    ]: patching driver executable /root/.local/share/undetected_chromedriver/undetected_chromedriver
app-1  | [2024/05/08 05:48:29|__init__.py             |INFO    ]: setting properties for headless
app-1  | Traceback (most recent call last):
app-1  |   File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 793, in urlopen
app-1  |     response = self._make_request(
app-1  |                ^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 537, in _make_request
app-1  |     response = conn.getresponse()
app-1  |                ^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 466, in getresponse
app-1  |     httplib_response = super().getresponse()
app-1  |                        ^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/http/client.py", line 1395, in getresponse
app-1  |     response.begin()
app-1  |   File "/usr/local/lib/python3.11/http/client.py", line 325, in begin
app-1  |     version, status, reason = self._read_status()
app-1  |                               ^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/http/client.py", line 294, in _read_status
app-1  |     raise RemoteDisconnected("Remote end closed connection without"
app-1  | http.client.RemoteDisconnected: Remote end closed connection without response
app-1  | 
app-1  | During handling of the above exception, another exception occurred:
app-1  | 
app-1  | Traceback (most recent call last):
app-1  |   File "/usr/src/app/flathunt.py", line 99, in <module>
app-1  |     main()
app-1  |   File "/usr/src/app/flathunt.py", line 95, in main
app-1  |     launch_flat_hunt(config, heartbeat)
app-1  |   File "/usr/src/app/flathunt.py", line 44, in launch_flat_hunt
app-1  |     hunter.hunt_flats()
app-1  |   File "/usr/src/app/flathunter/hunter.py", line 56, in hunt_flats
app-1  |     for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
app-1  |                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/src/app/flathunter/hunter.py", line 35, in crawl_for_exposes
app-1  |     return chain(*[try_crawl(searcher, url, max_pages)
app-1  |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/src/app/flathunter/hunter.py", line 35, in <listcomp>
app-1  |     return chain(*[try_crawl(searcher, url, max_pages)
app-1  |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/src/app/flathunter/hunter.py", line 27, in try_crawl
app-1  |     return searcher.crawl(url, max_pages)
app-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/src/app/flathunter/abstract_crawler.py", line 151, in crawl
app-1  |     return self.get_results(url, max_pages)
app-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/src/app/flathunter/crawler/immobilienscout.py", line 96, in get_results
app-1  |     return self.get_entries_from_javascript()
app-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/src/app/flathunter/crawler/immobilienscout.py", line 120, in get_entries_from_javascript
app-1  |     result_json = self.get_driver_force().execute_script('return window.IS24.resultList;')
app-1  |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 407, in execute_script
app-1  |     return self.execute(command, {"script": script, "args": converted_args})["value"]
app-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 345, in execute
app-1  |     response = self.command_executor.execute(driver_command, params)
app-1  |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/remote_connection.py", line 302, in execute
app-1  |     return self._request(command_info[0], url, body=data)
app-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/remote_connection.py", line 322, in _request
app-1  |     response = self._conn.request(method, url, body=body, headers=headers)
app-1  |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/urllib3/_request_methods.py", line 144, in request
app-1  |     return self.request_encode_body(
app-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/urllib3/_request_methods.py", line 279, in request_encode_body
app-1  |     return self.urlopen(method, url, **extra_kw)
app-1  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/urllib3/poolmanager.py", line 444, in urlopen
app-1  |     response = conn.urlopen(method, u.request_uri, **kw)
app-1  |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 847, in urlopen
app-1  |     retries = retries.increment(
app-1  |               ^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/urllib3/util/retry.py", line 470, in increment
app-1  |     raise reraise(type(error), error, _stacktrace)
app-1  |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/urllib3/util/util.py", line 38, in reraise
app-1  |     raise value.with_traceback(tb)
app-1  |   File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 793, in urlopen
app-1  |     response = self._make_request(
app-1  |                ^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 537, in _make_request
app-1  |     response = conn.getresponse()
app-1  |                ^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 466, in getresponse
app-1  |     httplib_response = super().getresponse()
app-1  |                        ^^^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/http/client.py", line 1395, in getresponse
app-1  |     response.begin()
app-1  |   File "/usr/local/lib/python3.11/http/client.py", line 325, in begin
app-1  |     version, status, reason = self._read_status()
app-1  |                               ^^^^^^^^^^^^^^^^^^^
app-1  |   File "/usr/local/lib/python3.11/http/client.py", line 294, in _read_status
app-1  |     raise RemoteDisconnected("Remote end closed connection without"
app-1  | urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
app-1  | [2024/05/08 05:58:32|__init__.py             |INFO    ]: ensuring close

jukoson commented 6 months ago

I am not running flathunter but my own crawler for Immoscout. I noticed that they seem to have switched from Geetest to AWS WAF. I haven't had time to look into it yet, but 2captcha seems to support that kind of challenge. It just needs to be implemented.

fmmix commented 6 months ago

@jukoson

When I run the same configuration (I am running seleniumbase instead of uc, but that shouldn't matter too much) from my local machine via WSL, it works and asks for a geetest captcha, but on my VPS, running directly on Linux, it returns this page source:

<html><head> <meta http-equiv="content-type" content="text/html; charset=UTF-8"> <meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1"> <meta name="robots" content="noindex, nofollow"> <title>Ich bin kein Roboter - ImmobilienScout24</title> <link rel="icon" type="image/x-icon" href="https://www.immobilienscout24.de/favicon.ico"> <link href="https://www.static-immobilienscout24.de/fro/core/5.10.0/font/vendor/make-it-sans/MakeItSansIS24WEB-Regular.woff2" as="font" type="font/woff2" crossorigin=""> <link rel="stylesheet" type="text/css" href="https://www.static-immobilienscout24.de/fro/core/5.10.0/css/core.min.css" crossorigin=""> <script type="text/javascript" src="https://82d925f87a91.edge.captcha-sdk.awswaf.com/82d925f87a91/jsapi.js"></script><script src="https://82d925f87a91.b0fd59a6.eu-central-1.token.awswaf.com/82d925f87a91/challenge.js"></script></head><body><header class="page-header--white"> <a href="https://www.immobilienscout24.de/" class="page-header__logo margin-left-xxl"> <img alt="ImmobilienScout24" src="https://www.static-immobilienscout24.de/fro/imperva/0.0.1/is24-logo.svg"> </a></header><div class="page-wrapper align-center margin-top-xxl"> <div class="main horizontal-center five-tenths"> <h1 class="align-center">Ich bin kein Roboter</h1> <div class="three-tenths horizontal-center palm-hide"> <img src="https://www.static-immobilienscout24.de/fro/imperva/0.0.1/robot-logo.svg"> </div> <div class="font-bold margin-top-xl"> Du bist ein Mensch aus Fleisch und Blut? Entschuldige bitte, dann hat unser System dich fälschlicherweise als Roboter identifiziert. Um unsere Services weiterhin zu nutzen, löse bitte diesen kurzen Test. </div> <div id="captcha-container" class="margin-top-m"><awswaf-captcha dir="ltr" style="display: block; width: 320px; margin: 0px auto;"></awswaf-captcha></div> <div class="font-bold margin-top-m">Warum haben wir deine Anfrage blockiert?</div> <div>Es kann verschiedene Gründe haben. 
Möglicherweise hast du</div> <ul class="list-bullet align-left margin-left-xxl"> <li>JavaScript deaktiviert.</li> <li>ungewöhnlich viele Anfragen an unser System gestellt.</li> </ul> <div id="requestId" class="margin-top-l">Request ID: someid</div> </div></div><footer class="main-footer"> <a href="https://www.immobilienscout24.de/impressum.html">Impressum</a> <div class="legend margin-top"> © Copyright 1999 - 2024 Immobilien Scout GmbH </div></footer><script type="text/javascript"> AwsWafCaptcha.renderCaptcha(document.querySelector("#captcha-container"), { apiKey: "somekey", onSuccess: (wafToken) => { AwsWafIntegration.saveReferrer(); if (window.location.search.includes("wafforce")) { const url = new URL(window.location); url.searchParams.delete("wafforce"); window.location.href = url.toString(); } else { window.location.reload(true); } }, defaultLocale: "de-DE", skipTitle: true }); const sheet = new CSSStyleSheet; sheet.replaceSync('.btn-primary, .btn-primary:hover { color: #333; background-color: #00ffd0; border: 1px solid #00ffd0; font-weight: 600; border-radius: 8px;}'); document.querySelector('awswaf-captcha').shadowRoot.adoptedStyleSheets.push(sheet); fetch(window.location.href).then(response => { document.querySelector("#requestId").innerHTML = "Request ID: " + response.headers.get('X-Amz-Cf-Id'); });</script></body></html>

As you said, it gives us an AWS WAF captcha instead of the geetest one.

Could it be that for 'bad IPs' there is another layer of captcha protection? Does something like this exist? I am not really deeply into it.
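
To make it easier to compare what different machines are served, here is a minimal sketch that guesses the challenge type from the page source. It relies only on the marker substrings that show up in this thread ("awswaf" in the script URLs above, "geetest" for the older challenge); the function name `detect_challenge` is hypothetical, and the markers are an assumption rather than a documented way to identify these pages.

```python
from typing import Optional


def detect_challenge(page_source: str) -> Optional[str]:
    """Guess which bot-protection page was served, based on marker substrings.

    Assumption: AWS WAF pages reference hosts containing "awswaf" and
    Geetest pages reference "geetest" somewhere in their source.
    """
    lowered = page_source.lower()
    if "awswaf" in lowered:
        return "awswaf"
    if "geetest" in lowered:
        return "geetest"
    return None
```

Logging the result of `detect_challenge(driver.page_source)` on both the local machine and the VPS would show whether the two really get different challenges.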

codders commented 6 months ago

I'm still seeing geetest:

2024-05-09-172425_2146x954_scrot

According to the 2captcha docs, we need to capture these details in order to solve a WAF captcha:

{
    "clientKey": "YOUR_API_KEY",
    "task": {
        "type":"AmazonTask",
        "websiteURL": "https://efw47fpad9.execute-api.us-east-1.amazonaws.com/latest",
        "challengeScript": "https://41bcdd4fb3cb.610cd090.us-east-1.token.awswaf.com/41bcdd4fb3cb/0d21de737ccb/cd77baa6c832/challenge.js",
        "captchaScript": "https://41bcdd4fb3cb.610cd090.us-east-1.captcha.awswaf.com/41bcdd4fb3cb/0d21de737ccb/cd77baa6c832/captcha.js",
        "websiteKey": "AQIDA...wZwdADFLWk7XOA==",
        "context": "qoJYgnKsc...aormh/dYYK+Y=",
        "iv": "CgAAXFFFFSAAABVk"
    }
}

I can see captchaScript in the HTML you posted, but I don't know where we'd find the other info. Any hints / tips / reproduction steps welcome!
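
As a starting point for collecting these fields, here is a rough sketch that pulls the AWS WAF script URLs out of a block page using only the standard library. The names `ScriptSrcCollector` and `find_waf_scripts` are made up for illustration. Treating `jsapi.js` as the `captchaScript` is an assumption based on the HTML posted above; the 2captcha example uses a `captcha.js` URL, so this mapping may well be wrong.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urlparse


class ScriptSrcCollector(HTMLParser):
    """Collects the src attribute of every <script> tag it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.sources: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_waf_scripts(page_source: str) -> Dict[str, str]:
    """Return the AWS WAF script URLs found in the page, keyed by role."""
    collector = ScriptSrcCollector()
    collector.feed(page_source)
    found: Dict[str, str] = {}
    for src in collector.sources:
        parsed = urlparse(src)
        if not (parsed.hostname or "").endswith("awswaf.com"):
            continue
        if parsed.path.endswith("challenge.js"):
            found["challengeScript"] = src
        elif parsed.path.endswith(("captcha.js", "jsapi.js")):
            # Assumption: jsapi.js plays the role of the captcha script here.
            found["captchaScript"] = src
    return found
```

This does not find `websiteKey`, `context` or `iv`; those do not appear in the HTML posted above.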

jukoson commented 6 months ago

There seem to be different versions of the Captcha challenge. One with key, context, iv, and then there is another one with only the challengeScript.

I checked the 2captcha docs and did not spot an API call for the latter version, but I only checked briefly. There's another service called Capsolver, and they seem to support it. I have no previous experience with this service though.

I will attempt an implementation early next week and share my results here. But maybe someone else is keen on doing so already :-)

fmmix commented 6 months ago

@codders

To reproduce it I just need to use my VPS. I think the reason might be that it got flagged as a bad actor due to the number of requests. Locally it still works with geetest; on the VPS I only get the AWS captcha.

The problem (I think; I have no experience there) is that the captcha only appears after the corresponding JavaScript has run (I can see it when I click inspect element). This is why the needed information is not directly in the page source. Can one just simulate that?

Then you get the needed information:

<p>Place a dot at the end of the car's path</p>
<div style="display: flex; margin: 10px 0px;"><p style="padding-right: 20px;">Solve by selecting the end of the colored line.</p><img gsrc="   data:image/svg+xml;base64,PxxxxxxxA==" alt="An image showing a car icon and a dot, connected by a line" style="height: 50px;"></div>

One would need to map the type of tests to have Capsolver solve the captcha:

https://docs.capsolver.com/guide/recognition/AwsWafClassification.html

Ah and the challenge rotates pretty fast. Hopefully that won't be another issue (actually like 60 seconds)

jukoson commented 6 months ago

You only really need the challengeScript, which is available when the challenge is AWS WAF.

I've created a basic example that dissects the challenge, submits it to Capsolver, and retrieves the result in the form of a cookie. I haven't managed to make use of that cookie yet, but I don't have much time right now either. In its current state, the example below fetches the page again after adjusting my cookies. The result is that another AWS WAF challenge is presented.

I'll just post the example here in case anyone would like to pick it up. Create an account with Capsolver (they have a free trial) and replace the CAPSOLVER_API_KEY in the code. You may need to install seleniumbase and selenium_stealth since I didn't bother to align it with the actual packages used by this project. Actually stealth is not required here at all, but anyway.

from seleniumbase import Driver
import re
import requests
from time import sleep
from selenium_stealth import stealth
from selenium import webdriver

CAPSOLVER_API_KEY = "XXX"
CAPSOLVER_API_ENDPOINT = "https://api.capsolver.com/createTask"
url = "https://www.immobilienscout24.de/Suche/de/wohnung-mieten?sorting=2&pagenumber=1"
client = requests.Session()

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--headless")
driver = Driver(uc=True, headless=True, agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36", uc_cdp_events = True)
driver.set_page_load_timeout(20)
driver.execute_cdp_cmd('Network.enable', {})

stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )

def delete_cookies():
    for cookie in driver.get_cookies():
        driver.delete_cookie(cookie['name'])

def get_aws_cookie():
    for cookie in driver.get_cookies():
        if(cookie['name'] == 'aws-waf-token'):
            return cookie

def resolve_aws(iter=0):
    script_content = driver.page_source

    key_match = re.search(r'"key":"([^"]+)"', script_content)
    iv_match = re.search(r'"iv":"([^"]+)"', script_content)
    context_match = re.search(r'"context":"([^"]+)"', script_content)
    jschallange_match = re.search(r'<script src="(.*?challenge.js.*?)".*?></script>', script_content)
    key = None
    iv = None
    context = None
    jschallange = None
    if key_match and iv_match and context_match:
        key = key_match.group(1)
        iv = iv_match.group(1)
        context = context_match.group(1)
        jschallange = jschallange_match.group(1)
        data = {
            "clientKey": CAPSOLVER_API_KEY,
            "task": {
                "type": "AntiAwsWafTaskProxyLess",
                "websiteURL": driver.current_url,
                "awsKey": key,
                "awsIv": iv,
                "awsContext": context,
                "awsChallengeJS": jschallange
            }
        }
    else:
        jschallange = jschallange_match.group(1)
        data = {
            "clientKey": CAPSOLVER_API_KEY,
            "task": {
                "type": "AntiAwsWafTaskProxyLess",
                "websiteURL": driver.current_url,
                "awsChallengeJS": jschallange
            }
        }

    try:
        task_id_response = client.post(CAPSOLVER_API_ENDPOINT, json=data)
        task_id = task_id_response.json()['taskId']

        try_cnt=0
        while True:
            cookie_response = client.post("https://api.capsolver.com/getTaskResult", json={"clientKey": CAPSOLVER_API_KEY, "taskId": task_id}).json()
            sleep(5)
            if cookie_response["status"] == "ready":
                cookie = cookie_response["solution"]["cookie"]

                # Replace the old cookie with the newly obtained
                old_cookie = get_aws_cookie()
                new_cookie = old_cookie
                new_cookie['value'] = cookie
                delete_cookies()
                driver.add_cookie(new_cookie)

                driver.uc_open_with_reconnect(driver.current_url, reconnect_time=3)

                return True
            elif cookie_response["status"] == "failed":
                return False
            else:
                try_cnt+=1
                if(try_cnt>5):
                    return False
                continue            
    except Exception as e:
        print(e)

# First delete all cookies, fetch IS24 page and solve AWS if presented

delete_cookies()
driver.uc_open(url)

if re.search("awswaf", driver.page_source):
    resolve_aws(0)
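
One thing that could be tidied up in the script above: if no challenge.js script tag is found, `jschallange_match.group(1)` raises an AttributeError. A possible way to factor out the payload building is sketched below. The task type and field names are taken from the script above; the function name `build_waf_task` and the slightly stricter script-tag regex are my own and only illustrative.

```python
import re
from typing import Optional

# Matches a <script> tag whose src contains challenge.js and captures the URL.
CHALLENGE_JS_RE = re.compile(r'<script[^>]*\bsrc="([^"]*challenge\.js[^"]*)"')
# Optional fields, present only in one variant of the challenge page.
FIELD_RES = {
    "awsKey": re.compile(r'"key":"([^"]+)"'),
    "awsIv": re.compile(r'"iv":"([^"]+)"'),
    "awsContext": re.compile(r'"context":"([^"]+)"'),
}


def build_waf_task(page_source: str, website_url: str, api_key: str) -> Optional[dict]:
    """Build the createTask payload, or return None if no challenge.js is found."""
    challenge = CHALLENGE_JS_RE.search(page_source)
    if challenge is None:
        return None
    task = {
        "type": "AntiAwsWafTaskProxyLess",
        "websiteURL": website_url,
        "awsChallengeJS": challenge.group(1),
    }
    matches = {name: rx.search(page_source) for name, rx in FIELD_RES.items()}
    # Only add key/iv/context when all three are present, as in the script above.
    if all(matches.values()):
        task.update({name: m.group(1) for name, m in matches.items()})
    return {"clientKey": api_key, "task": task}
```

With this, `resolve_aws` could call `build_waf_task(driver.page_source, driver.current_url, CAPSOLVER_API_KEY)` and bail out early when it returns None.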

mxfilerelatedcache commented 6 months ago

@jukoson

Could it be that for 'bad ips' there is another layer on captcha protection? Does something like this exist. I am not really deeply into it.

If this is really the case and 'bad ips' get flagged and shown the AWS captcha, wouldn't one temporary solution be to randomise the crawling interval by a few secs/mins? I don't have any other idea as to how they would sense it's a 'bad ip', as ~10mins seems like a totally reasonable refreshing time for an actual human. It might just be about the regularity?
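
To make the idea concrete, here is a small sketch of a randomised interval. Whether the regularity of requests is what gets an IP flagged is pure speculation on my part; the function name `jittered_delay` and the numbers in the usage note are just placeholders.

```python
import random
from typing import Optional


def jittered_delay(base_seconds: float, max_jitter_seconds: float,
                   rng: Optional[random.Random] = None) -> float:
    """Return the base delay shifted by a random offset in [-jitter, +jitter]."""
    if rng is None:
        rng = random.Random()
    offset = rng.uniform(-max_jitter_seconds, max_jitter_seconds)
    # Never return a negative delay, even if the jitter exceeds the base.
    return max(0.0, base_seconds + offset)
```

For a roughly 10 minute interval, something like `time.sleep(jittered_delay(600, 60))` between crawls would vary the wait between 9 and 11 minutes.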

codders commented 6 months ago

@jukoson Thanks so much for the sample code! It shouldn't be too hard to integrate that into the crawlers that we have. I don't know if I'll get around to that this week, but if someone else wants to give it a go that would be very welcome!

diegopzz commented 6 months ago

You can solve using the documentation: https://docs.capsolver.com/guide/captcha/awsWaf.html And the blog: https://www.capsolver.com/blog/All/how-to-solve-aws-amazon-captcha-token

jukoson commented 6 months ago

You can solve using the documentation: https://docs.capsolver.com/guide/captcha/awsWaf.html And the blog: https://www.capsolver.com/blog/All/how-to-solve-aws-amazon-captcha-token

The code above already resolves the captcha - however it does not yet bring you to the actual page you wanted to land at.


I've progressed a little in that I can now solve the challenge and land on the is24 main website. From there, I accept cookies, type a search and then click the Search button. However, another challenge pops up, after which I'm redirected to the main page again. I'm a bit lost here. Could someone else take a look as well? I'm just a hardware engineer :) The code is very messy; at this stage I didn't bother keeping it clean.

UPDATE: This code does not solve the challenge properly; it accidentally returns to the main page instead of submitting the captcha.

from seleniumbase import Driver
import re
import requests
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from random import randint

url = "https://www.immobilienscout24.de/Suche/de/wohnung-mieten?sorting=2&pagenumber=1"
CAPSOLVER_API_ENDPOINT = "https://api.capsolver.com/createTask"
CAPSOLVER_API_KEY = "XXX"
client = requests.Session()
DEFAULT_IS24_URL='https://www.immobilienscout24.de/'

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

userAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"

driver = Driver(uc=True, headless2=True, agent=userAgent, uc_cdp_events = True)
driver.set_page_load_timeout(20)
driver.execute_cdp_cmd('Network.setBlockedURLs',
    {"urls": ["https://api.geetest.com/get.*"]})
driver.execute_cdp_cmd('Network.enable', {})

def delete_cookies():
    for cookie in driver.get_cookies():
        driver.delete_cookie(cookie['name'])

def get_aws_cookie():
    for cookie in driver.get_cookies():
        if(cookie['name'] == 'aws-waf-token'):
            return cookie

def resolve_aws(iter=0):
    print(f"--------------------- ITER {iter}")
    script_content = driver.page_source

    key_match = re.search(r'"key":"([^"]+)"', script_content)
    iv_match = re.search(r'"iv":"([^"]+)"', script_content)
    context_match = re.search(r'"context":"([^"]+)"', script_content)
    jschallange_match = re.search(r'<script src="(.*?challenge.js.*?)".*?></script>', script_content)
    key = None
    iv = None
    context = None
    jschallange = None
    if key_match and iv_match and context_match:
        key = key_match.group(1)
        iv = iv_match.group(1)
        context = context_match.group(1)
        jschallange = jschallange_match.group(1)
        data = {
            "clientKey": CAPSOLVER_API_KEY,
            "task": {
                "type": "AntiAwsWafTaskProxyLess",
                "websiteURL": driver.current_url,
                "awsKey": key,
                "awsIv": iv,
                "awsContext": context,
                "awsChallengeJS": jschallange
            }
        }
    else:
        jschallange = jschallange_match.group(1)
        data = {
            "clientKey": CAPSOLVER_API_KEY,
            "task": {
                "type": "AntiAwsWafTaskProxyLess",
                "websiteURL": driver.current_url,
                "awsChallengeJS": jschallange
            }
        }

    try:
        task_id_response = client.post(CAPSOLVER_API_ENDPOINT, json=data)
        task_id = task_id_response.json()['taskId']

        try_cnt=0
        while True:
            cookie_response = client.post("https://api.capsolver.com/getTaskResult", json={"clientKey": CAPSOLVER_API_KEY, "taskId": task_id}).json()
            sleep(3)
            if cookie_response["status"] == "ready":
                # Get the cookie (AWS WAF token) from the CAPSOLVER response
                cookie = cookie_response["solution"]["cookie"]

                old_cookie = get_aws_cookie()
                driver.delete_cookie('aws-waf-token')
                new_cookie = old_cookie
                new_cookie['value'] = cookie
                driver.add_cookie(new_cookie)
                captcha_container = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "captcha-container")))
                interactive_element = WebDriverWait(captcha_container, 10).until(EC.element_to_be_clickable((By.XPATH, "//button | //input | //a")))
                interactive_element.click()  # NOTE: there is a mistake here, this is not clicking the submit button!

                print("Capsolver Solved")   
                if(driver.current_url == DEFAULT_IS24_URL):
                    print(f"Landed on DEFAULT_IS24_URL {DEFAULT_IS24_URL}")
                    driver.sleep(randint(1,3))
                    try:
                        shadow_root_ele = driver.find_element(By.CSS_SELECTOR, "#usercentrics-root").shadow_root
                        shadow_root_ele.find_element(By.CSS_SELECTOR, "button[data-testid='uc-accept-all-button']").click()
                    except Exception as e:
                        print(f"No cookie banner? e: {e}")
                    print("past the cookie banner!")
                    driver.sleep(randint(1,3))
                    if (driver.execute_script("return document.querySelectorAll('#oss-location')[1].value;") != 'Berlin'):
                        driver.type("(//input[@id='oss-location'])[2]", "Berlin")
                        driver.sleep(randint(1,3))
                    try:
                        # No idea why single click doesnt work
                        print("Clicking search")
                        driver.click('button.oss-main-criterion.oss-button.button-primary.one-whole.vertical-center-container')
                        driver.click('button.oss-main-criterion.oss-button.button-primary.one-whole.vertical-center-container')
                    except Exception as e:
                        print(f"Couldnt click? e: {e}")

                    driver.sleep(randint(1,3))

                return True
            elif cookie_response["status"] == "failed":
                print("capsolver failed")    
                return False
            else:
                print(f"capsolver not ready yet.... Status: {cookie_response['status']}")
                try_cnt+=1
                if(try_cnt>5):
                    print("capsolver did not process in time for the loop")    
                    return False
                continue            
    except Exception as e:
        print(f"Resolve AWS WAF failed with {e}")           

delete_cookies()
driver.uc_open(url)

if re.search("awswaf", driver.page_source):
    print("AWS WAF Challenge")
    resolve_aws(0)
    if re.search("awswaf", driver.page_source):
        resolve_aws(1)
    if re.search("awswaf", driver.page_source):
        resolve_aws(2)
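
The polling loop in the script above could also be pulled out into a small helper, which makes the retry limit explicit and lets it be exercised without calling the API. This is only a sketch: `poll_task` and its parameters are hypothetical names, and the "ready" and "failed" status values are simply the ones the script above checks for.

```python
import time
from typing import Callable, Optional


def poll_task(fetch_result: Callable[[], dict], max_attempts: int = 6,
              delay_seconds: float = 3.0,
              sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Call fetch_result until it reports "ready" or "failed", or attempts run out.

    Returns the "solution" dict on success and None otherwise.
    """
    for attempt in range(max_attempts):
        result = fetch_result()
        status = result.get("status")
        if status == "ready":
            return result.get("solution")
        if status == "failed":
            return None
        # Wait before the next attempt, but not after the last one.
        if attempt < max_attempts - 1:
            sleep(delay_seconds)
    return None
```

In the script, `fetch_result` would be a lambda wrapping the `client.post(...).json()` call to getTaskResult, and the cookie value would then be read from the returned solution.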

fmmix commented 6 months ago

@jukoson

is it by design that you

I'm not criticizing or trying to catch you out or anything, just wanting to understand the code.

I am not sure Capsolver actually worked. Have you tested it anywhere else? Yes, when we put in the challenge.js we get some kind of response, but does that help us get past the captchas?

Instead of redirecting to the main page, I tried to use the 'submit' button after switching the cookies. But this didn't work. It said wrong answer.

I replaced captcha_container and interactive_element with:

  shadow_root_ele = driver.find_element(By.TAG_NAME, 'awswaf-captcha').shadow_root
  driver.sleep(2)
  shadow_root_ele.find_element(By.CSS_SELECTOR,'#amzn-btn-verify-internal').click()

I am not sure how the solver is supposed to work, but changing the cookie and then clicking ok is apparently not it

jukoson commented 6 months ago

@fmmix Thanks a lot for your response!

It was never intended to return to the webpage. To be honest, I gave ChatGPT parts of the website and asked it to click that button for me. I didn't question it since I also did not notice any other buttons on the page. I don't have access to my laptop right now, but I guess it's just clicking the "Immoscout" link on the top left, or something similar.

I could not explain why resolving the challenge would bring me to the main page, so I thought it was just some extra layer or sanity check that I'm not a bot. Now it all makes sense, however. I'll look at the captcha solving again when I have some time. Thanks again!

fmmix commented 6 months ago

@jukoson

I don't have any experience with stuff like this myself, digging through website elements etc. Just bruteforcing myself through it :-D

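This line tells the driver to click on the top-left logo, as you already guessed:

interactive_element = WebDriverWait(captcha_container, 10).until(EC.element_to_be_clickable((By.XPATH, "//button | //input | //a")))

Because the XPath starts with //, it searches the whole document rather than only captcha_container, so it matches the first clickable element on the page. The corresponding XPath is: /html/body/header/a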

The issue with the page is that the submit ("Bestätigen") button sits behind a #shadow-root, which hides its elements from the normal find-element functions. You are already using the shadow root on lines 106 and 107; I just placed it on the search page too (97-99). It will click the button when you replace your two or three lines with the ones I posted, but the page will then say the answer is wrong and ask you to try again.

But it looks like the idea is that no button needs to be pressed: after replacing the cookie, refreshing the page should be enough to get past the challenge.

I tried it using the 202 example from the blog: if we replace our URL with https://efw47fpad9.execute-api.us-east-1.amazonaws.com/latest, send the AWS WAF challenge JS the same way, replace the cookie value and then refresh the browser with driver.refresh(), driver.open(url) or driver.uc_open_with_reconnect(url, 3), it all works.
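
For the cookie replacement step itself, here is a hedged sketch that builds the new cookie without mutating the one read from the browser, and that also covers the case where no aws-waf-token cookie exists yet. The helper name `replaced_waf_cookie` and the fallback domain argument are assumptions for illustration; I have not verified which cookie attributes Immoscout actually expects.

```python
import copy

WAF_COOKIE_NAME = "aws-waf-token"


def replaced_waf_cookie(cookies: list, token: str, fallback_domain: str) -> dict:
    """Return a cookie dict carrying the new token, leaving the input untouched."""
    for cookie in cookies:
        if cookie.get("name") == WAF_COOKIE_NAME:
            # Copy so the dict returned by the driver is not modified in place.
            updated = copy.deepcopy(cookie)
            updated["value"] = token
            return updated
    # Assumption: when no token cookie exists yet, a minimal cookie is enough.
    return {"name": WAF_COOKIE_NAME, "value": token,
            "domain": fallback_domain, "path": "/"}
```

The result can be passed to `driver.add_cookie(...)` after deleting the old aws-waf-token cookie, followed by one of the refresh calls mentioned above.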


I played a bit more and almost got it to work. Instead of the capsolver response, I tried using the cookie value I got after doing the captcha manually. It still didn't work at first, but I played around more and found out that one of the issues was that I needed to scroll down after switching the cookies for it to have any effect. Then the driver will actually open the captcha-free site. But this only works with my manually generated cookie. None of the capsolver values worked on immoscout. The same code does work for the example page in the blog, though (when we just replace the URL). I suspect capsolver simply doesn't work for immoscout. No idea what else would be missing.

import re
import requests
from selenium import webdriver
from seleniumbase import SB

#URL = "https://efw47fpad9.execute-api.us-east-1.amazonaws.com/latest"
URL = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?numberofrooms=4.0-&price=-3500.0&exclusioncriteria=swapflat&pricetype=rentpermonth&sorting=2&enteredFrom=result_list"
CAPSOLVER_API_ENDPOINT = "https://api.capsolver.com/createTask"
CAPSOLVER_API_KEY = "CAPxx"
MANUAL_COOKIE = '33xx'

client = requests.Session()
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
#options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

def open_page(sb, url):
    sb.driver.uc_open_with_reconnect(url, reconnect_time=2)

def scroll_down(sb):
    sb.scroll_to_bottom()

def get_aws_cookie(sb):
    for cookie in sb.get_cookies():
        if (cookie['name'] == 'aws-waf-token'):
            return cookie

def resolve_aws(sb, iteration=0):
    #sb.driver.sleep(60)
    print(f"--------------------- ITER {iteration}")
    script_content = sb.driver.page_source

    key_match = re.search(r'"key":"([^"]+)"', script_content)
    iv_match = re.search(r'"iv":"([^"]+)"', script_content)
    context_match = re.search(r'"context":"([^"]+)"', script_content)
    jschallenge_match = re.search(
        r'<script src="(.*?challenge.js.*?)".*?></script>', script_content)
    if jschallenge_match is None:
        # Without the challenge script there is nothing to send to the solver
        print("No challenge.js script found in page source")
        return

    task = {
        "type": "AntiAwsWafTaskProxyLess",
        "websiteURL": sb.driver.current_url,
        "awsChallengeJS": jschallenge_match.group(1)
    }
    has_captcha_params = key_match and iv_match and context_match
    if has_captcha_params:
        # key, iv and context are only sent when all three are in the page
        task["awsKey"] = key_match.group(1)
        task["awsIv"] = iv_match.group(1)
        task["awsContext"] = context_match.group(1)
    data = {"clientKey": CAPSOLVER_API_KEY, "task": task}
    if not has_captcha_params:
        print(data)

    try:
        task_id_response = client.post(CAPSOLVER_API_ENDPOINT, json=data)
        task_id = task_id_response.json()['taskId']
        sb.driver.sleep(10) # didn't want to loop, just a high enough number
        cookie_response = client.post(
            "https://api.capsolver.com/getTaskResult",
            json={"clientKey": CAPSOLVER_API_KEY, "taskId": task_id}).json()
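        if cookie_response["status"] == "ready":
            print('ready')
            # Get the cookie (AWS WAF token) from the CAPSOLVER response
            cookie = cookie_response["solution"]["cookie"]

            old_cookie = get_aws_cookie(sb)
            print('old')
            print(old_cookie)
            if old_cookie is None:
                # Without an existing cookie there are no attributes to reuse
                print('No aws-waf-token cookie found, nothing to replace')
                return
            sb.driver.delete_cookie('aws-waf-token')
            # Reuse the old cookie's attributes and only swap the value
            new_cookie = old_cookie
            #new_cookie['value'] = cookie  # solver result
            new_cookie['value'] = MANUAL_COOKIE
            print('new')
            print(new_cookie)
            sb.driver.add_cookie(new_cookie)
            sb.driver.sleep(3)
            sb.driver.refresh()
            sb.driver.sleep(5)
        else:
            # The single 10 second wait was too short, or the task failed
            print(f"solver not ready: {cookie_response}")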
    except Exception as e:
        print(f"Resolve AWS WAF failed with {e}")

with SB(uc=True, headed=True, test=True) as sb:
    open_page(sb, URL)
    try:
        scroll_down(sb)
        resolve_aws(sb)
        print(sb.driver.page_source)
    except Exception as e:
        print(e)

Not a full solution, just the page_source part. I replaced the while loop with a longer wait. I run everything headed since I want to see what it does :-D; with headless2=True in the SB(...) call it works just the same for getting the page_source. To get a valid cookie, uncomment the sleep at the top of resolve_aws, run SB with headed=True, solve the captcha manually, and then copy the value of the aws-waf-token cookie from Inspect - Application - Cookies - immoscout24.de.

Maybe someone else can solve the missing puzzle? Or find another captcha solver for aws.

Update: Another thing I found out is that my manual cookie values are 326 characters long, while the ones from the captcha solver are only 262. Looks like the solver is simply not working correctly.

jukoson commented 6 months ago

Thanks @fmmix , really good work. I feel this is getting somewhat closer to a solution.

I just checked here as well, and the cookie length that I receive from capsolver is 262 compared to 326 with manual solving. (That's 64 apart ... as a person that counts in the binary system: coincidence?) I will have some time to dig further on Tuesday.

For now, I've reached out to 2Captcha support to see if they can resolve the 202 response syntax (jschallenge only) and also to Capsolver to ask whether they have an explanation for what could go wrong. Will update here.

jukoson commented 6 months ago

I've got some good news over here. I switched to a different captcha solving service and it ... just works. The format of the request to the service is not properly documented, so there was some back and forth with the support. Anyway, I could get it to work now with the Capmonster service.

Below is example code that works for me. Please let me know if it works for others here too.

import re
import requests
from selenium import webdriver
from seleniumbase import SB

URL = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?numberofrooms=4.0-&price=-3500.0&exclusioncriteria=swapflat&pricetype=rentpermonth&sorting=2&enteredFrom=result_list"
SOLVER_API_ENDPOINT_CREATE = "https://api.capmonster.cloud/createTask"
SOLVER_API_ENDPOINT_GET = "https://api.capmonster.cloud/getTaskResult"
SOLVER_API_KEY = "XYZ"

client = requests.Session()
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

def open_page(sb, url):
    sb.driver.uc_open_with_reconnect(url, reconnect_time=2)

def get_aws_cookie(sb):
    for cookie in sb.get_cookies():
        if (cookie['name'] == 'aws-waf-token'):
            return cookie

def resolve_aws(sb, iter=0):
    page_source = sb.driver.page_source

    # Find the captcha script (jsapi.js) referenced by the page; if several
    # script tags match, the last one is used
    jsapi = None
    patternJsApi = r'src="([^"]*jsapi\.js)"'
    for match in re.findall(patternJsApi, page_source):
        print(f'SRC Value: {match}')
        jsapi = match
    if jsapi is None:
        print('No jsapi.js script found.')
        exit()

    # The website key is embedded in the page as apiKey: "..."
    patternKey = r'apiKey:\s*"([^"]+)"'
    match = re.search(patternKey, page_source)

    if match:
        api_key = match.group(1)
        print(f'apiKey: {api_key}')
    else:
        print('No apiKey found.')
        exit()
    data = {
        "clientKey": SOLVER_API_KEY,
        "task": {
            "type": "AmazonTaskProxyless",
            "websiteURL": URL,
            "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            "captchaScript": jsapi,
            "websiteKey": api_key,
            "challengeScript": "",
            "context": "",
            "iv": "",
            "cookieSolution": True
        }
    }

    try:
        task_id_response = client.post(SOLVER_API_ENDPOINT_CREATE, json=data)
        task_id = task_id_response.json()['taskId']

        try_cnt = 0
        while True:
            sb.driver.sleep(5)
            cookie_response = client.post(SOLVER_API_ENDPOINT_GET, json={"clientKey": SOLVER_API_KEY, "taskId": task_id}).json()
            if cookie_response["status"] == "ready":
                print(f'ready: {cookie_response}')
                cookie = cookie_response["solution"]["cookies"]['aws-waf-token']
                old_cookie = get_aws_cookie(sb)
                if old_cookie is None:
                    print("No aws-waf-token cookie found in the browser.")
                    exit()
                sb.driver.delete_cookie('aws-waf-token')
                # Keep the old cookie's attributes and only swap in the solved token
                new_cookie = old_cookie
                new_cookie['value'] = cookie
                sb.driver.add_cookie(new_cookie)
                sb.driver.sleep(3)
                sb.driver.refresh()
                sb.driver.sleep(5)
                return True
            elif cookie_response["status"] == "failed":
                print(f"solver failed: {cookie_response}")
                exit()
            else:
                print(f"solver not ready yet.... Status: {cookie_response}")
                try_cnt += 1
                # Give up after 6 polls (about 30 seconds)
                if try_cnt > 5:
                    print("solver did not process in time for the loop")
                    exit()
    except Exception as e:
        print(f"Resolve AWS WAF failed with {e}")

def is_aws_waf(sb):
    is_awswaf = re.search("awswaf", sb.driver.page_source)
    is_roboter = re.search("Roboter", sb.driver.page_source)
    return is_awswaf and is_roboter

with SB(uc=True, headed=True) as sb:
    open_page(sb, URL)
    if is_aws_waf(sb):
        resolve_aws(sb)
        if is_aws_waf(sb):
            print(".... STILL AWS")
        else:
            print("Resolved !!!")
    else:
        exit()

fmmix commented 6 months ago

@jukoson

I was eagerly waiting for your post hehe.

I can confirm it works for me too🥳 Amazing stuff!

(In the name of science I sacrificed 7 dollars since it was the lowest amount I could paypal and didn't find the free trial at first, looks like you can request it from support if you want a free trial without paying - oh well, gonna use it up eventually 😅 )

Looks like scrolling isn't even needed which is great since I only used the SB context for the scroll function. In flathunter the driver gets passed around to different parts of the code and just using that instead of the context might be easier to implement here.
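A rough sketch of the cookie swap using only plain WebDriver calls (`get_cookie`, `delete_cookie`, `add_cookie`, `refresh`), so it needs nothing from the SB context. The helper name `swap_waf_token` and the `settle_seconds` parameter are made up for illustration and are not part of flathunter.

```python
import time

WAF_COOKIE_NAME = "aws-waf-token"


def swap_waf_token(driver, token, settle_seconds=3.0):
    """Replace the value of the aws-waf-token cookie and reload the page.

    Returns False if the browser has no such cookie yet, True otherwise.
    """
    old_cookie = driver.get_cookie(WAF_COOKIE_NAME)
    if old_cookie is None:
        return False
    # Copy the old cookie so its other attributes are kept;
    # only the value changes.
    new_cookie = dict(old_cookie)
    new_cookie["value"] = token
    driver.delete_cookie(WAF_COOKIE_NAME)
    driver.add_cookie(new_cookie)
    time.sleep(settle_seconds)
    driver.refresh()
    return True
```

The caller would still have to check the page source afterwards to see whether the captcha page is gone.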

MehrAmoonCraft commented 4 months ago

Has anyone committed the fix? Can't use Immoscout :(

Oli4 commented 3 months ago

Thank you very much @jukoson, flathunter works for immoscout again. Is there any discussion on how to bring this fix into the repository? Probably the question right now is how to continue with the captcha solving. Options are

Either way thanks to everyone for maintaining this project!

codders commented 3 months ago

Hi @Oli4, all,

I am one of the maintainers of flathunter. From my side, I am very open to well-formed pull requests - I am happy to review them and provide feedback. I don't have the capacity to work on it myself right now, but I will make the time to seriously look at and test and merge good PRs.

We can support multiple captcha engines - we already do - so adding a new engine shouldn't be any trouble. I would be reluctant to remove existing support since it's really hard to say what users are using which features, and backward-compatibility is important.

Happy also to answer high-level questions about approach here. If we need to change the crawler architecture a bit to support a new captcha solver, we can do that. But someone needs to write the PR and not break existing configs.

Thanks,

Arthur

jukoson commented 3 months ago

Proposing a PR has been on my to-do list ever since.

I am not too familiar with the flathunter codebase, and due to a lack of time I haven't managed it yet. If anyone wants to pick it up, let me know. Otherwise I'm confident I can provide something within a week or two.

For anyone struggling with captcha recognition or looking for a no-maintenance solution, I would like to recommend taking a look at my own project www.immobilien-bot.de or straight on Telegram: https://t.me/ImmobilienBot_bot

@codders I am happy to remove the reference if this is inappropriate. By the way Flathunter is recommended on the website :-)

DerLeole commented 3 months ago

@jukoson Have implemented your fix ^^

@codders Hope my PR is in good enough form to be merged. I would love to get some feedback on style and implementation, as this is my first PR on anything Python.

DerLeole commented 3 months ago

Small sad update:

Seems like the changes worked on my home machine, but deploying them to my server within a Docker container still resulted in failure. After some tests it looks like the entire script tag that has the captcha script in it isn't even included in the original HTML response, so there must be some kind of secondary safeguard that prevents the captcha from even appearing in some cases.

Using Mullvad VPN on the container also made no difference, and neither did some of the driver options mentioned here before. I'm at my wits' end for now.

The weirdest part is that afaik the AWS WAF documentation doesn't include any features that would hide their scripts serverside in some cases. Very curious.
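To see quickly whether a given response contains the captcha script at all, a small check like the following can be run on the page source before calling the solver. It is only a sketch: it reuses the `jsapi.js` pattern from the example script earlier in this thread, and the function name and sample HTML are made up.

```python
import re

# Same pattern as in the example script earlier in this thread.
JSAPI_PATTERN = re.compile(r'src="([^"]*jsapi\.js)"')


def find_captcha_script(page_source):
    """Return the URL of the first jsapi.js script tag, or None if there is none."""
    match = JSAPI_PATTERN.search(page_source)
    return match.group(1) if match else None


# Made-up sample documents, one with and one without the script tag.
with_script = '<script type="text/javascript" src="https://example.invalid/abc/jsapi.js"></script>'
without_script = "<html><body>Ich bin kein Roboter</body></html>"

print(find_captcha_script(with_script))     # https://example.invalid/abc/jsapi.js
print(find_captcha_script(without_script))  # None
```

Logging the result on both machines would at least show whether the server is served a different page than the home machine.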

codders commented 3 months ago

@jukoson Thanks so much for contributing the patches and info that @DerLeole then developed into a PR, and for sharing your solution to the riddle of the Immoscout captcha. I don't have a problem with you mentioning immobilien-bot here - I'm happy to finally know who it is that's behind the site. And yes - thanks for the link to Flathunter! If there are ways to bring your code and our code closer together so that collaboration is easier and we both do less maintenance work, I'm happy to talk about it.

Meatplay commented 3 months ago

I'm currently also working on a fix for 2captcha. The IV and context are rotated via additional network requests, which I captured with selenium-wire. I could actually get a solved captcha back from 2captcha, but for some reason the response from 2captcha did not work. I contacted them, and they sent me a solution using their coordinate solver (a generic picture captcha solver). I will try to integrate it and open a PR.

DerLeole commented 3 months ago

Awesome! I got to that point as well, but their solution never worked and the length was significantly shorter than that of a correctly, manually solved one.

If you don't want to use selenium-wire, there is a way to get all background request IDs through logging and then use the Chrome DevTools Protocol (CDP) to access all the request data. That's what I did in my initial attempt in the PR.
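A sketch of that logging approach, assuming Chrome was started with performance logging enabled (the `goog:loggingPrefs` capability set to `{"performance": "ALL"}`). `driver.get_log("performance")` and `driver.execute_cdp_cmd("Network.getResponseBody", ...)` are regular Selenium calls on Chromium drivers; the helper names and the URL filter are only for illustration and are not taken from the PR.

```python
import json


def matching_request_ids(log_entries, url_part):
    """Collect request ids of responses whose URL contains url_part.

    log_entries is what driver.get_log("performance") returns: a list of
    dicts whose "message" field is a JSON string wrapping a DevTools event.
    """
    request_ids = []
    for entry in log_entries:
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue
        params = event["params"]
        if url_part in params["response"]["url"]:
            request_ids.append(params["requestId"])
    return request_ids


def fetch_bodies(driver, url_part):
    """Return {request_id: body} for all matching background requests."""
    bodies = {}
    for request_id in matching_request_ids(driver.get_log("performance"), url_part):
        result = driver.execute_cdp_cmd(
            "Network.getResponseBody", {"requestId": request_id})
        bodies[request_id] = result["body"]
    return bodies
```

Note that `Network.getResponseBody` can fail for responses Chrome has already discarded, so a real implementation would want error handling around that call.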

AntonKorobkov commented 2 months ago

Hey @DerLeole thank you for implementing your solution in a PR. I'd really like to see this merged as a fellow immoscout enjoyer, so I added small fixes like code style changes to your fork, that (hopefully) will speed up this process. Please check it and merge so it can be propagated to the main repo, when you have time. Thanks in advance!

DerLeole commented 2 months ago

@AntonKorobkov Thank you very much, and sorry for the delay on this; the past 2 weeks got busy.

I actually got an answer from 2captcha support on how to implement captcha solving using a universal approach, which I want to try. Support also said they are working on support for the new AWS implementation.

Looking into that next week.

Meatplay commented 2 months ago

@DerLeole I already implemented the coordinate solver for 2captcha and it does work. But currently, after one hour, it raises an exception on the second try. One could probably fix this by just using a while loop instead of the backoff package, but I wanted it to be consistent with the rest of the project. I currently do not have too much time to look into the problem. But I opened a PR with my current status, so feel free to adapt it.

631
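A plain retry loop of the kind mentioned above could look roughly like this. It is only a sketch: `solve_once` stands in for whatever function submits the captcha and raises on failure, and the attempt count and delay are arbitrary defaults, not values from the PR.

```python
import time


def solve_with_retries(solve_once, max_attempts=3, delay_seconds=5.0):
    """Call solve_once() until it succeeds or max_attempts is reached."""
    last_error = None
    attempt = 0
    while attempt < max_attempts:
        attempt += 1
        try:
            return solve_once()
        except Exception as error:  # narrow this to the solver's own exceptions
            last_error = error
            print(f"Attempt {attempt} of {max_attempts} failed: {error}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(
        f"Captcha not solved after {max_attempts} attempts") from last_error
```

Because the loop keeps no state between calls, a failure in one crawl run does not affect the retries of the next one.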

MehrAmoonCraft commented 3 weeks ago

Has anyone fixed the PR?