flathunters / flathunter

A bot to help people with their rental real-estate search. 🏠 🤖
GNU Affero General Public License v3.0

Immoscout get HTTP 405 #119

Closed choeffer closed 2 years ago

choeffer commented 3 years ago

I get several errors like this, even with "100% Recognition" enabled at 2captcha. Any ideas?

[2021/04/20 22:41:28|abstract_crawler.py|ERROR   ]: Got response (405): b'<!DOCTYPE html>\n<html>\n\n<head>\n    <script>\n        (function () {\n            try {\n                if (typeof sessionStorage !== \'undefined\') {\n                    sessionStorage.setItem(\'distil_referrer\', document.referrer); \n                }\n            } catch (e) {}\n        })()\n    </script>\n    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1, minimum-scale=1, maximum-scale=1" />\n    <meta http-equiv="X-UA-Compatible" content="IE=edge" />\n    <meta name="robots" content="noindex, nofollow">\n    <meta http-equiv="cache-control" content="no-cache, no-store, must-revalidate">\n    <meta http-equiv="pragma" content="no-cache">\n    <meta http-equiv="expires" content="0">\n    <title>Ich bin kein Roboter - ImmobilienScout24</title>\n    <link rel="icon" type="image/vnd.microsoft.icon" href="https://www.immobilienscout24.de/favicon.ico" />\n    <link rel="shortcut icon" type="image/vnd.microsoft.icon" href="https://www.immobilienscout24.de/favicon.ico" />\n    <style>\n        @font-face {\n            font-family: "Make It Sans IS24 Web";\n            font-style: normal;\n            font-weight: 400;\n            font-display: swap;\n            src: url("https://www.static-immobilienscout24.de/fro/core/4.4.1/font/vendor/make-it-sans/MakeItSansIS24WEB-Regular.woff2") format("woff2"), url("https://www.static-immobilienscout24.de/fro/core/4.4.1/font/vendor/make-it-sans/MakeItSansIS24WEB-Regular.woff") format("woff");\n        }\n\n        @font-face {\n            font-family: "Make It Sans IS24 Web";\n            font-style: normal;\n            font-weight: 700;\n            font-display: swap;\n            src: url("https://www.static-immobilienscout24.de/fro/core/4.4.1/font/vendor/make-it-sans/MakeItSansIS24WEB-Bold.woff2") format("woff2"), 
url("https://www.static-immobilienscout24.de/fro/core/4.4.1/font/vendor/make-it-sans/MakeItSansIS24WEB-Bold.woff") format("woff");\n        }\n\n        @font-face {\n            font-family: \'IS24Icons\';\n            src: url(\'https://www.static-immobilienscout24.de/fro/core/4.4.1/font/vendor/is24-icons/is24-icons.woff\') format(\'woff\');\n            font-weight: normal;\n            font-style: normal;\n        }\n\n        a,\n        abbr,\n        address,\n        article,\n        aside,\n        audio,\n        b,\n        blockquote,\n        body,\n        canvas,\n        caption,\n        cite,\n        code,\n        dd,\n        del,\n        details,\n        dfn,\n        div,\n        dl,\n        dt,\n        em,\n        fieldset,\n        figcaption,\n        figure,\n        footer,\n        form,\n        h1,\n        h2,\n        h3,\n        h4,\n        h5,\n        h6,\n        header,\n        html,\n        i,\n        iframe,\n        img,\n        input,\n        ins,\n        kbd,\n        label,\n        legend,\n        li,\n        main,\n        mark,\n        menu,\n        nav,\n        object,\n        ol,\n        p,\n        pre,\n        q,\n        samp,\n        section,\n        select,\n        small,\n        span,\n        strong,\n        sub,\n        summary,\n        sup,\n        table,\n        tbody,\n        td,\n        textarea,\n        tfoot,\n        th,\n        thead,\n        time,\n        tr,\n        ul,\n        var,\n        video {\n            -ms-box-sizing: border-box;\n            -o-box-sizing: border-box;\n            box-sizing: border-box;\n            margin: 0;\n            padding: 0;\n            border: 0;\n            outline: 0;\n        }\n\n        html {\n            font-size: 62.5%;\n        }\n\n        body {\n            background-color: #fff;\n            color: #333;\n            font-size: 1.4em;\n            line-height: 1.61;\n            font-family: "Make It 
Sans IS24 Web", Verdana, "DejaVu Sans", Arial, Helvetica, sans-serif;\n        }\n\n        .page-wrapper {\n            margin-left: auto;\n            margin-right: auto;\n            max-width: 1170px;\n            background-color: #fff;\n        }\n\n        .grid {\n            display: block;\n            margin-right: 0;\n        }\n\n        .grid:after {\n            display: table;\n            clear: both;\n            content: "";\n        }\n\n        .grid-item {\n            display: block;\n            float: left;\n            vertical-align: top;\n            text-align: left;\n        }\n\n        .header {\n            border-bottom: 1px solid #e0e0e0;\n        }\n\n        .header .grid {\n            padding-left: 70px;\n            padding-right: 70px;\n            padding-top: 14px;\n            padding-bottom: 14px;\n        }\n\n        .header .logo {\n            width: 50%;\n            float: left;\n        }\n\n        .header .logo img {\n            vertical-align: top;\n        }\n\n        .header .login-button {\n            width: 50%;\n            text-align: right;\n            float: left;\n        }\n\n        .header .login-button a {\n            padding-top: .35714286em;\n            padding-bottom: .35714286em;\n            min-width: 9.42857143em;\n            font-family: "Make It Sans IS24 Web", Verdana, "DejaVu Sans", Arial, Helvetica, sans-serif;\n            border-radius: 8px;\n            background-color: #fff;\n            display: inline-block;\n            border: 1px solid #333333;\n            padding: .64285714em 1.64285714em;\n            font-weight: 600;\n            font-size: 1.4rem;\n            text-align: center;\n            letter-spacing: .2px;\n            line-height: 1.42857143em;\n            white-space: nowrap;\n            cursor: pointer;\n            color: #333333;\n        }\n\n        .header .login-button a:link,\n        .header .login-button a:visited,\n        .header 
.login-button a:focus,\n        .header .login-button a:hover {\n            text-decoration: none;\n            color: #333333;\n        }\n\n        .header .login-button a:hover {\n            background-color: #eaeaea;\n        }\n\n        .main {\n            clear: both;\n            padding-top: 55px;\n            max-width: 583px;\n            margin-left: auto;\n            margin-right: auto;\n            text-align: center;\n        }\n\n        .main .headline {\n            font-size: 4.0rem;\n            font-weight: bold;\n            letter-spacing: 0px;\n            line-height: 4.8rem;\n            text-align: center;\n        }\n\n        .main .main__logo {\n            padding-top: 10px;\n            text-align: center;\n        }\n\n        .main .main__logo img {\n            height: 240px;\n            width: 240px;\n            vertical-align: top;\n        }\n\n        .main .main__part1 {\n            padding-top: 11px;\n            font-size: 1.4rem;\n            font-weight: bold;\n            letter-spacing: 0px;\n            line-height: 20px;\n        }\n\n        .main .main__captcha {\n            padding-top: 36px;\n            padding-bottom: 36px;\n        }\n\n        .main .main_part2_header1 {\n            font-weight: bold;\n        }\n\n        .main .main_part2_header2 {\n            font-weight: bold;\n            padding-top: 16px;\n        }\n\n        .main .main__list {\n            padding-top: 14px;\n            padding-bottom: 42px;\n        }\n\n        .main .main__list ul li {\n            list-style-position: inside;\n        }\n\n        .footer {\n            background: #f2f2f2;\n            text-align: center;\n        }\n\n        .footer .footer-content {\n            max-width: 583px;\n            margin-left: auto;\n            margin-right: auto;\n            padding-top: 15px;\n            padding-bottom: 6px;\n            color: #757575;\n            font-size: 1.2rem;\n            line-height: 
1.6rem;\n        }\n\n        .footer .footer-content div {\n            padding-top: 20px;\n        }\n\n        .footer .footer-content div:first-child {\n            padding-top: 0;\n        }\n\n        .footer .footer-content a,\n        .footer .footer-content a:visited,\n        .footer .footer-content a:link,\n        .footer .footer-content a:focus,\n        .footer .footer-content .legend {\n            color: #757575;\n            font-size: 1.2rem;\n            line-height: 1.6rem;\n            text-decoration: none;\n        }\n\n        .footer .footer-content a:hover {\n            color: #757575;\n            font-size: 1.2rem;\n            line-height: 1.6rem;\n            text-decoration: underline;\n        }\n\n        .g-recaptcha {\n            display: inline-block;\n        }\n        \n        .geetest_holder {\n            margin: 0 auto;\n        }\n\n        @media (max-width: 668px) {\n            .palm-hide {\n                display: none;\n            }\n\n            .header .grid {\n                padding-left: 16px;\n                padding-right: 16px;\n                padding-top: 8px;\n                padding-bottom: 8px;\n            }\n\n            .main {\n                padding-top: 32px;\n                padding-left: 16px;\n                padding-right: 16px;\n            }\n\n            .main .headline {\n                font-size: 3.2rem;\n                font-weight: normal;\n                line-height: 4.0rem;\n            }\n\n            .main .main__logo img {\n                height: 188px;\n                width: 188px;\n            }\n\n            .footer .footer-content {\n                padding-bottom: 32px;\n            }\n\n        }\n    </style>\n\n    <script>\n        function showBlockPage() {\n            console.log("showing block page");\n        }\n        setTimeout(showBlockPage, 10000);\n    </script>\n    <script type="text/javascript" src="/assets/immo-1-17" async defer></script>\n    
\n    <script>\n    window.captchaDescription = \'<p>Nachdem du das unten stehende CAPTCHA best\xc3\xa4tigt hast, wirst du sofort auf die von dir angefragte Seite weitergeleitet.</p>\';\n    window.geetestLang = \'de\';\n    </script>\n    \n    <script src=\'https://www.google.com/recaptcha/api.js?hl=de\'></script>\n    \n                    <script src="https://static.geetest.com/static/tools/gt.js"></script>\n                       <script>\n                          initGeetest({\n                            gt: "0fdbade8a0fe41cba0ff758456d23dfa",\n                            challenge: "8ceb4d705b1888572186821a13f88a1e",\n                            offline: false,\n                            new_captcha: true,\n                            lang: window.geetestLang || "en",\n                          }, function (captchaObj) {\n                            captchaObj.onSuccess(function () {\n                                var obj = captchaObj.getValidate();\n                                solvedCaptcha({\n                                    geetest_challenge: obj.geetest_challenge,\n                                    geetest_seccode: obj.geetest_seccode,\n                                    geetest_validate: obj.geetest_validate,\n                                    data: "3:jLlXJG3MOjTMyjWA1ZYXvA==:FhtcP9zcFCS+qL3P2GLawQzwnMmAKpetjy4tsCzNHd5V3l1qtQyG+5MqG8c2S54Q:SDN/BDvW0xDi1k/WZFZEObyBH0kUStOM6NC2jzo3uXg="\n                                });\n                            });\n                            captchaObj.appendTo(\'#captcha-box\');\n                          });\n                       </script>\n                    \n                    <script>\n                        function solvedCaptcha(payload) {\n                            const timeoutMs = 10000;\n                            protectionSubmitCaptcha("geetest", payload, timeoutMs, 
"3:zPx5s4tjWiqDMbvY6BbRwQ==:TWXQlUP5uV4ajSwdUVE+Kh3fC/392zbuZ1tfX/u5ugegl9dGOzeEy5Pyf5SMPX5M2luacrjXyHANvljEFpH3hSEHeIB31m1jbQygumf8/yFWjGYuLwDMG96mIXPeSP1Q0xzsQuti4M4FFBQpLniCZIyTppZ6jshKN8sKrrVKSOfNBEO3rcqevFLf3MGqdlmf:703aLNLcyoi3tzTkMgGfreQTnRyzS6Ouog6WE8nqckQ=")\n                                .then(\n                                    function() {\n                                        window.location.reload(true);\n                                    },\n                                    function(error) {\n                                        console.log(error);\n                                    },\n                                );\n                        }\n                    </script>\n                \n</head>\n\n<body>\n\n    <div class="header">\n        <div class="page-wrapper">\n            <div class="grid">\n                <div class="logo grid-item">\n                    <a href="https://www.immobilienscout24.de/">\n                        <img src="https://www.static-immobilienscout24.de/fro/imperva/0.0.1/is24-logo.svg"\n                            alt="ImmoScout24 Logo">\n                    </a>\n                </div>\n                <div class="login-button grid-item">\n                    <a\n                        href="https://www.immobilienscout24.de/geschlossenerbereich/start.html?source=meinkontodropdown-login">\n                        Anmelden <span class="palm-hide">/ Registrieren</span>\n                    </a>\n                </div>\n            </div>\n        </div>\n    </div>\n\n    <div class="page-wrapper">\n\n        <div class="main">\n            <div class="headline">\n                \n                \n                Ich bin kein Roboter\n            </div>\n            <div class="main__logo">\n                <img src="https://www.static-immobilienscout24.de/fro/imperva/0.0.1/robot-logo.svg" alt="Roboter Logo">\n            </div>\n            <div class="main__part1">\n                \n         
       \n                Du bist ein Mensch aus Fleisch und Blut? Entschuldige bitte, dann hat unser System dich\n                f\xc3\xa4lschlicherweise als Roboter identifiziert. Um unsere Services weiterhin zu nutzen, l\xc3\xb6se bitte diesen\n                kurzen Test.\n            </div>\n\n            <div class="main__captcha">\n                \n                <div id="explanation" class="container">\n                    \n                    <script>\n                    showBlockPage()\n                    document.writeln(window.captchaDescription || "<p>After completing the CAPTCHA below, you will immediately regain access to the site again.</p>");\n                    </script>\n                <div id="captcha-box"></div>\n                </div>\n            </div>\n\n            <script type="text/javascript" charset="UTF-8">\n                const translatedStrings = {\n                    toRegainAccess: {\n                        EN: "To regain access, please make sure that cookies and JavaScript are enabled before reloading the page",\n                        DE: "Um wieder Zugriff zu erhalten, stelle bitte sicher, dass Cookies und JavaScript aktiviert sind, bevor du die Seite neu l\xc3\xa4dst",\n                    },\n                };\n\n                function translateDoc(language, text) {\n                    let replacement = text;\n\n                    Object.entries(translatedStrings).forEach(([key, value]) => {\n                        // Checks English string is present and a translation for the selected language exists before attempting to replace\n                        if (value.EN && value[language]) {\n                            replacement = replacement.replace(value.EN, value[language]);\n                        }\n                    });\n\n                    return replacement;\n                }\n\n                document.addEventListener("DOMContentLoaded", function () {\n                    const impervaContent = 
document.getElementById("explanation")\n                        .outerHTML;\n\n                    const translatedContent = translateDoc("DE", impervaContent);\n\n                    document.body.innerHTML = document.body.innerHTML.replace(\n                        impervaContent,\n                        translatedContent\n                    );\n                });\n            </script>\n\n            <div class="main__part2">\n                <div class="main_part2_header1">Warum f\xc3\xbchren wir diese Sicherheitsma\xc3\x9fnahme durch?</div>\n                <div class="main_part2_text1">Mit der Captcha-Methode stellen wir fest, dass du kein\n                    Roboter oder eine sch\xc3\xa4dliche Spam-Software bist. Damit sch\xc3\xbctzen wir unsere Webseite und die Daten\n                    unserer Nutzerinnen und Nutzer vor betr\xc3\xbcgerischen Aktivit\xc3\xa4ten.</div>\n\n                <div class="main_part2_header2">Warum haben wir deine Suchanfragen blockiert?</div>\n                <div class="main_part2_text2">Es kann verschiedene Gr\xc3\xbcnde haben, warum wir dich f\xc3\xa4lschlicherweise als\n                    Roboter identifiziert haben. 
M\xc3\xb6glicherweise</div>\n\n            </div>\n            <div class="main__list">\n                <ul>\n                    <li>hast du die Cookies f\xc3\xbcr unsere Seite deaktiviert.</li>\n                    <li>hast du die Ausf\xc3\xbchrung von JavaScript deaktiviert.</li>\n                    <li>nutzt du ein Browser-Plugin eines Drittanbieters, beispielsweise einen Ad-Blocker.</li>\n                    <li>hast du in kurzer Zeit mehr Anfragen an unser System gestellt, als es\n                        \xc3\xbcblicherweise der Fall ist.</li>\n                </ul>\n            </div>\n\n\n        </div>\n\n    </div>\n\n    <div class="footer">\n        <div class="footer-content">\n\n\n            <div>\n                <a href="https://www.immobilienscout24.de/unternehmen.html">\xc3\x9cber uns</a> |\n                <a href="https://www.immobilienscout24.de/kontakt.html">Kontakt & Hilfe</a> |\n                <a href="https://www.immobilienscout24.de/unternehmen/karriere/">Karriere</a> |\n                <a href="https://www.immobilienscout24.de/sitemap.html">Sitemap</a> |\n                <a href="https://api.immobilienscout24.de">Developer</a> |\n                <a href="https://www.immobilienscout24.de/unternehmen/mediendienst.html">Presseservice</a> |\n                <a href="https://www.immobilienscout24.de/ratgeber/newsletter.html">Newsletter abonnieren</a> |\n                <a href="https://www.immobilienscout24.de/impressum.html">Impressum</a> |\n                <a href="https://www.immobilienscout24.de/agb.html">AGB\'s & Rechtliche Hinweise</a> |\n                <a\n                    href="https://www.immobilienscout24.de/agb/verbraucherinformationen.html">Verbraucherinformationen</a>\n                |\n                <a href="https://www.immobilienscout24.de/agb/datenschutz.html">Datenschutz</a> |\n                <a href="https://www.immobilienscout24.de/lp/Geodatenkodex.html">Datenschutz Kodex f\xc3\xbcr\n                    
Geodatendienste</a> |\n                <a href="https://sicherheit.immobilienscout24.de">Sicherheit</a>\n            </div>\n            <div>\n                <!--<a href="">Immobiliensuche</a> | -->\n                <a href="https://www.scout24media.com/">Werbung</a> |\n                <a href="https://blog.immobilienscout24.de">Blog</a>\n                <!--|\n            <a href="">Nachbarschaft</a> |\n            <a href="">Gratis! E-Mail-Adresse @t-online.de</a>-->\n            </div>\n            <div>\n                <a href="https://www.immobilienscout24.de/">www.ImmobilienScout24.de</a>\n            </div>\n            <div class="legend">\n                \xc2\xa9 Copyright 1999 - 2021 Immobilien Scout GmbH\n            </div>\n        </div>\n\n    </div>\n\n</body>\n\n</html>\n'
Traceback (most recent call last):
  File "/home/choeffer/Dokumente/flathunter/flathunt.py", line 89, in <module>
    main()
  File "/home/choeffer/Dokumente/flathunter/flathunt.py", line 86, in main
    launch_flat_hunt(config)
  File "/home/choeffer/Dokumente/flathunter/flathunt.py", line 46, in launch_flat_hunt
    hunter.hunt_flats()
  File "/home/choeffer/Dokumente/flathunter/flathunter/hunter.py", line 42, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
  File "/home/choeffer/Dokumente/flathunter/flathunter/hunter.py", line 21, in crawl_for_exposes
    return chain(*[searcher.crawl(url, max_pages)
  File "/home/choeffer/Dokumente/flathunter/flathunter/hunter.py", line 21, in <listcomp>
    return chain(*[searcher.crawl(url, max_pages)
  File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 136, in crawl
    return self.get_results(url, max_pages)
  File "/home/choeffer/Dokumente/flathunter/flathunter/crawl_immobilienscout.py", line 60, in get_results
    soup = self.get_page(search_url, self.driver, page_no)
  File "/home/choeffer/Dokumente/flathunter/flathunter/crawl_immobilienscout.py", line 120, in get_page
    return self.get_soup_from_url(search_url.format(page_no), driver=driver, captcha_api_key=self.captcha_api_key, checkbox=self.checkbox, afterlogin_string=self.afterlogin_string)
  File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 75, in get_soup_from_url
    self.resolvecaptcha(driver, checkbox, afterlogin_string, captcha_api_key)
  File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 151, in resolvecaptcha
    iframe_present = self._check_if_iframe_visible(driver)
  File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 207, in _check_if_iframe_visible
    iframe = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
  File "/home/choeffer/Dokumente/flathunter/venv/lib64/python3.9/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: 

choeffer commented 3 years ago

Interestingly, it sometimes works. But over the last few days it has been crashing more often. Maybe they have changed their bot detection?

choeffer commented 3 years ago

Seems that I can get rid of the error by getting a new IP address.

But I still get

[2021/04/20 23:21:29|config.py         |INFO    ]: Using config /home/choeffer/Dokumente/flathunter/config.yaml
Traceback (most recent call last):
  File "/home/choeffer/Dokumente/flathunter/flathunt.py", line 89, in <module>
    main()
  File "/home/choeffer/Dokumente/flathunter/flathunt.py", line 86, in main
    launch_flat_hunt(config)
  File "/home/choeffer/Dokumente/flathunter/flathunt.py", line 46, in launch_flat_hunt
    hunter.hunt_flats()
  File "/home/choeffer/Dokumente/flathunter/flathunter/hunter.py", line 42, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
  File "/home/choeffer/Dokumente/flathunter/flathunter/hunter.py", line 21, in crawl_for_exposes
    return chain(*[searcher.crawl(url, max_pages)
  File "/home/choeffer/Dokumente/flathunter/flathunter/hunter.py", line 21, in <listcomp>
    return chain(*[searcher.crawl(url, max_pages)
  File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 136, in crawl
    return self.get_results(url, max_pages)
  File "/home/choeffer/Dokumente/flathunter/flathunter/crawl_immobilienscout.py", line 60, in get_results
    soup = self.get_page(search_url, self.driver, page_no)
  File "/home/choeffer/Dokumente/flathunter/flathunter/crawl_immobilienscout.py", line 120, in get_page
    return self.get_soup_from_url(search_url.format(page_no), driver=driver, captcha_api_key=self.captcha_api_key, checkbox=self.checkbox, afterlogin_string=self.afterlogin_string)
  File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 75, in get_soup_from_url
    self.resolvecaptcha(driver, checkbox, afterlogin_string, captcha_api_key)
  File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 151, in resolvecaptcha
    iframe_present = self._check_if_iframe_visible(driver)
  File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 207, in _check_if_iframe_visible
    iframe = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
  File "/home/choeffer/Dokumente/flathunter/venv/lib64/python3.9/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: 

This happens with 100% recognition enabled at 2captcha.

choeffer commented 3 years ago

I also often get the same error as above without using 100% recognition at 2captcha. Any ideas?

choeffer commented 3 years ago

According to 2captcha support, the 100% recognition feature only works with Normal captcha.

choeffer commented 3 years ago

Interestingly, sometimes I can still see new flats and it does not crash. But I have not found any pattern. I will investigate further.

yicli commented 3 years ago

I also found that this has been happening very regularly since 19.04.

pneismeis commented 3 years ago

Same here. Maybe it is an issue with their change to a new captcha system? For about a week now, different captchas than before have been appearing.

choeffer commented 3 years ago

@pneismeis I have the same assumption. It might be that a captcha is detected but the old pattern for reCAPTCHA v2 is still used. As a result, the program waits for a response it will never get.

  File "/home/choeffer/Dokumente/flathunter/flathunter/crawl_immobilienscout.py", line 120, in get_page
    return self.get_soup_from_url(search_url.format(page_no), driver=driver, captcha_api_key=self.captcha_api_key, checkbox=self.checkbox, afterlogin_string=self.afterlogin_string)
  File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 75, in get_soup_from_url
    self.resolvecaptcha(driver, checkbox, afterlogin_string, captcha_api_key)
  File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 151, in resolvecaptcha
    iframe_present = self._check_if_iframe_visible(driver)
  File "/home/choeffer/Dokumente/flathunter/flathunter/abstract_crawler.py", line 207, in _check_if_iframe_visible
    iframe = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(

I think we could solve it by supporting the new captcha system that immoscout uses, maybe with a switch between reCAPTCHA and the new system.

choeffer commented 3 years ago

After investigating the content of immoscout's new captcha page, I think https://2captcha.com/de/2captcha-api#solving_geetest needs to be implemented for immoscout. I was able to find the gt and challenge values there.

In https://github.com/flathunters/flathunter/blob/main/flathunter/abstract_crawler.py#L160-L182 there are some POST and GET requests which need to be modified, I guess. But I haven't yet understood the whole construct of flathunter around this function.
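As a rough sketch of how those requests might change for GeeTest, going by the 2captcha API page linked above: the submit call would use `method=geetest` together with the `gt`, `challenge` and `pageurl` values, and the result would be polled by captcha id as before. The helper names below are hypothetical and not part of flathunter, and nothing is sent over the network here; the request parameters are only assembled.

```python
from urllib.parse import urlencode

# Endpoints as I understand them from the 2captcha docs; treat them as an assumption.
SUBMIT_URL = "https://2captcha.com/in.php"
RESULT_URL = "https://2captcha.com/res.php"


def build_geetest_submit_params(api_key, gt, challenge, page_url):
    """Assemble the form parameters for submitting a GeeTest captcha (hypothetical helper)."""
    return {
        "key": api_key,
        "method": "geetest",
        "gt": gt,
        "challenge": challenge,
        "pageurl": page_url,
        "json": 1,
    }


def build_result_url(api_key, captcha_id):
    """Build the URL used to poll for the solution of a submitted captcha (hypothetical helper)."""
    query = urlencode({"key": api_key, "action": "get", "id": captcha_id, "json": 1})
    return f"{RESULT_URL}?{query}"
```

The parameters could then be posted with whatever HTTP client the crawler already uses, with the poll URL requested until a solution or an error comes back.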

codders commented 3 years ago

Hey @choeffer ,

Thanks for investigating that - it really helps a lot to have someone report the issue and do some investigating. I'll try and take a look at a fix in the coming days.

choeffer commented 3 years ago

This part of the bot-protection page seems to contain the relevant code:

<script src='https://www.google.com/recaptcha/api.js?hl=de'></script>
                    <script src="https://static.geetest.com/static/tools/gt.js"></script>
                       <script>
                          initGeetest({
                            gt: "0fdbade8a0fe41cba0ff758456d23dfa",
                            challenge: "5b64391babf2bc5a6b2d9a8340cd6399",
                            offline: false,
                            new_captcha: true,
                            lang: window.geetestLang || "en",
                          }, function (captchaObj) {
                            captchaObj.onSuccess(function () {
                                var obj = captchaObj.getValidate();
                                solvedCaptcha({
                                    geetest_challenge: obj.geetest_challenge,
                                    geetest_seccode: obj.geetest_seccode,
                                    geetest_validate: obj.geetest_validate,
                                    data: "3:X41YXeKEoY0Jt0g2trLvbg==:/iA+r889CvCKwh46gxWwkl1izbJlcVCnnU54hH/WLFm69/FkZEjLxcTiMnxho+Rf:YZ/wjiT5RG6qmrxPKDCpTrpoB+jZfHm259Ys8WNH71Q="
                                });
                            });
                            captchaObj.appendTo('#captcha-box');
                          });
                       </script>

                    <script>
                        function solvedCaptcha(payload) {
                            const timeoutMs = 10000;
                            protectionSubmitCaptcha("geetest", payload, timeoutMs, "3:ISKYPxyVqWelP+kqAzjRkg==:W3jUem1HommRbe3pRu6ZAlFZGCt5pbcLOTcmK7jsFzF9Pa+Wd+KxEpqATLDsObJm5H8SFp0FslvUQFssA0Jo/broaq7x/D42lyFauv5P+yQFjfk98ioAdzqUNu1kn/B+rAy3jOyWoCvzvn2lalTp09UMvb9PjwRKL+mUWLlft2nqX54cbQHxb762Awms0LqJ:d+muV3Xl0DmwjVBUgrIEuK64ZGO0gzWL6etLd2BugSA=")
                                .then(
                                    function() {
                                        window.location.reload(true);
                                    },
                                    function(error) {
                                        console.log(error);
                                    },
                                );
                        }
                    </script>

@codders I hope this snippet helps you as well. Thanks again for your effort in fixing and maintaining this useful tool!
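For illustration, the gt and challenge values could be pulled out of a page like the one above with two regular expressions. This is only a sketch: `extract_geetest_params` is a hypothetical helper, and it assumes the values keep appearing as quoted strings inside the `initGeetest({...})` call.

```python
import re

# Match `gt: "..."` and `challenge: "..."` as they appear in the initGeetest call.
GT_PATTERN = re.compile(r'\bgt:\s*"([^"]+)"')
CHALLENGE_PATTERN = re.compile(r'\bchallenge:\s*"([^"]+)"')


def extract_geetest_params(page_source):
    """Return (gt, challenge) from the page source, or None if either value is missing."""
    gt = GT_PATTERN.search(page_source)
    challenge = CHALLENGE_PATTERN.search(page_source)
    if gt is None or challenge is None:
        return None
    return gt.group(1), challenge.group(1)


snippet = '''
initGeetest({
  gt: "0fdbade8a0fe41cba0ff758456d23dfa",
  challenge: "5b64391babf2bc5a6b2d9a8340cd6399",
  offline: false,
'''
print(extract_geetest_params(snippet))
```

Whether a challenge extracted this way is still accepted by the solving service is a separate question that this sketch does not address.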

choeffer commented 3 years ago

@codders Just one thought. Maybe it might be possible and useful to leave the captchav2 code in place and add the geetest code with a switch, as they might use both in parallel and decide from time to time which they deliver. So the program could decide on the fly which method to use for the delivered captcha (if this is possible and easy to implement).
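Such an on-the-fly switch might look roughly like the sketch below. `detect_captcha_type` is a hypothetical name, and the marker strings are assumptions taken from the page source posted earlier in this thread (`initGeetest` for GeeTest, `g-recaptcha` for reCAPTCHA).

```python
def detect_captcha_type(page_source):
    """Return "geetest", "recaptcha", or None depending on which marker appears."""
    # GeeTest is checked first: the protection page posted above contains
    # reCAPTCHA markers (the api.js script, a .g-recaptcha CSS rule) even
    # when the GeeTest puzzle is the captcha actually being served.
    if "initGeetest" in page_source:
        return "geetest"
    if "g-recaptcha" in page_source:
        return "recaptcha"
    return None
```

The crawler could then call the matching solver for the returned type and skip captcha handling entirely when the result is `None`.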

choeffer commented 3 years ago

https://github.com/2captcha/2captcha-python could help to integrate many solvers at once with a similar pattern, without sending plain POST and GET requests.

choeffer commented 3 years ago

https://2captcha.com/de/p/geetest has some more information about GeeTest and a Python code example. The information on the 2captcha website is somewhat cluttered.

choeffer commented 3 years ago

And sometimes you only need to click a button to verify that you are not a robot, so they do not always roll out the GeeTest puzzle. I will try to provide that code snippet as well the next time it appears.

codders commented 3 years ago

I had a look at this today. I was able to detect that GeeTest was present, and to do so without disabling the reCAPTCHA support. Unfortunately, I get ERROR_CAPTCHA_UNSOLVABLE back from the 2captcha API whenever I submit a GeeTest token and challenge.

I think this is related to the detail in the 2captcha API docs:

Important: you should get a new challenge value for each request to our API. Once captcha was loaded on the page the challenge value becomes invalid. You should inspect requests made to the website when page is loaded to identify a request that gets a new challenge value. Then you should make such request each time to get a valid challenge value.

When the Selenium browser gets the GeeTest token and challenge, the captcha has already been displayed in the browser and a bunch of other JavaScript has already run, which means that the challenge is already invalidated. The design of that page is tricky: the challenge isn't there when the page first loads, and it only shows up later, after JavaScript has run.

So it's a pretty nasty (and not much fun) reverse engineering problem to get those details in a clean way, and I haven't had any success so far. I'm very open to other people taking a shot at it, but I don't feel like I'm in a place where I can dig deeper into it myself right now.

Sorry for that. In case you / anyone is interested, I've attached what I was trying to do here.

diff --git a/flathunter/abstract_crawler.py b/flathunter/abstract_crawler.py
index 35a4c84..56b0df3 100644
--- a/flathunter/abstract_crawler.py
+++ b/flathunter/abstract_crawler.py
@@ -71,7 +71,13 @@ class Crawler:
             return self.get_soup_with_proxy(url)
         if driver is not None:
             driver.get(url)
-            if re.search("g-recaptcha", driver.page_source):
+            sleep(4)
+            self.__log__.debug("Checking geetest: %s" % driver.execute_script(f'return window.GeeChallenge'))
+            if re.search("initGeetest", driver.page_source):
+                self.__log__.debug("Found geetest captcha - attempting to solve")
+                self.resolvegeetestcaptcha(driver, captcha_api_key)
+            elif re.search("g-recaptcha", driver.page_source):
+                self.__log__.debug("Found recaptcha captcha - attempting to solve")
                 self.resolvecaptcha(driver, checkbox, afterlogin_string, captcha_api_key)
             return BeautifulSoup(driver.page_source, 'html.parser')
         return BeautifulSoup(resp.content, 'html.parser')
@@ -147,6 +153,12 @@ class Crawler:
         """Loads additional detalis for an expose. Should be implemented in the subclass"""
         return expose

+    def resolvegeetestcaptcha(self, driver, api_key: str):
+        gt = re.search('gt: \"([^"]+)\",', driver.page_source)
+        challenge = re.search('challenge: \"([^"]+)\",', driver.page_source)
+        if (gt is not None and challenge is not None):
+            self._solve_geetest(driver, api_key, gt.group(1), challenge.group(1))
+
     def resolvecaptcha(self, driver, checkbox: bool, afterlogin_string: str = "", api_key: str = None):
         iframe_present = self._check_if_iframe_visible(driver)
         if checkbox is False and afterlogin_string == "" and iframe_present:
@@ -180,6 +192,28 @@ class Crawler:
         driver.execute_script(f'solvedCaptcha("{recaptcha_answer}")')
         self._check_if_iframe_not_visible(driver)

+    def _solve_geetest(self, driver, api_key: str, gt: str, challenge: str):
+        url = driver.current_url
+        self.__log__.debug(f"Attempting with gt: {gt} challenge: {challenge}")
+        session = requests.Session()
+        postrequest = (
+            f"http://2captcha.com/in.php?key={api_key}&method=geetest&gt={gt}&challenge={challenge}&pageurl={url}"
+        )
+        captcha_id = session.post(postrequest).text.split("|")[1]
+        geetest_answer = session.get(f"http://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}").text
+        while "CAPCHA_NOT_READY" in geetest_answer:
+            sleep(5)
+            self.__log__.debug("Captcha status: %s", geetest_answer)
+            geetest_answer = session.get(f"http://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}").text
+        self.__log__.debug("Captcha promise: %s", geetest_answer)
+#        recaptcha_answer = recaptcha_answer.split("|")[1]
+#        driver.execute_script(f'document.getElementById("g-recaptcha-response").innerHTML="{recaptcha_answer}";')
+        # TODO: Below function call can be different depending on the websites implementation. It is responsible for
+        #  sending the the promise that we get from recaptcha_answer. For now, if it breaks, it is required to
+        #  reverse engineer it by hand. Not sure if there is a way to automate it.
+#        driver.execute_script(f'solvedCaptcha("{recaptcha_answer}")')
+#        self._check_if_iframe_not_visible(driver)
+
     def _clickcaptcha(self, driver, checkbox: bool):
         driver.switch_to.frame(driver.find_element_by_tag_name("iframe"))
         recaptcha_checkbox = driver.find_element_by_class_name("recaptcha-checkbox-checkmark")

choeffer commented 3 years ago

https://youtu.be/oUKBX0lleUY?t=149 It seems there are three hidden fields which need to be filled out, and they are already present before the puzzle is loaded in a normal Firefox session. The puzzle only loads if you click the button. @codders does this info help you? From my understanding, the hidden fields need to be filled out before executing any further scripts.

choeffer commented 3 years ago

You could also use https://2captcha.com/demo/geetest to verify if the python code is working properly and if the problem is specific to immoscout. They provide a geetest captcha on that page.

choeffer commented 3 years ago

captcha_immo

choeffer commented 3 years ago

captcha_immo2

This happens after clicking the button.

choeffer commented 3 years ago

And after moving the slider, it seems that the hidden fields are filled out and submitted. (I could barely see it, as it happened very fast.)

codders commented 3 years ago

As I said, I've taken a look and it's complicated, for exactly the reasons you're describing. If someone wants to take a deeper look using the clues in this thread, they are very welcome. I'm not available to dive deeper into this right now.

choeffer commented 3 years ago

Ah okay, I thought it might help you. Thanks again for having a look at the issue. Maybe someone else has a good idea how to fix it.

mrzagit commented 3 years ago

Maybe you could scrape immosuchmaschine.de, since that site also scrapes ImmobilienScout24 and a few other relatively unknown sites.

choeffer commented 3 years ago

@codders I tried to dig a bit further with the help of your diff (see the commit mentioned above). I used the Python debugger and was able to verify your result. Interestingly, this also happens on https://2captcha.com/de/demo/geetest, where I would expect it to work.

#Both taken from the website
(Pdb) gt = '81388ea1fc187e0c335c0a8907ff2625'
(Pdb) challenge = 'e4d5929ab1505b0b6a081244d2041403'
(Pdb) url = 'https://2captcha.com/de/demo/geetest'
(Pdb) session = requests.Session()
(Pdb) postrequest = (f"http://2captcha.com/in.php?key={api_key}&method=geetest&gt={gt}&challenge={challenge}&pageurl={url}")
(Pdb) postrequest
'http://2captcha.com/in.php?key=XYZ&method=geetest&gt=81388ea1fc187e0c335c0a8907ff2625&challenge=e4d5929ab1505b0b6a081244d2041403&pageurl=https://2captcha.com/de/demo/geetest'
(Pdb) captcha_id = session.post(postrequest).text.split("|")[1]
(Pdb) session.get(f"http://2captcha.com/res.php?key={api_key}&action=get&id={captcha_id}").text
'ERROR_CAPTCHA_UNSOLVABLE'

I also found out that the value of gt does not seem to change at all (on immoscout).
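The session above builds the request URL by string interpolation and takes `split("|")[1]` on the reply, which raises an IndexError whenever 2captcha answers with a bare error code. A small sketch of a more defensive variant follows. The helper names are hypothetical, and the `OK|<value>` reply format is an assumption inferred from the `split("|")` in the patch and the error string seen here.

```python
from typing import Tuple
from urllib.parse import urlencode


def build_geetest_submit_url(api_key: str, gt: str, challenge: str, page_url: str) -> str:
    """Build the in.php URL with every parameter URL-encoded (hypothetical helper)."""
    query = urlencode({
        "key": api_key,
        "method": "geetest",
        "gt": gt,
        "challenge": challenge,
        "pageurl": page_url,  # contains ':' and '/', so it needs encoding
    })
    return "http://2captcha.com/in.php?" + query


def parse_2captcha_reply(text: str) -> Tuple[bool, str]:
    """Split an "OK|<value>" reply; anything else is returned as an error code."""
    status, separator, value = text.strip().partition("|")
    if status == "OK" and separator:
        return True, value
    return False, text.strip()
```

With this, `ERROR_CAPTCHA_UNSOLVABLE` and `CAPCHA_NOT_READY` come back as `(False, code)` and can be logged or retried instead of crashing the crawler.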

choeffer commented 3 years ago

screen

The magic seems to happen there. I placed a debugger here

    ...
    def _solve_geetest(self, driver, api_key: str, gt: str, challenge: str):
        url = driver.current_url
        import pdb; pdb.set_trace()
        ...

and deactivated --headless in config.yaml.

After looking at the gt.js code I found out that

(Pdb) driver.execute_script('return window.GeeGT')
'0fdbade8a0fe41cba0ff758456d23dfa'
(Pdb) driver.execute_script('return window.GeeChallenge')
'9f2f53e4d928b619e2cfa468cafbfab9'

are passed to window.initGeetest = function (userConfig, callback) { by userConfig.

So the real question is: where is the challenge generated? I am not familiar with JS, but to me it looks like the challenge is requested very early in the process and then passed to initGeetest. Does this help you get further, @codders?
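Reading the two globals from the debugger session can be wrapped in a small helper rather than matching the page source with regular expressions. This is only a sketch: `read_geetest_params` is a hypothetical name, and it assumes the page still exposes `window.GeeGT` and `window.GeeChallenge` as observed above. It does not solve the freshness problem, since the challenge may already be invalid by the time it is read.

```python
from typing import Any, Optional, Tuple


def read_geetest_params(driver: Any) -> Optional[Tuple[str, str]]:
    """Return (gt, challenge) from the page globals, or None if either is missing.

    `driver` is expected to be a Selenium WebDriver; only execute_script is used.
    """
    gt = driver.execute_script("return window.GeeGT")
    challenge = driver.execute_script("return window.GeeChallenge")
    if not gt or not challenge:
        # The globals are set by page JavaScript and may not exist yet.
        return None
    return gt, challenge
```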

choeffer commented 3 years ago

grafik

There also seems to be a mechanism in place which requests a new challenge after reloading the website. Searching for e2ac69af207a8ea398e7f2526961d6f1 revealed these two GET requests.

choeffer commented 3 years ago

reset

This challenge reset functionality is triggered every nine minutes by default. It uses the old challenge and receives a new challenge. So we should also be able to use this method to receive a new challenge and submit the response to 2captcha, as long as the reset is not triggered internally by some JS code in the meantime, which would invalidate the challenge. But if we trigger it manually before the automatic reset and just store/use the response, it should be fine, I guess.

choeffer commented 3 years ago

There are always two requests made: first one to immoscout, then a second one to geetest. Let's investigate the first pair.

first_reset

The request to immoscout looks like the following:

https://www.immobilienscout24.de/assets/immo-1-17?d=www.immobilienscout24.de

Response:

{"token":"3:cR7PM0jDQBCnsLosc0eDFw==:+MdZiMFUVJ+jRhuDq4/U3C44/JGF1km4dDm5OznBxm3NhahMbPpuPFoFb93HK14LXf+xvqOsCvWBlgybpHcgeCiNtCnCkLFTjDK6MJTogaJBW770R+2fNAplVCq64AMj78xqewNuT24Uu1lT8m95dx1OuJdB8DGYWks4snVrSeNQg6xg4ugX0VjXXmkbcpH/rloPJmBzJd3Am7iueuAN1OlZqfbwNBOAbRQAlEESU6cz93BCosnUzn2wWVkJ66jO84upI9viSCtRkB+Dqyc99ibodXpRC30xUOejPc94V7chV0qTRitDoictNW1Y2MNI3S4B7boQFqT93HuCj27m0tS24LwUBD0GfMxzC+Tr0myAblvvQYp11syZQK9eBDNh0paRM32yuHaKatG/wjBJJyueQVI5MdSYT8kOqohgeyVjO5mxGxAhiNX58hWOfTV0:4saj+Y7Q/zsplO2EECWWCmHTF7QUrcCYRVHuzC9AkNM=","renewInSec":680,"cookieDomain":".immobilienscout24.de"}

The first reset request to geetest looks like the following:

https://api.geetest.com/reset.php?gt=0fdbade8a0fe41cba0ff758456d23dfa&challenge=6d213da1d834769551af13cb808a9202&lang=de&w=BLDHwP6bcyA0Dbf2X3wvAAz6LZ(LrYedFaf74Ult6NgRbYbCdLQ3OKhj5OfLH4HA1ZP1DA(TtZhR6RGLlEnZpwM09e3VDO5drPM7o4hMCNbydw6fytKniZdckK7YZQY(P8RBA4d2uTPBCsQpA4vjHMK0qru5p6dQXk42GiDm)zhOuHa5HDFJGaPNydlm8zRejb8nmJcHh5)wmkYRXjbnBxc4vCRdBpFCI3WASRy(KGL7yCgWeE6uq4ozKoQkvAlCOjXTi3UM1iNJdIjT1057G7atogvlaCQNFbU(uAR7NPreqQBWLlYKQkyB0dszoEMw9t6SYxPHXbULQ80h(SSU9FU50TltT8YiFwhWtDj1SIU3rgMArSf3vjuwiD6r06CuEbK)JZBfgIGWA)N38WOwGQvaSWWPlQkhNLPFidSCU)OLsdI1mvRs8eSSrUWfVqi9v1yBnafzDQ8SmqadRRUGOBCZH3ydotxTnmche9apnwmXj)mgKlmQHXsYHrYSoU4rdnioDH0cwrAfy4hbRK9LR6(JYdfiJma2DuvI4IaAFjlDFvB5zu8SFdfsHt9u1W5CN(tdVEzep5YhmppoYCYbJwtv7pcggQ(uhBq43HqR0fh6S8BhBdDH2FttBFwjKzCrhj3qQsCAPAH(KCbIuyYEqAyC1eG)oI(MwO2AWSHfBrT5w1lqbPGDqAho(T76TPw4Df1piHUkLgQ)dGTrDQgmlrdN5uwp8fBv6iVz3FEf5d4kN5DzRXOwpEnYqUwmVtnAIAezQ36ROYcYFO0kpyxc9fkIknEs7A7bmNP6Rt4CyxVj4RZZwFRVF1yD)Zv2ogE51xPCQyimMcHVbyh(IrEl6LhFtOTCgBPlqGSzQuDx(BA7MAkP2Lf6y36t490oSk60W01rcPHvKCDKprbgRx8Ngw..691b23f5664dd8060df5bd0a4af6d2b56deb0f9ffe769ccd7d03ed19f9fccbff03445aee32a0a5fd0997d83f86df41b1d6d0c35e46081e1a0d9704c479e92591254984aa6a994ac23846ecb3b036fa541aa3c6eaf37c2b164bc4cca5e84b64b12ea54bcf095ae864a3cf3970166ad127f61bba2547bf4bca2e32506124f863cc&pt=0&client_type=web&callback=geetest_1622894289835

Response:

geetest_1622894289835({"status": "success", "data": {"s": "2d323943", "c": [12, 58, 98, 36, 43, 95, 62, 15, 12], "challenge": "f0c3a0886226e2fb735fbe833d177665"}})

The subsequent reset requests use the challenge value retrieved from the previous reset request. So it should be possible to request a new token by saving the old challenge value. As seen in https://github.com/flathunters/flathunter/issues/119#issuecomment-855050008, this is also done by the website itself after a timer runs out.
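The reset response shown above is JSONP, so the new challenge has to be unwrapped from the callback before it can be stored. A minimal sketch, assuming the body always has the `callback({...})` shape seen here; `extract_challenge` is a hypothetical name.

```python
import json
import re
from typing import Optional

# Matches "<callback>(<payload>)" with an optional trailing semicolon.
_JSONP = re.compile(r"^\s*[\w$.]+\((.*)\)\s*;?\s*$", re.DOTALL)


def extract_challenge(jsonp_body: str) -> Optional[str]:
    """Return data.challenge from a reset.php JSONP reply, or None on any mismatch."""
    match = _JSONP.match(jsonp_body)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or payload.get("status") != "success":
        return None
    data = payload.get("data")
    return data.get("challenge") if isinstance(data, dict) else None
```

This only covers reading the reply. Producing a valid `w=` parameter for the request itself remains the open problem.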


Conclusion

We need to find out how w=XYZ (from the reset request) is calculated. We also need to find out how both requests (and the earlier response from immoscout with "token":"XYZ") interact with each other, or whether they are independent.

BananaMinion commented 3 years ago

Any news on that problem? Or at least a workaround so it's not crashing?

BananaMinion commented 3 years ago

I got a workaround that stops the crash and gets posts roughly every 5th time: in flathunter/abstract_crawler.py, around line 204, find the method and add the last 2 lines.

def _check_if_iframe_visible(self, driver: selenium.webdriver.Chrome):
        try:
            iframe = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, "iframe[src^='https://www.google.com/recaptcha/api2/anchor?']")))
            return iframe
        except NoSuchElementException:
            print("No iframe found, therefore no chaptcha verification necessary")
        except selenium.common.exceptions.TimeoutException:
            print("Timeout on recaptcha")

choeffer commented 3 years ago

@BananaMinion thanks for digging further. At least this proves that, after some tries/time, they still deliver Google reCAPTCHA and not only GeeTest.

choeffer commented 3 years ago

So a workaround/hack could be to reload the page until a Google reCAPTCHA is loaded(?). Then there would be no need to wait for the next round of visiting immoscout, just hoping to get a reCAPTCHA that time.
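As a rough sketch of that reload idea, with the page loader and the acceptance check passed in as callables so it does not depend on Selenium. `reload_until` and its parameters are hypothetical, and a real crawler would probably want a delay between attempts, which is omitted here.

```python
from typing import Callable, Optional


def reload_until(
    load_page: Callable[[], str],
    wanted: Callable[[str], bool],
    max_attempts: int = 5,
) -> Optional[str]:
    """Call load_page up to max_attempts times; return the first accepted source.

    Returns None if no attempt produced a page that `wanted` accepts.
    """
    for _ in range(max_attempts):
        source = load_page()
        if wanted(source):
            return source
    return None
```

With a Selenium driver, `load_page` could be a small function that calls `driver.get(url)` and returns `driver.page_source`, and `wanted` could check for `g-recaptcha` in the source.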

BananaMinion commented 3 years ago

I'm not sure if they switch to the Google reCAPTCHA or they just don't show any captcha. I could print a debug message to see if the captcha is loaded or not.

arman-ku commented 3 years ago

I am getting the same.

I'm not sure if they switch to the Google reCAPTCHA or they just don't show any captcha.

I don't get any recaptcha anymore, only GeeCaptcha.

If I understand correctly, flathunter just has to be updated to support GeeTest, right? https://2captcha.com/2captcha-api#solving_geetest

BananaMinion commented 3 years ago

Yeah, looks like it. Could someone with more Python skills add this? I only know PHP :D

BananaMinion commented 3 years ago

Well, I'm a step closer but need help now. To bypass GeeTest you need to get the challenge and the gt token. Immobilienscout generates the challenge with an AJAX call which generates some random function. If anyone has a clue how to get the challenge from this call https://static.geetest.com/static/js/fullpage.9.0.7.js I can implement the GeeTest bypass.

lourou commented 2 years ago

I'm experiencing the same here with the Google ReCaptcha error:

(By.CSS_SELECTOR, "iframe[src^='https://www.google.com/recaptcha/api2/anchor?']")))
File "~/.local/share/virtualenvs/flathunter-HYqahW9g/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until raise TimeoutException(message, screen, stacktrace)

Full verbose output:

➜  flathunter git:(main) pipenv run python flathunt.py
[2021/11/18 13:41:08|config.py         |INFO    ]: Using config ~/flathunter/config.yaml
[2021/11/18 13:41:09|flathunt.py       |DEBUG   ]: Settings from config: <flathunter.config.Config object at 0x10a8c9b50>
[2021/11/18 13:41:09|crawl_immobilienscout.py|DEBUG   ]: Got search URL https://www.immobilienscout24.de/Suche/shape/(...)
Traceback (most recent call last):
  File "flathunt.py", line 95, in <module>
    main()
  File "flathunt.py", line 92, in main
    launch_flat_hunt(config)
  File "flathunt.py", line 47, in launch_flat_hunt
    hunter.hunt_flats()
  File "~/flathunter/flathunter/hunter.py", line 42, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
  File "~/flathunter/flathunter/hunter.py", line 22, in crawl_for_exposes
    for searcher in self.config.searchers()
  File "~/flathunter/flathunter/hunter.py", line 23, in <listcomp>
    for url in self.config.get('urls', list())])
  File "~/flathunter/flathunter/abstract_crawler.py", line 136, in crawl
    return self.get_results(url, max_pages)
  File "~/flathunter/flathunter/crawl_immobilienscout.py", line 60, in get_results
    soup = self.get_page(search_url, self.driver, page_no)
  File "~/flathunter/flathunter/crawl_immobilienscout.py", line 120, in get_page
    return self.get_soup_from_url(search_url.format(page_no), driver=driver, captcha_api_key=self.captcha_api_key, checkbox=self.checkbox, afterlogin_string=self.afterlogin_string)
  File "~/flathunter/flathunter/abstract_crawler.py", line 75, in get_soup_from_url
    self.resolvecaptcha(driver, checkbox, afterlogin_string, captcha_api_key)
  File "~/flathunter/flathunter/abstract_crawler.py", line 151, in resolvecaptcha
    iframe_present = self._check_if_iframe_visible(driver)
  File "~/flathunter/flathunter/abstract_crawler.py", line 208, in _check_if_iframe_visible
    (By.CSS_SELECTOR, "iframe[src^='https://www.google.com/recaptcha/api2/anchor?']")))
  File "~/.local/share/virtualenvs/flathunter-HYqahW9g/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:

BananaMinion commented 2 years ago

Yeah, that will not be fixed soon. What I did is rewrite the code a little and start a normal browser instead of a headless one. In this browser I installed the plugin from 2captcha. That works for me.

dnberlin commented 2 years ago

Which browser are you using, @BananaMinion? Could you share your changes?

BananaMinion commented 2 years ago

https://github.com/flathunters/flathunter/issues/134#issuecomment-973226074 I will as soon as I have time.

intxcc commented 2 years ago

Issue could be fixed with

driver.execute_cdp_cmd('Network.setBlockedURLs', {"urls": ["https://api.geetest.com/get.*"]})
driver.execute_cdp_cmd('Network.enable', {})

to prevent Chrome from retrieving the captcha before 2captcha is able to do so.
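Wrapped as a helper, the two CDP calls above might look like the following sketch. `block_geetest_prefetch` is a hypothetical name. It assumes a Chrome-based Selenium driver, since `execute_cdp_cmd` is only available there, and it keeps the URL pattern and call order exactly as posted. It would need to run before `driver.get(url)` so the block is in place when the page loads.

```python
from typing import Any

# URL pattern copied verbatim from the comment above.
GEETEST_GET_PATTERN = "https://api.geetest.com/get.*"


def block_geetest_prefetch(driver: Any) -> None:
    """Stop the browser from fetching the GeeTest captcha itself.

    The idea from the comment above: if Chrome never requests the captcha,
    the challenge should still be unused when 2captcha works on it.
    """
    driver.execute_cdp_cmd("Network.setBlockedURLs", {"urls": [GEETEST_GET_PATTERN]})
    driver.execute_cdp_cmd("Network.enable", {})
```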

BananaMinion commented 2 years ago

Does yours work? I didn't know these commands :) But I'm no Python programmer :D

intxcc commented 2 years ago

Yeah, I tested it: the captcha gets solved and it finds new flats again :) Me neither, I found that via https://stackoverflow.com/a/67850301

BananaMinion commented 2 years ago

Ah nice - I found a solution with a non-headless browser - but I think yours is better.

codders commented 2 years ago

This code is merged now. @choeffer @dnberlin do you want to see if this works for you?

dnberlin commented 2 years ago

Works awesome! We can close this issue I think.

lourou commented 2 years ago

Works perfectly here too! Thanks :)

emibonezzi commented 2 years ago

Guys, I have a similar problem with a script that I wrote. Can you take a look? I'm desperate for a solution.

https://stackoverflow.com/questions/73336088/how-to-send-the-geetest-tokens-once-you-get-the-solutions-from-anti-captcha-usin