flathunters / flathunter

A bot to help people with their rental real-estate search. 🏠🤖
GNU Affero General Public License v3.0

immoscout24 broken somehow #45

Closed choeffer closed 4 years ago

choeffer commented 4 years ago

I am using the URL https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/koeln/wohnung-mieten?sorting=2 and for the past few hours I have only been getting a long printout; no results are sent via the Telegram bot anymore. eBay Kleinanzeigen is still working fine.

file.log

If I can provide more info, please let me know.

codders commented 4 years ago

Hey there,

From your logs, it looks like Immoscout has detected that your flathunter is a bot and has blocked it. Can you browse the site normally in a web browser?

I've not encountered bot-detection on Immoscout before. It would be interesting to know if other users report the same thing.

choeffer commented 4 years ago

Hey,

thanks for your fast reply. I can browse the website normally via Firefox, and I had the same problem at work as at home, so I assume it is not IP related. Does flathunter use cookies or something similar?

choeffer commented 4 years ago

It seems like immoscout is complaining that cookies are not used and JavaScript is disabled.

codders commented 4 years ago

Okay. I see the same thing on my machine, so it looks like they've upgraded their bot detection. I just tried with a fake user agent (so it looks like Firefox instead of a Python script), but that doesn't help. I also tried adding cookie support, and that doesn't fix it either. I'll need to take a deeper look at what they're detecting, and I don't have any time to do that in the coming weeks, I'm afraid.

But thanks for the report - this is something that will be affecting all users.
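
For reference, the two things tried above (a browser-like user agent and cookie support) can be sketched with the standard library alone. This is only an illustration and not flathunter's actual code: the function name `build_browser_like_opener` and the user-agent string are assumptions made for the example, and as noted above these two measures alone did not get past the detection.

```python
import http.cookiejar
import urllib.request

# Hypothetical user-agent string; any current browser string could be used instead.
FAKE_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"


def build_browser_like_opener(user_agent=FAKE_USER_AGENT):
    """Return (opener, cookie_jar) for requests that keep cookies and send a browser-like User-Agent."""
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Replace the default "Python-urllib/x.y" header with the browser-like one
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookie_jar
```

Calling `opener.open(url)` would then send the header and store any cookies the server sets in `cookie_jar` for later requests.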

choeffer commented 4 years ago

I also got the same error page as the crawler in Firefox after a few manual refreshes this morning, so it is not purely flathunter related. After solving a Google captcha I was able to continue using immoscout with Firefox. This Firefox has no plugins or ad blockers installed.

codders commented 4 years ago

Sounds like they just have some new aggressive filtering in place then.

There are a couple of Python projects that offer to solve captchas - that would be an option, but it's also not free (though it is very cheap - 1 EUR per 1000 fetches).

Another option would be to implement the immoscout API. That would mean every flathunter user has to register with them.


Bolland commented 4 years ago

The same happens with my setup running on a Raspberry Pi. It also crashes my script every time 🤔

...
</div>\n<div class="main__part1">\n    Du bist ein Mensch aus Fleisch und Blut? Entschuldige bitte, dann hat unser System dich f\xc3\xa4lschlicherweise als Roboter identifiziert. Um unsere Services weiterhin zu nutzen, l\xc3\xb6se bitte diesen kurzen Test.\n</div>\n\n    <div class="main__captcha">\n        \n        <div class="container">\n            \n                    <script>\n                    showBlockPage()\n                    document.writeln(window.captchaDescription || "<p>After completing the CAPTCHA below, you will immediately regain access to the site again.</p>");\n                    </script>\n                <div class="g-recaptcha" data-sitekey="6LeaILIZAAAAALTgLZV1AQXPc2dAsLItNYJ8jVvB" data-callback="solvedCaptcha"></div>\n        </div>\n    </div>\n\n<div class="main__part2">\n\n    <div class="main_part2_header1">Warum f\xc3\xbchren wir diese Sicherheitsma\xc3\x9fnahme durch?</div>\n<div class="main_part2_text1">Mit der Captcha-Methode stellen wir fest, dass du kein Roboter oder eine sch\xc3\xa4dliche Spam-Software bist.  Damit sch\xc3\xbctzen wir unsere Webseite und die Daten unserer Nutzerinnen und Nutzer vor betr\xc3\xbcgerischen Aktivit\xc3\xa4ten.</div>\n\n    <div class="main_part2_header2">Warum haben wir deine Suchanfragen blockiert?</div>\n    <div class="main_part2_text2">Es kann verschiedene Gr\xc3\xbcnde haben, warum wir dich f\xc3\xa4lschlicherweise als Roboter identifiziert haben. 
M\xc3\xb6glicherweise</div>\n\n</div>\n<div class="main__list">\n<ul>\n    <li>hast du die Cookies f\xc3\xbcr unsere Seite deaktiviert.</li>\n    <li>hast du die Ausf\xc3\xbchrung von JavaScript deaktiviert.</li>\n    <li>nutzt du ein Browser-Plugin eines Drittanbieters, beispielsweise einen Ad-Blocker.</li>\n<li>hast du in kurzer Zeit mehr Anfragen an unser System gestellt, als es \xc3\xbcblicherweise der Fall ist.</li>\n</ul>\n</div>\n\n\n</div>\n\n</div>\n\n<div class="footer">\n    <div class="footer-content">\n\n\n        <div>\n            <a href="https://www.immobilienscout24.de/unternehmen.html">\xc3\x9cber uns</a> |\n            <a href="https://www.immobilienscout24.de/kontakt.html">Kontakt & Hilfe</a> |\n            <a href="https://www.immobilienscout24.de/unternehmen/karriere/">Karriere</a> |\n            <a href="https://www.immobilienscout24.de/sitemap.html">Sitemap</a> |\n            <a href="https://api.immobilienscout24.de">Developer</a> |\n            <a href="https://www.immobilienscout24.de/unternehmen/mediendienst.html">Presseservice</a> |\n            <a href="https://www.immobilienscout24.de/ratgeber/newsletter.html">Newsletter abonnieren</a> |\n            <a href="https://www.immobilienscout24.de/impressum.html">Impressum</a> |\n            <a href="https://www.immobilienscout24.de/agb.html">AGB\'s & Rechtliche Hinweise</a> |\n            <a href="https://www.immobilienscout24.de/agb/verbraucherinformationen.html">Verbraucherinformationen</a> |\n            <a href="https://www.immobilienscout24.de/agb/datenschutz.html">Datenschutz</a> |\n            <a href="https://www.immobilienscout24.de/lp/Geodatenkodex.html">Datenschutz Kodex f\xc3\xbcr Geodatendienste</a> |\n            <a href="https://sicherheit.immobilienscout24.de">Sicherheit</a>\n        </div>\n        <div>\n            <!--<a href="">Immobiliensuche</a> | -->\n            <a href="https://www.scout24media.com/">Werbung</a> |\n            <a 
href="https://blog.immobilienscout24.de">Blog</a>\n            <!--|\n            <a href="">Nachbarschaft</a> |\n            <a href="">Gratis! E-Mail-Adresse @t-online.de</a>-->\n        </div>\n        <div>\n            <a href="https://www.immobilienscout24.de/">www.ImmobilienScout24.de</a>\n        </div>\n        <div class="legend">\n            \xc2\xa9 Copyright 1999 - 2020 Immobilien Scout GmbH\n        </div>\n    </div>\n\n</div>\n\n</body>\n</html>\n'
Traceback (most recent call last):
  File "flathunter.py", line 85, in <module>
    main()
  File "flathunter.py", line 81, in main
    launch_flat_hunt(config)
  File "flathunter.py", line 41, in launch_flat_hunt
    hunter.hunt_flats()
  File "/home/pi/Development/flathunter/flathunter/hunter.py", line 40, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
  File "/home/pi/Development/flathunter/flathunter/hunter.py", line 21, in crawl_for_exposes
    for searcher in self.config.searchers()
  File "/home/pi/Development/flathunter/flathunter/hunter.py", line 22, in <listcomp>
    for url in self.config.get('urls', list()) ])
  File "/home/pi/Development/flathunter/flathunter/abstract_crawler.py", line 12, in crawl
    return self.get_results(url, max_pages)
  File "/home/pi/Development/flathunter/flathunter/crawl_immobilienscout.py", line 41, in get_results
    while len(entries) < min(no_of_results, self.RESULT_LIMIT) and (max_pages is None or page_no < max_pages):
UnboundLocalError: local variable 'no_of_results' referenced before assignment

I can still access Immoscout without problems via chromium on the Pi though... 🤔
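
The `UnboundLocalError` at the end of the traceback suggests that `no_of_results` is only assigned when the result counter is found in the page, which is not the case for the captcha page. One way to avoid the crash is to make the parsing step always return a value. The sketch below is a guess at the shape of such a fix, not the project's code: `parse_result_count` and the `data-result-count` attribute are made up for illustration.

```python
import re

# Hypothetical pattern for the result counter; the real page markup may differ.
RESULT_COUNT_RE = re.compile(r'data-result-count="(\d+)"')


def parse_result_count(html):
    """Return the number of results announced in the page, or 0 if no counter is found."""
    match = RESULT_COUNT_RE.search(html)
    # Always return an int so the caller never works with an unassigned variable
    return int(match.group(1)) if match else 0
```

With a default of 0, a `while len(entries) < min(no_of_results, ...)` loop like the one in the traceback would simply not run when the block page is returned.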

bauer-jan commented 4 years ago

I haven't had a deeper look into the implementation of the crawler, but maybe Selenium would help with the bot detection. This is really sad; I was so excited when I came across this tool and wanted to use it for my personal flat hunt ;)

namnoops commented 4 years ago

Same experience here. My first thought was that they block an IP that has made too many requests, but I can access ImmoScout as usual with a browser. I suppose it has to do with the request headers or the lack of cookie and JavaScript support, as was mentioned above.

pcace commented 4 years ago

Hmm... so sad :/ same problem here...

Cheers

choeffer commented 4 years ago

I have tried to use http://html.python-requests.org/ and https://selenium-python.readthedocs.io/ . But I am still getting the Google captcha thingy on immoscout24.

At least it is fairly easy to replace the way the HTML content is fetched. After digging through the code, I was able to replace the Python requests package with the libraries mentioned above by changing only

    def get_soup_from_url(self, url):
        """Creates a Soup object from the HTML at the provided URL"""
        resp = requests.get(url)
        if resp.status_code != 200:
            self.__log__.error("Got response (%i): %s", resp.status_code, resp.content)
        return BeautifulSoup(resp.content, 'html.parser')

from https://github.com/flathunters/flathunter/blob/main/flathunter/abstract_crawler.py

for selenium with Chrome

from selenium import webdriver

...

    def get_soup_from_url(self, url):
        """Creates a Soup object from the HTML at the provided URL"""
        driver = webdriver.Chrome()
        try:
            driver.get(url)
            # page_source is a property (not a method) holding the rendered HTML as a string
            html = driver.page_source
        finally:
            # Close the browser even if loading the page fails
            driver.quit()
        # Selenium does not expose the HTTP status code, so there is no status check here
        return BeautifulSoup(html, 'html.parser')

for requests_html

from requests_html import HTMLSession

...

    def get_soup_from_url(self, url):
        """Creates a Soup object from the HTML at the provided URL"""
        session = HTMLSession()
        resp = session.get(url)
        if resp.status_code != 200:
            self.__log__.error("Got response (%i): %s", resp.status_code, resp.content)
        return BeautifulSoup(resp.content, 'html.parser')

With both changes, at least eBay Kleinanzeigen is still working fine.

choeffer commented 4 years ago

With the help of https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra-plugin-stealth and the code

// puppeteer-extra is a drop-in replacement for puppeteer,
// it augments the installed puppeteer with plugin functionality
const puppeteer = require('puppeteer-extra')
const fs = require("fs");

// add stealth plugin and use defaults (all evasion techniques)
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())

// puppeteer usage as normal
puppeteer.launch({ headless: false }).then(async browser => {
  console.log('Running tests..')
  const page = await browser.newPage()
  await page.goto('https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/koeln/wohnung-mieten?sorting=2')
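  // page.waitFor() was deprecated and later removed from Puppeteer; a plain timer works on any version
  await new Promise(resolve => setTimeout(resolve, 5000))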
  const html = await page.content();
  fs.writeFileSync("index.html", html);
  // await page.screenshot({ path: 'testresult.png', fullPage: true })
  await browser.close()
  console.log(`All done, check index.html. ✨`)
})

I was able to bypass the bot protection. Right now this is more of a proof of concept: the website loads fine but keeps loading until only an ad is shown as the final content. Still, this could be a starting point for bypassing the immoscout24 bot protection.

lomoien commented 4 years ago

Too bad, I'm having the same issue and can't run flathunter on ImmoScout.

mordax7 commented 4 years ago

Just merged a fix; please use the latest code from the main branch.

Please let me know if it works now.

lomoien commented 4 years ago

> Just merged a fix; please use the latest code from the main branch.
>
> Please let me know if it works now.

Thank you! Seems to run fine now. Do you know how I can check whether the program loops after 5 minutes? At the moment nothing happens for me after the loop time configured in the config file has elapsed.

choeffer commented 4 years ago

It works now for me. Thanks for the patch.

mordax7 commented 4 years ago

> > Just merged a fix; please use the latest code from the main branch. Please let me know if it works now.
>
> Thank you! Seems to run fine now. Do you know how I can check whether the program loops after 5 minutes? At the moment nothing happens for me after the loop time configured in the config file has elapsed.

Set the logs to verbose and check the output. I guess this is related to this issue: https://github.com/flathunters/flathunter/issues/50. Let's move the discussion there.

> It works now for me. Thanks for the patch.

OK, closing the ticket.
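
On the looping question above: debug-level log messages at the start and end of every iteration make it easy to see whether a loop is still running. The following is a generic sketch using Python's standard `logging` module. It does not reflect flathunter's actual loop or configuration keys; `run_loop`, its parameters, and the logger name are hypothetical.

```python
import logging
import time

logger = logging.getLogger("loop_demo")  # hypothetical logger name


def run_loop(task, loop_period_seconds, max_iterations=None, sleep=time.sleep):
    """Call task() repeatedly, logging each iteration; return the number of iterations run."""
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        iteration += 1
        logger.debug("Starting iteration %d", iteration)
        task()
        logger.debug("Iteration %d done, sleeping %d seconds", iteration, loop_period_seconds)
        sleep(loop_period_seconds)
    return iteration


if __name__ == "__main__":
    # Verbose output: DEBUG shows every loop iteration with a timestamp
    logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(levelname)s %(message)s")
    run_loop(lambda: None, loop_period_seconds=1, max_iterations=2)
```

If no "Starting iteration" line appears after the configured period, the loop is either not running or blocked inside the task.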

choeffer commented 4 years ago

This does not seem to be solved. It worked properly a few times, but now I can see new offers on immoscout24 via Firefox which are not listed by flathunter. Maybe the response status is still 200, so everything appears to work, but I do not think the requested content is actually delivered.

choeffer commented 4 years ago

A print() of the HTML content reveals that the output is the same as the file.log from the first post.
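
Since the status code alone does not reveal the block page, a check on the content itself could flag it. The sketch below looks for two strings that appear in the block page quoted earlier in this thread (`class="g-recaptcha"` and `showBlockPage()`). The helper name `looks_like_block_page` is hypothetical, and the markers are an assumption that only holds as long as Immoscout keeps serving that page.

```python
# Markers taken from the block page HTML quoted earlier in this thread
BLOCK_PAGE_MARKERS = ('class="g-recaptcha"', "showBlockPage()")


def looks_like_block_page(html):
    """Return True if the HTML (str or bytes) contains a marker of the captcha block page."""
    if isinstance(html, bytes):
        # resp.content is bytes; decode leniently so odd characters do not raise
        html = html.decode("utf-8", errors="replace")
    return any(marker in html for marker in BLOCK_PAGE_MARKERS)
```

A crawler could call this on the response body before parsing and log a warning instead of silently returning no results.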

mordax7 commented 4 years ago

Yes, they just rolled out a new version. It seems like they added cookies to their headers. It's a separate issue, so I created a follow-up: https://github.com/flathunters/flathunter/issues/51