flathunters / flathunter

A bot to help people with their rental real-estate search. 🏠🤖
GNU Affero General Public License v3.0

selenium.common.exceptions.TimeoutException ImmoScout24 #272

Open flyingdodo11 opened 1 year ago

flyingdodo11 commented 1 year ago

Hi guys, I'm trying to set up flathunter for ImmoScout24. I already tried it with ebay-kleinanzeigen and immowelt with success.

I already checked all other issues regarding this problem, like Issue214; none of the solutions worked for me.

I also tried it on macOS and Ubuntu 20.04, with both the normal version and the Docker version. I always get the same errors.

[2022/12/08 10:51:24|_common.py              |ERROR   ]: Giving up get_soup_from_url(...) after 3 tries (selenium.common.exceptions.TimeoutException: Message: 
Stacktrace:
#0 0x56378df8f2a3 <unknown>
#1 0x56378dd4df77 <unknown>
#2 0x56378dd8a80c <unknown>
#3 0x56378dd8aa71 <unknown>
#4 0x56378ddc4734 <unknown>
#5 0x56378ddaab5d <unknown>
#6 0x56378ddc247c <unknown>
#7 0x56378ddaa903 <unknown>
#8 0x56378dd7dece <unknown>
#9 0x56378dd7efde <unknown>
#10 0x56378dfdf63e <unknown>
#11 0x56378dfe2b79 <unknown>
#12 0x56378dfc589e <unknown>
#13 0x56378dfe3a83 <unknown>
#14 0x56378dfb8505 <unknown>
#15 0x56378e004ca8 <unknown>
#16 0x56378e004e36 <unknown>
#17 0x56378e020333 <unknown>
#18 0x7fdf58a1bea7 start_thread)
codders commented 1 year ago

Hi @flyingdodo11 ,

How much RAM do you have available for your docker containers? I think the docker daemon is by default not very generous on Mac. You should have at least 1GB of memory to run the Immoscout crawler.
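For anyone who wants to check this from inside the container, here is a minimal sketch that reads the cgroup memory limit and compares it with the 1GB mentioned above. The helper names are made up for illustration and are not part of flathunter. It assumes the standard cgroup v1/v2 file locations, and it cannot see the overall memory setting of the Docker Desktop VM on Mac.

```python
from pathlib import Path

# Standard locations of the container memory limit (assumed, not flathunter-specific).
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1
)
MIN_BYTES = 1024 ** 3  # the 1GB suggested for the Immoscout crawler


def parse_limit(raw):
    """Return the limit in bytes, or None when the value means 'unlimited'."""
    raw = raw.strip()
    if raw == "max":  # cgroup v2 spelling for "no limit"
        return None
    return int(raw)


def read_memory_limit(files=CGROUP_LIMIT_FILES):
    """Return the first readable limit, or None if no limit file can be read."""
    for path in files:
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue
    return None


def has_enough_memory(limit, minimum=MIN_BYTES):
    """No limit (None) counts as enough; otherwise compare with the minimum."""
    return limit is None or limit >= minimum


if __name__ == "__main__":
    limit = read_memory_limit()
    print("limit:", "none found" if limit is None else f"{limit / 1024 ** 2:.0f} MiB")
    print("enough for the Immoscout crawler:", has_enough_memory(limit))
```

Run it inside the container, for example with `docker exec`; a value below 1024 MiB would suggest raising the limit.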

marcelmindemann commented 1 year ago

I am running Flathunter in docker on Linux with no resource limits, and I am getting the same issue. More logging output:

flathunter  |   File "/usr/src/app/flathunter/hunter.py", line 54, in hunt_flats
flathunter  |     for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
flathunter  |   File "/usr/src/app/flathunter/hunter.py", line 34, in crawl_for_exposes
flathunter  |     for searcher in self.config.searchers()
flathunter  |   File "/usr/src/app/flathunter/hunter.py", line 35, in <listcomp>
flathunter  |     for url in self.config.target_urls()])
flathunter  |   File "/usr/src/app/flathunter/hunter.py", line 25, in try_crawl
flathunter  |     return searcher.crawl(url, max_pages)
flathunter  |   File "/usr/src/app/flathunter/abstract_crawler.py", line 142, in crawl
flathunter  |     return self.get_results(url, max_pages)
flathunter  |   File "/usr/src/app/flathunter/crawl_immobilienscout.py", line 57, in get_results
flathunter  |     soup = self.get_page(search_url, self.driver, page_no)
flathunter  |   File "/usr/src/app/flathunter/crawl_immobilienscout.py", line 145, in get_page
flathunter  |     afterlogin_string=self.afterlogin_string
flathunter  |   File "/usr/local/lib/python3.7/site-packages/backoff/_sync.py", line 105, in retry
flathunter  |     ret = target(*args, **kwargs)
flathunter  |   File "/usr/src/app/flathunter/abstract_crawler.py", line 76, in get_soup_from_url
flathunter  |     self.resolve_recaptcha(driver, checkbox, afterlogin_string)
flathunter  |   File "/usr/local/lib/python3.7/site-packages/backoff/_sync.py", line 105, in retry
flathunter  |     ret = target(*args, **kwargs)
flathunter  |   File "/usr/src/app/flathunter/abstract_crawler.py", line 190, in resolve_recaptcha
flathunter  |     iframe_present = self._wait_for_iframe(driver)
flathunter  |   File "/usr/src/app/flathunter/abstract_crawler.py", line 248, in _wait_for_iframe
flathunter  |     (By.CSS_SELECTOR, "iframe[src^='https://www.google.com/recaptcha/api2/anchor?']")))
flathunter  |   File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 95, in until
flathunter  |     raise TimeoutException(message, screen, stacktrace)
flathunter  | selenium.common.exceptions.TimeoutException: Message:
flathunter  | Stacktrace:
flathunter  | #0 0x55d9051522a3 <unknown>
flathunter  | #1 0x55d904f10f77 <unknown>
flathunter  | #2 0x55d904f4d80c <unknown>
flathunter  | #3 0x55d904f4da71 <unknown>
flathunter  | #4 0x55d904f87734 <unknown>
flathunter  | #5 0x55d904f6db5d <unknown>
flathunter  | #6 0x55d904f8547c <unknown>
flathunter  | #7 0x55d904f6d903 <unknown>
flathunter  | #8 0x55d904f40ece <unknown>
flathunter  | #9 0x55d904f41fde <unknown>
flathunter  | #10 0x55d9051a263e <unknown>
flathunter  | #11 0x55d9051a5b79 <unknown>
flathunter  | #12 0x55d90518889e <unknown>
flathunter  | #13 0x55d9051a6a83 <unknown>
flathunter  | #14 0x55d90517b505 <unknown>
flathunter  | #15 0x55d9051c7ca8 <unknown>
flathunter  | #16 0x55d9051c7e36 <unknown>
flathunter  | #17 0x55d9051e3333 <unknown>
flathunter  | #18 0x7fd708a11ea7 start_thread
flathunter  |
flyingdodo11 commented 1 year ago

@codders Already tried that, doesn't work.

codders commented 1 year ago

Okay. I've opened PR #273 - you can try it and see if that fixes your issue. Unfortunately it's not something I can reproduce locally, so it's a bit of guesswork. Let me know!

vitalik239 commented 1 year ago

@codders Unfortunately that didn't help. Immobilienscout crawling won't work even with the increased timeout. The IS24 variable is not found with the --headless driver argument, while removing it solves the problem only for the first loop.

flyingdodo11 commented 1 year ago

Doesn't work for me either.

ivanarkhipov commented 1 year ago

Same error for me. Though it was running fine earlier today

codders commented 1 year ago

I had a look at this again today. What I can see is that even when I run from the command line (without Docker), I get the timeout / "cannot find IS24 variable" message. Debugging further, I can see that in these cases the bot detection has kicked in:

[Screenshot: 2022-12-13-162032_1272x1515_scrot]

If I disable the '--headless' argument (or unset FLATHUNTER_HEADLESS_BROWSER), the immoscout crawl works as normal. Somehow, the version I have running in the cloud (which uses the headless argument and docker) is still succeeding.
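To make the toggle concrete, here is a rough sketch of how a driver argument list could be derived from the environment. The helper names are hypothetical, and the rule "any non-empty value of FLATHUNTER_HEADLESS_BROWSER means headless" is an assumption, not necessarily how flathunter reads the variable.

```python
import os


def headless_enabled(env=None):
    """Assumption: a set, non-empty FLATHUNTER_HEADLESS_BROWSER means headless."""
    env = os.environ if env is None else env
    return bool(env.get("FLATHUNTER_HEADLESS_BROWSER", "").strip())


def build_driver_arguments(base_arguments, env=None):
    """Return the Chrome arguments, with '--headless' present only when enabled."""
    arguments = [arg for arg in base_arguments if arg != "--headless"]
    if headless_enabled(env):
        arguments.append("--headless")
    return arguments


if __name__ == "__main__":
    # With the variable unset, this prints the list without '--headless'.
    print(build_driver_arguments(["--no-sandbox", "--headless"]))
```

The point of the sketch is only that the flag is dropped when the variable is unset, which is the configuration that gets past the bot detection here.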

The undetected_chromedriver package is supposed to make it impossible to detect the fact that we're driving the browser from a script, and that seemed to help us for a while, but I guess it's a cat and mouse game. If anyone has any hot tips on avoiding bot detection, those would be most welcome :)

ozeidan commented 1 year ago

I got this partially fixed: undetected_chromedriver provides a Docker image in which it is possible to run chromedriver without the --headless flag. The image creates a virtual display on which the Chrome window is rendered. I got it to work by basing the Dockerfile of this repository on the one from undetected-chromedriver. However, the browser crashes quite often; I'm still trying to fix that.
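For readers unfamiliar with virtual displays, here is a minimal sketch of the general idea using Xvfb. Whether the undetected-chromedriver image uses Xvfb internally is an assumption. The function names, display number and resolution are illustrative only.

```python
import os
import shutil
import subprocess


def xvfb_command(display=":99", width=1280, height=1024, depth=24):
    """Build the Xvfb command line for one virtual screen."""
    return ["Xvfb", display, "-screen", "0", f"{width}x{height}x{depth}"]


def start_virtual_display(display=":99"):
    """Start Xvfb if it is installed and point DISPLAY at it.

    Returns the Popen handle, or None when Xvfb is not available.
    """
    if shutil.which("Xvfb") is None:
        return None
    process = subprocess.Popen(xvfb_command(display))
    # A non-headless Chrome started after this renders into the virtual display.
    os.environ["DISPLAY"] = display
    return process
```

A browser launched without "--headless" after `start_virtual_display()` draws into the in-memory screen, so no physical monitor is needed.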

codders commented 1 year ago

That's exciting news - thanks for taking a look! Often when I've seen crashes it's been about memory usage, but I guess you've already tried that. If you make a draft PR I can also have a go at running it here and see what happens.

flyingdodo11 commented 1 year ago

Any updates on this?

codders commented 1 year ago

@flyingdodo11 I haven't heard anything. I don't know if this helps you, but if you're just searching in Berlin and you're okay with a pretty default setup, you can also just use the hosted version: https://flathunter.codders.io . That's running okay right now (and crawling immoscout still).

anamyk commented 1 year ago

I also ran into this issue. Any update or workaround would be great.

hruzgar commented 1 year ago

I tried this method, basing my Docker image on the undetected-chromedriver image like this:

FROM ultrafunk/undetected-chromedriver:latest

I also set the flags "--no-sandbox" and "--disable-setuid-sandbox". I didn't set the "--headless" flag (that's the whole point), but it didn't work. I still couldn't get past the bot detection. Then I thought that my IP address might be blacklisted, so I connected my container to a VPN (thanks to nordvpn-docker), but still no success.

First I get this message for a period of time:

[2023/01/28 12:13:22|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...

Then, in the end, it shows a long error message and stops.

codders commented 1 year ago

@hruzgar Can you copy the long error message?

infctr commented 1 year ago

I've also tried running a job on Google Cloud Run based on the ultrafunk/undetected-chromedriver image; however, the container stops immediately after executing:

running: /bin/sh -c python cloud_job.py
running keepUpScreen()
Container called exit(0)

What am I missing here?

hruzgar commented 1 year ago

this is the full lifecycle of the execution

haso:flathunter/ (main✗) $ sudo docker run --net=container:vpn --mount type=bind,source=/opt/flathunter/config.yaml,target=/config.yaml flathunter
running: python flathunt.py -c /config.yaml
running keepUpScreen()
[2023/01/28 14:14:18|config.py               |INFO    ]: Using config path /config.yaml
[2023/01/28 14:14:18|chrome_wrapper.py       |INFO    ]: Initializing Chrome WebDriver for crawler
...
[2023/01/28 14:14:19|patcher.py              |INFO    ]: patching driver executable /root/.local/share/undetected_chromedriver/753613c1953be3c0_chromedriver
[2023/01/28 14:14:32|abstract_crawler.py     |INFO    ]: Timeout waiting for iframe element - no captcha verification necessary?
[2023/01/28 14:14:32|crawl_immobilienscout.py|WARNING ]: Unable to find IS24 variable in window
[2023/01/28 14:14:32|crawl_immobilienscout.py|ERROR   ]: IS24 bot detection has identified our script as a bot - we've been blocked
[2023/01/28 14:14:34|imagetyperz_solver.py   |INFO    ]: Trying to solve geetest.
[2023/01/28 14:14:35|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:14:41|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:14:46|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:14:51|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:14:56|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:02|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:07|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:12|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:17|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:23|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:28|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:33|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:38|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:44|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:49|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:54|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:59|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:05|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:10|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:15|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:20|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:26|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:31|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:36|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:41|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:47|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:53|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:58|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:04|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:09|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:14|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:19|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:25|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:30|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:35|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:40|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:46|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:51|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:56|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:01|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:07|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:12|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:17|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:22|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:28|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:33|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:38|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:43|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:49|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:00|_common.py              |INFO    ]: Backing off resolve_geetest(...) for 1.0s (flathunter.captcha.captcha_solver.CaptchaUnsolvableError)
[2023/01/28 14:19:01|imagetyperz_solver.py   |INFO    ]: Trying to solve geetest.
[2023/01/28 14:19:01|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:06|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:12|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:17|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:22|imagetyperz_solver.py   |INFO    ]: Captcha is not ready yet, waiting...
Traceback (most recent call last):
  File "/usr/src/app/flathunt.py", line 109, in <module>
    main()
  File "/usr/src/app/flathunt.py", line 105, in main
    launch_flat_hunt(config, heartbeat)
  File "/usr/src/app/flathunt.py", line 29, in launch_flat_hunt
    hunter.hunt_flats()
  File "/usr/src/app/flathunter/hunter.py", line 54, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/hunter.py", line 33, in crawl_for_exposes
    return chain(*[try_crawl(searcher, url, max_pages)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/hunter.py", line 33, in <listcomp>
    return chain(*[try_crawl(searcher, url, max_pages)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/hunter.py", line 25, in try_crawl
    return searcher.crawl(url, max_pages)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/abstract_crawler.py", line 142, in crawl
    return self.get_results(url, max_pages)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/crawl_immobilienscout.py", line 57, in get_results
    soup = self.get_page(search_url, self.driver, page_no)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/crawl_immobilienscout.py", line 145, in get_page
    return self.get_soup_from_url(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/backoff/_sync.py", line 105, in retry
    ret = target(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/app/flathunter/abstract_crawler.py", line 77, in get_soup_from_url
    return BeautifulSoup(driver.page_source, 'html.parser')
                         ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 740, in __getattribute__
    return super().__getattribute__(item)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 541, in page_source
    return self.execute(Command.GET_PAGE_SOURCE)["value"]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 440, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 245, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
  (Session info: chrome=109.0.5414.119)
Stacktrace:
#0 0x5613f8e04303 <unknown>
#1 0x5613f8bd8bbd <unknown>
#2 0x5613f8bc3233 <unknown>
#3 0x5613f8bc1c77 <unknown>
#4 0x5613f8bc2408 <unknown>
#5 0x5613f8bcf67f <unknown>
#6 0x5613f8bd02d2 <unknown>
#7 0x5613f8be0fd0 <unknown>
#8 0x5613f8be534b <unknown>
#9 0x5613f8bc29c5 <unknown>
#10 0x5613f8be0bd2 <unknown>
#11 0x5613f8c4d7a0 <unknown>
#12 0x5613f8c35753 <unknown>
#13 0x5613f8c08a14 <unknown>
#14 0x5613f8c09b7e <unknown>
#15 0x5613f8e5332e <unknown>
#16 0x5613f8e56c0e <unknown>
#17 0x5613f8e39610 <unknown>
#18 0x5613f8e57c23 <unknown>
#19 0x5613f8e2b545 <unknown>
#20 0x5613f8e786a8 <unknown>
#21 0x5613f8e78836 <unknown>
#22 0x5613f8e93d13 <unknown>
#23 0x7fc0d591cea7 start_thread

[2023/01/28 14:19:29|__init__.py             |INFO    ]: ensuring close
hruzgar commented 1 year ago

@infctr You need to set the "--no-sandbox" and "--disable-setuid-sandbox" flags in your config.yaml file. Also, don't set the "--headless" flag.

infctr commented 1 year ago

@hruzgar Did imagetyperz work for you before with IS24? I had a similar "Captcha is not ready yet" error, so I had to switch to 2captcha.

hruzgar commented 1 year ago

Yeah, it was working (and is still working) on my main PC. But I want to run the bot on my server so I don't get a high energy bill (my PC is beefy). That's the reason I am trying to get it working inside Docker without any GUI. I could still check whether it works with 2captcha though. Worth a try for sure.

infctr commented 1 year ago

I've started the image with these driver flags, but unfortunately it didn't make a difference in the container:

 "--no-sandbox",
"--disable-gpu",
"--disable-setuid-sandbox",
hruzgar commented 1 year ago

I just tried running the bot locally on my PC again. The weird thing is that it works with the "--headless" argument for a certain amount of time before it fails again. But as soon as I comment out the "--headless" flag and run the bot again, it fires up a Chrome tab that says I am a robot, so I don't get access to the site.

codders commented 1 year ago

@infctr The cloud_job script is expected to run once and then quit. It is designed to be installed as a cron job running on a timer. The flathunt script is configurable either to run in a loop, or as a one-time job.
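As a rough illustration of the two modes, here is a sketch of a runner that either performs one hunt and returns (the cron-job style) or keeps looping with a pause in between. The function and parameter names are invented for this example and are not flathunter's actual configuration keys.

```python
import time


def run(hunt_once, loop=False, period_seconds=600, max_iterations=None, sleep=time.sleep):
    """Call hunt_once() once, or repeatedly when loop is True.

    max_iterations is only there so a loop can be bounded (e.g. in tests).
    Returns the number of hunts performed.
    """
    iterations = 0
    while True:
        hunt_once()
        iterations += 1
        finished = max_iterations is not None and iterations >= max_iterations
        if not loop or finished:
            return iterations
        sleep(period_seconds)  # wait before the next crawl


if __name__ == "__main__":
    # One-shot mode: the process exits right after the hunt, with exit code 0.
    run(lambda: print("hunting flats once"))
```

In one-shot mode the container exiting with `exit(0)` straight after the run is the expected outcome, not a failure.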

codders commented 1 year ago

@hruzgar CaptchaUnsolvableError sometimes comes up if it just can't solve the captcha, but it should retry and that shouldn't be fatal. Usually a message like 'session deleted because of page crash' comes after the container runs out of memory - are you running with a memory limit on your docker container?
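To show why a single CaptchaUnsolvableError should not be fatal, here is a simplified retry loop with exponential backoff. It is a plain standard-library sketch, not the `backoff` package that appears in the traceback, and the exception class is a local stand-in for flathunter's own.

```python
import time


class CaptchaUnsolvableError(Exception):
    """Local stand-in for flathunter.captcha.captcha_solver.CaptchaUnsolvableError."""


def retry_with_backoff(solve, max_tries=3, base_delay=1.0, sleep=time.sleep):
    """Call solve() until it succeeds or max_tries attempts have failed.

    Waits base_delay, then 2 * base_delay, ... between attempts and re-raises
    the error only after the final attempt.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return solve()
        except CaptchaUnsolvableError:
            if attempt == max_tries:
                raise  # only now does the error become fatal
            sleep(base_delay * 2 ** (attempt - 1))
```

Under this pattern the "Backing off resolve_geetest(...) for 1.0s" line in the log is just a retry being scheduled; the run above ended because of the later page crash, not the captcha error.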