Open flyingdodo11 opened 1 year ago
Hi @flyingdodo11 ,
How much RAM do you have available for your docker containers? I think the docker daemon is by default not very generous on Mac. You should have at least 1GB of memory to run the Immoscout crawler.
I am running Flathunter in docker on Linux with no resource limits, and I am getting the same issue. More logging output:
flathunter | File "/usr/src/app/flathunter/hunter.py", line 54, in hunt_flats
flathunter | for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
flathunter | File "/usr/src/app/flathunter/hunter.py", line 34, in crawl_for_exposes
flathunter | for searcher in self.config.searchers()
flathunter | File "/usr/src/app/flathunter/hunter.py", line 35, in <listcomp>
flathunter | for url in self.config.target_urls()])
flathunter | File "/usr/src/app/flathunter/hunter.py", line 25, in try_crawl
flathunter | return searcher.crawl(url, max_pages)
flathunter | File "/usr/src/app/flathunter/abstract_crawler.py", line 142, in crawl
flathunter | return self.get_results(url, max_pages)
flathunter | File "/usr/src/app/flathunter/crawl_immobilienscout.py", line 57, in get_results
flathunter | soup = self.get_page(search_url, self.driver, page_no)
flathunter | File "/usr/src/app/flathunter/crawl_immobilienscout.py", line 145, in get_page
flathunter | afterlogin_string=self.afterlogin_string
flathunter | File "/usr/local/lib/python3.7/site-packages/backoff/_sync.py", line 105, in retry
flathunter | ret = target(*args, **kwargs)
flathunter | File "/usr/src/app/flathunter/abstract_crawler.py", line 76, in get_soup_from_url
flathunter | self.resolve_recaptcha(driver, checkbox, afterlogin_string)
flathunter | File "/usr/local/lib/python3.7/site-packages/backoff/_sync.py", line 105, in retry
flathunter | ret = target(*args, **kwargs)
flathunter | File "/usr/src/app/flathunter/abstract_crawler.py", line 190, in resolve_recaptcha
flathunter | iframe_present = self._wait_for_iframe(driver)
flathunter | File "/usr/src/app/flathunter/abstract_crawler.py", line 248, in _wait_for_iframe
flathunter | (By.CSS_SELECTOR, "iframe[src^='https://www.google.com/recaptcha/api2/anchor?']")))
flathunter | File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 95, in until
flathunter | raise TimeoutException(message, screen, stacktrace)
flathunter | selenium.common.exceptions.TimeoutException: Message:
flathunter | Stacktrace:
flathunter | #0 0x55d9051522a3 <unknown>
flathunter | #1 0x55d904f10f77 <unknown>
flathunter | #2 0x55d904f4d80c <unknown>
flathunter | #3 0x55d904f4da71 <unknown>
flathunter | #4 0x55d904f87734 <unknown>
flathunter | #5 0x55d904f6db5d <unknown>
flathunter | #6 0x55d904f8547c <unknown>
flathunter | #7 0x55d904f6d903 <unknown>
flathunter | #8 0x55d904f40ece <unknown>
flathunter | #9 0x55d904f41fde <unknown>
flathunter | #10 0x55d9051a263e <unknown>
flathunter | #11 0x55d9051a5b79 <unknown>
flathunter | #12 0x55d90518889e <unknown>
flathunter | #13 0x55d9051a6a83 <unknown>
flathunter | #14 0x55d90517b505 <unknown>
flathunter | #15 0x55d9051c7ca8 <unknown>
flathunter | #16 0x55d9051c7e36 <unknown>
flathunter | #17 0x55d9051e3333 <unknown>
flathunter | #18 0x7fd708a11ea7 start_thread
flathunter |
@codders Already tried that, it doesn't work.
Okay. I've made PR #273 - you can try it and see if that fixes your issue. Unfortunately it's not something I can reproduce locally, so it's a bit of guesswork. Let me know!
@codders Unfortunately that didn't help. Immobilienscout crawling won't work even with an increased timeout. The IS24 variable is not found with the --headless driver argument, while removing it solves the problem only for the first loop.
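A minimal sketch of how the missing IS24 variable could be checked from a Selenium session. `execute_script` is the standard Selenium WebDriver method; the helper name `has_is24_variable` and the `FakeDriver` stand-in are hypothetical, and this is not necessarily how Flathunter performs the check.

```python
def has_is24_variable(driver) -> bool:
    """Return True if the loaded page exposes a global `IS24` object.

    `driver` is any object with Selenium's `execute_script(script)` method.
    """
    return bool(driver.execute_script(
        "return typeof window.IS24 !== 'undefined';"))


class FakeDriver:
    """Hypothetical stand-in for a WebDriver so the sketch runs without Chrome."""

    def __init__(self, is24_present: bool):
        self.is24_present = is24_present

    def execute_script(self, script: str):
        # A real driver would evaluate the script in the browser.
        return self.is24_present


if __name__ == "__main__":
    print(has_is24_variable(FakeDriver(True)))
    print(has_is24_variable(FakeDriver(False)))
```

With a real driver, a False result right after loading a search page would be consistent with the bot-detection page having been served instead of the results.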
Doesn't work for me either.
Same error for me. Though it was running fine earlier today
I had a look at this again today. What I can see is that I also get the timeout / cannot find IS24 variable message if I run from the command line (without docker). Debugging further, I can see that in these cases the bot detection has kicked in.
If I disable the '--headless' argument (or unset FLATHUNTER_HEADLESS_BROWSER), the immoscout crawl works as normal. Somehow, the version I have running in the cloud (which uses the headless argument and docker) is still succeeding.
The undetected_chromedriver package is supposed to make it impossible to detect the fact that we're driving the browser from a script, and that seemed to help us for a while, but I guess it's a cat and mouse game. If anyone has any hot tips on avoiding bot detection, those would be most welcome :)
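A rough sketch of toggling the headless flag from the environment, so the same setup can be tried with and without it. The function name `build_driver_arguments` is hypothetical, and treating any non-empty value of FLATHUNTER_HEADLESS_BROWSER as "on" is an assumption for illustration, not necessarily what Flathunter does.

```python
import os


def build_driver_arguments(environ=None, base_arguments=()):
    """Return a list of Chrome arguments, adding --headless only on request.

    Assumption: any non-empty FLATHUNTER_HEADLESS_BROWSER value enables
    headless mode; an unset or empty variable leaves it off.
    """
    environ = os.environ if environ is None else environ
    arguments = list(base_arguments)
    if environ.get("FLATHUNTER_HEADLESS_BROWSER"):
        arguments.append("--headless")
    return arguments


if __name__ == "__main__":
    # Uses the real environment of the current process.
    print(build_driver_arguments())
```

Passing the environment in as a plain dict makes it easy to compare both modes in a test without changing the shell.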
I got this partially fixed: undetected_chromedriver provides a Docker image in which it is possible to run chromedriver without the --headless flag. The image creates a virtual display on which the chrome window is rendered. I got it to work by basing the Dockerfile of this repository on the one of undetected-chromedriver. But the browser crashes quite often, I'm still looking to fix that.
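For readers unfamiliar with virtual displays, here is a small sketch of the general idea using Xvfb. That the undetected-chromedriver image works this way is an assumption; the helper names, display number and screen size are illustrative only.

```python
import os


def xvfb_command(display=99, width=1920, height=1080, depth=24):
    """Return the argv for starting an Xvfb server on display `:display`."""
    return ["Xvfb", f":{display}", "-screen", "0", f"{width}x{height}x{depth}"]


def env_with_display(display=99, base_env=None):
    """Copy the environment and point DISPLAY at the virtual display."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = f":{display}"
    return env


if __name__ == "__main__":
    print(" ".join(xvfb_command()))
    print(env_with_display(base_env={})["DISPLAY"])
```

One would start the server with something like `subprocess.Popen(xvfb_command())` and then launch the browser process with `env_with_display()`, so that a non-headless Chrome has a screen to render on.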
That's exciting news - thanks for taking a look! Often when I've seen crashes it's been about memory usage, but I guess you've already tried that. If you make a draft PR I can also have a go at running it here and see what happens.
Any updates on this?
@flyingdodo11 I haven't heard anything. I don't know if this helps you, but if you're just searching in Berlin and you're okay with a pretty default setup, you can also just use the hosted version: https://flathunter.codders.io . That's running okay right now (and crawling immoscout still).
I also ran into this issue. Any update or workaround would be great.
I tried this method by basing my docker image on the undetected-chromedriver one, like this:
FROM ultrafunk/undetected-chromedriver:latest
I also set the flags "--no-sandbox" and "--disable-setuid-sandbox". I didn't set the "--headless" flag (that's the whole point), but it didn't work. I still couldn't get past the bot detection. Then I thought that my IP address might be blacklisted, so I connected my container to a VPN (thanks to nordvpn-docker), but still no success.
First I get this message for a period of time:
[2023/01/28 12:13:22|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
Then, in the end, it shows a long error message and stops.
@hruzgar Can you copy the long error message?
I've also tried running a job on Google Cloud Run based on the ultrafunk/undetected-chromedriver image, however the container stops immediately after executing
running: /bin/sh -c python cloud_job.py
running keepUpScreen()
Container called exit(0)
What am I missing here?
This is the full lifecycle of the execution:
haso:flathunter/ (main✗) $ sudo docker run --net=container:vpn --mount type=bind,source=/opt/flathunter/config.yaml,target=/config.yaml flathunter
running: python flathunt.py -c /config.yaml
running keepUpScreen()
[2023/01/28 14:14:18|config.py |INFO ]: Using config path /config.yaml
[2023/01/28 14:14:18|chrome_wrapper.py |INFO ]: Initializing Chrome WebDriver for crawler
...
[2023/01/28 14:14:19|patcher.py |INFO ]: patching driver executable /root/.local/share/undetected_chromedriver/753613c1953be3c0_chromedriver
[2023/01/28 14:14:32|abstract_crawler.py |INFO ]: Timeout waiting for iframe element - no captcha verification necessary?
[2023/01/28 14:14:32|crawl_immobilienscout.py|WARNING ]: Unable to find IS24 variable in window
[2023/01/28 14:14:32|crawl_immobilienscout.py|ERROR ]: IS24 bot detection has identified our script as a bot - we've been blocked
[2023/01/28 14:14:34|imagetyperz_solver.py |INFO ]: Trying to solve geetest.
[2023/01/28 14:14:35|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:14:41|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:14:46|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:14:51|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:14:56|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:02|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:07|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:12|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:17|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:23|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:28|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:33|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:38|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:44|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:49|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:54|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:15:59|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:05|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:10|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:15|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:20|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:26|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:31|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:36|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:41|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:47|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:53|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:16:58|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:04|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:09|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:14|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:19|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:25|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:30|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:35|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:40|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:46|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:51|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:17:56|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:01|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:07|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:12|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:17|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:22|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:28|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:33|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:38|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:43|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:18:49|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:00|_common.py |INFO ]: Backing off resolve_geetest(...) for 1.0s (flathunter.captcha.captcha_solver.CaptchaUnsolvableError)
[2023/01/28 14:19:01|imagetyperz_solver.py |INFO ]: Trying to solve geetest.
[2023/01/28 14:19:01|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:06|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:12|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:17|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2023/01/28 14:19:22|imagetyperz_solver.py |INFO ]: Captcha is not ready yet, waiting...
Traceback (most recent call last):
File "/usr/src/app/flathunt.py", line 109, in <module>
main()
File "/usr/src/app/flathunt.py", line 105, in main
launch_flat_hunt(config, heartbeat)
File "/usr/src/app/flathunt.py", line 29, in launch_flat_hunt
hunter.hunt_flats()
File "/usr/src/app/flathunter/hunter.py", line 54, in hunt_flats
for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/app/flathunter/hunter.py", line 33, in crawl_for_exposes
return chain(*[try_crawl(searcher, url, max_pages)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/app/flathunter/hunter.py", line 33, in <listcomp>
return chain(*[try_crawl(searcher, url, max_pages)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/app/flathunter/hunter.py", line 25, in try_crawl
return searcher.crawl(url, max_pages)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/app/flathunter/abstract_crawler.py", line 142, in crawl
return self.get_results(url, max_pages)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/app/flathunter/crawl_immobilienscout.py", line 57, in get_results
soup = self.get_page(search_url, self.driver, page_no)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/app/flathunter/crawl_immobilienscout.py", line 145, in get_page
return self.get_soup_from_url(
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/backoff/_sync.py", line 105, in retry
ret = target(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/src/app/flathunter/abstract_crawler.py", line 77, in get_soup_from_url
return BeautifulSoup(driver.page_source, 'html.parser')
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/undetected_chromedriver/__init__.py", line 740, in __getattribute__
return super().__getattribute__(item)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 541, in page_source
return self.execute(Command.GET_PAGE_SOURCE)["value"]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 440, in execute
self.error_handler.check_response(response)
File "/usr/local/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 245, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
from unknown error: cannot determine loading status
from tab crashed
(Session info: chrome=109.0.5414.119)
Stacktrace:
#0 0x5613f8e04303 <unknown>
#1 0x5613f8bd8bbd <unknown>
#2 0x5613f8bc3233 <unknown>
#3 0x5613f8bc1c77 <unknown>
#4 0x5613f8bc2408 <unknown>
#5 0x5613f8bcf67f <unknown>
#6 0x5613f8bd02d2 <unknown>
#7 0x5613f8be0fd0 <unknown>
#8 0x5613f8be534b <unknown>
#9 0x5613f8bc29c5 <unknown>
#10 0x5613f8be0bd2 <unknown>
#11 0x5613f8c4d7a0 <unknown>
#12 0x5613f8c35753 <unknown>
#13 0x5613f8c08a14 <unknown>
#14 0x5613f8c09b7e <unknown>
#15 0x5613f8e5332e <unknown>
#16 0x5613f8e56c0e <unknown>
#17 0x5613f8e39610 <unknown>
#18 0x5613f8e57c23 <unknown>
#19 0x5613f8e2b545 <unknown>
#20 0x5613f8e786a8 <unknown>
#21 0x5613f8e78836 <unknown>
#22 0x5613f8e93d13 <unknown>
#23 0x7fc0d591cea7 start_thread
[2023/01/28 14:19:29|__init__.py |INFO ]: ensuring close
@infctr You need to set the "--no-sandbox" and "--disable-setuid-sandbox" flags in your config.yaml file. Also, don't set the "--headless" flag.
@hruzgar Did imagetyperz work for you before with IS24? I had a similar "Captcha is not ready yet" error, so I had to switch to 2captcha.
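To illustrate the "not ready yet" loop seen in the log above, here is a minimal sketch of polling a solver with a deadline. The `CaptchaUnsolvableError` class below is a local stand-in named after the one in the log, not an import from Flathunter; `wait_for_solution`, the `poll` callback and the timeout and interval values are hypothetical.

```python
import time


class CaptchaUnsolvableError(Exception):
    """Local stand-in for the error named in the log output."""


def wait_for_solution(poll, timeout_seconds=240.0, interval_seconds=5.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Call `poll()` until it returns a non-None solution or the deadline passes.

    Raises CaptchaUnsolvableError if no solution arrives in time.
    """
    deadline = clock() + timeout_seconds
    while True:
        solution = poll()
        if solution is not None:
            return solution
        if clock() >= deadline:
            raise CaptchaUnsolvableError("solver did not answer in time")
        sleep(interval_seconds)


if __name__ == "__main__":
    # Simulated solver: not ready twice, then returns a token.
    answers = iter([None, None, "token"])
    print(wait_for_solution(lambda: next(answers), sleep=lambda seconds: None))
```

A caller can catch the error and retry, or fall back to a different solving service, which is roughly the choice being discussed here.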
Yeah, it was working (and is still working) on my main PC. But I want to run the bot on my server to avoid a high energy bill (my PC is beefy). That's the reason I am trying to get it working inside docker without any GUI. I could still try whether it works with 2captcha, though. Worth a try for sure.
I've started the image with these driver flags, but unfortunately it didn't make a difference in the container:
"--no-sandbox",
"--disable-gpu",
"--disable-setuid-sandbox",
I just tried running the bot locally on my PC again. The weird thing is that it works with the "--headless" argument for a certain amount of time before it fails again. But as soon as I comment out the "--headless" flag and run the bot again, it fires up a Chrome tab that says I am a robot, and I don't get access to the site.
@infctr The cloud_job script is expected to run once and then quit. It is designed to be installed as a cron job running on a timer. The flathunt script is configurable either to run in a loop, or as a one-time job.
@hruzgar CaptchaUnsolvableError sometimes comes up if it just can't solve the captcha, but it should retry and that shouldn't be fatal. Usually a message like 'session deleted because of page crash' comes after the container runs out of memory - are you running with a memory limit on your docker container?
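To answer the memory-limit question from inside a running container, something like the following sketch may help. It reads the standard cgroup v2 and v1 limit files; the helper names are hypothetical, and the paths assume the default cgroup mount point.

```python
from pathlib import Path

# Default locations of the memory limit for cgroup v2 and cgroup v1.
LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",
)


def parse_memory_limit(text):
    """Return the limit in bytes, or None if the file says "max" (no limit).

    Note: cgroup v1 reports "no limit" as a very large number instead.
    """
    value = text.strip()
    return None if value == "max" else int(value)


def container_memory_limit(paths=LIMIT_FILES):
    """Return the limit from the first readable file, or None if none is found."""
    for path in paths:
        try:
            return parse_memory_limit(Path(path).read_text())
        except (OSError, ValueError):
            continue
    return None


if __name__ == "__main__":
    limit = container_memory_limit()
    print("no limit found" if limit is None else f"{limit / 1024 ** 2:.0f} MiB")
```

A value well below the roughly 1 GB suggested earlier in this thread would make an out-of-memory tab crash a plausible explanation.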
Hi guys, I'm trying to set up Flathunter for ImmoScout24. I have already tried it with ebay-kleinanzeigen and immowelt with success.
I already checked all other issues regarding this problem, like Issue214; none of the solutions worked for me.
I also tried it on macOS and Ubuntu 20.04 with both the normal version and the docker version. I always get the same errors.