Closed heapxor closed 2 years ago
Yes - that's very likely. Crawling immoscout without 2captcha / imagetyperz support is expected to fail. Does it work if you configure the captcha solving?
@codders is there any difference between 2captcha and imagetyperz? Thanks.
There isn't much difference. For a while we had problems with 2captcha, so we integrated Imagetyperz as a backup, but 2captcha has been working fine again for a while now.
I would suggest 2captcha for now - that's what I'm using, so at least if it breaks there is someone trying to fix it for you :)
@codders sure thanks.
still getting weird error :(
flathunter@docker-base:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
warnings.warn(
[2022/09/09 13:05:04|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/09 13:05:04|abstract_crawler.py |INFO ]: Initializing Chrome WebDriver for crawler "CrawlImmobilienscout"...
[2022/09/09 13:05:05|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/d9b6d2334d8b50fd_chromedriver
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/dprocess.py", line 59, in _start_detached
p = Popen([executable, *args], stdin=PIPE, stdout=PIPE, stderr=PIPE, **kwargs)
File "/usr/lib/python3.10/subprocess.py", line 966, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.10/subprocess.py", line 1717, in _execute_child
and os.path.dirname(executable)
File "/usr/lib/python3.10/posixpath.py", line 152, in dirname
p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
Can you set verbose_logging to true in your config and try again? Also might be good to clear out the webdriver-manager cache (/home/flathunter/.wdm).
@codders I turned verbose logging on, but I can't see that webdriver-manager cache folder. Do I have to install webdriver-manager?
flathunter@docker-base:~/flathunter$ ls /home/flathunter/.wdm
ls: cannot access '/home/flathunter/.wdm': No such file or directory
The error is here:
flathunter@docker-base:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
warnings.warn(
[2022/09/09 16:16:07|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/09 16:16:07|logging.py |DEBUG ]: Settings from config: <flathunter.config.Config object at 0x7f820dc66a70>
[2022/09/09 16:16:07|abstract_crawler.py |INFO ]: Initializing Chrome WebDriver for crawler "CrawlImmobilienscout"...
[2022/09/09 16:16:08|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/c1484b1e513af397_chromedriver
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/dprocess.py", line 59, in _start_detached
p = Popen([executable, *args], stdin=PIPE, stdout=PIPE, stderr=PIPE, **kwargs)
File "/usr/lib/python3.10/subprocess.py", line 966, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.10/subprocess.py", line 1717, in _execute_child
and os.path.dirname(executable)
File "/usr/lib/python3.10/posixpath.py", line 152, in dirname
p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
Seems like you're not the only person with this issue: https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/285 https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/787 https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/193
What can you tell me about your execution environment? Is Google Chrome / Chromium definitely installed? Are you running inside any kind of container or virtualisation?
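A minimal sketch of how one might check for an installed browser from Python, using only the standard library. The candidate executable names are an assumption and vary by distribution; `find_browser` is a hypothetical helper, not part of flathunter.

```python
import shutil

# Candidate executable names are an assumption; adjust for your distribution.
CANDIDATES = ("google-chrome", "google-chrome-stable", "chromium", "chromium-browser")


def find_browser(candidates=CANDIDATES):
    """Return (name, path) of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return name, path
    return None


found = find_browser()
if found is None:
    print("No Chrome/Chromium executable found on PATH")
else:
    print("Found %s at %s" % found)
```

If this prints that nothing was found, the browser is probably either not installed or installed under a name that is not in the candidate list.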
@codders I'm running it as a Linux user on an Ubuntu server. Is the Google Chrome / Chromium package required? I can't see it in the prerequisites.
thanks
@codders okay i installed chrome driver and chromium browser, executed code as follows
flathunter@heap-virtual-machine:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
warnings.warn(
[2022/09/10 00:22:41|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/10 00:22:41|logging.py |DEBUG ]: Settings from config: <flathunter.config.Config object at 0x7f2d146ef310>
[2022/09/10 00:22:41|abstract_crawler.py |INFO ]: Initializing Chrome WebDriver for crawler "CrawlImmobilienscout"...
[2022/09/10 00:22:41|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/8f45eca73f1bdb25_chromedriver
and that's the error:
`Traceback (most recent call last):
File "/home/flathunter/flathunter/flathunt.py", line 110, in
`
Maybe the issue is that Ubuntu doesn't have Chrome but only Chromium?
flathunter@heap-virtual-machine:~/flathunter$ chrome --version
Command 'chrome' not found, did you mean:
command 'chroma' from deb chroma (1.19-1ubuntu1)
command 'chroma' from deb golang-chroma (0.9.4-1)
Try: apt install <deb name>
flathunter@heap-virtual-machine:~/flathunter$ chromium --version
Chromium 105.0.5195.52 snap
flathunter@heap-virtual-machine:~/flathunter$
edit2 i tried to install chrome via howto posted here > https://linuxize.com/post/how-to-install-google-chrome-web-browser-on-ubuntu-20-04/
edit3 still crashing same error
similar issue? https://stackoverflow.com/questions/73115181/message-unknown-error-cannot-connect-to-chrome-at-127-0-0-150276 selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:44729 from chrome not reachable
i run Gnome; executed it via gnome ... browser got opened
` 2022/09/10 00:42:04|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/8c382c85cf8c97b3_chromedriver [2022/09/10 00:42:25|_common.py |INFO ]: Backing off get_soup_from_url(...) for 0.8s (selenium.common.exceptions.TimeoutException: Message: Stacktrace:
`
The configuration for Telegram might be confusing? My receiver ID is a negative number, so I assume it should be set in the configuration as follows:
receiver_ids:
correct?
Also in case i run the script now ... i am getting logs as below, is that correct behavior?
[2022/09/10 00:57:18|_common.py |INFO ]: Backing off get_soup_from_url(...) for 0.4s (selenium.common.exceptions.TimeoutException: Message: Stacktrace:
Hey @heapxor ,
Sorry you're having some troubles here. It would certainly make sense for us to update the documentation based on the spots that didn't make sense for you.
Incidentally, if you are looking for a flat in Berlin, you might also have success just using the hosted version at https://flathunter.codders.io - you can just log in there with Telegram and set a (basic) filter.
But otherwise, I hope to have some time in the next few days to look at your issues, or else maybe someone else can support you.
@heapxor ,
I don't think your Telegram ID should be a negative number. How did you get that?
To make chrome work, maybe try these driver-arguments:
- --no-sandbox
- --headless
- --disable-gpu
- --remote-debugging-port=9222
- --disable-dev-shm-usage
- window-size=1024,768
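A rough sketch of how a list like the one above might be handed to a browser options object. It assumes an object exposing `add_argument()`, as Selenium's `ChromeOptions` does; `apply_driver_arguments` and `RecordingOptions` are hypothetical names used only so the example runs without Selenium installed.

```python
# The arguments suggested in the thread, copied as-is.
DRIVER_ARGUMENTS = [
    "--no-sandbox",
    "--headless",
    "--disable-gpu",
    "--remote-debugging-port=9222",
    "--disable-dev-shm-usage",
    "window-size=1024,768",
]


def apply_driver_arguments(options, arguments):
    """Add each argument to an options object exposing add_argument()."""
    for argument in arguments:
        options.add_argument(argument)
    return options


class RecordingOptions:
    """Stand-in for a browser options object, used only for this demo."""

    def __init__(self):
        self.arguments = []

    def add_argument(self, argument):
        self.arguments.append(argument)


options = apply_driver_arguments(RecordingOptions(), DRIVER_ARGUMENTS)
print(options.arguments)
```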
@codders, why do you think it shouldn't be a negative number? When I call this, my bot receives the message: curl "https://api.telegram.org/botTOKEN/sendMEssage?chat_id=-628934068&text=Hello+World" and you can see the ID is a negative number.
Where can I put these driver arguments? Thanks!
also wondering is that a proper behavior?
and here i can see the crash after 7h of running code
Ah. okay. then that must be your ID :)
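For reference, Telegram group chats typically have negative chat IDs, which would explain the value here. A small sketch of building the same `sendMessage` request URL as the curl command above, using only the standard library; `build_send_message_url` is a hypothetical helper, and the code only builds the URL without sending anything.

```python
from urllib.parse import urlencode


def build_send_message_url(token, chat_id, text):
    """Build a Bot API sendMessage URL; chat_id may be negative."""
    query = urlencode({"chat_id": chat_id, "text": text})
    return "https://api.telegram.org/bot%s/sendMessage?%s" % (token, query)


# "TOKEN" is a placeholder, as in the curl command in the thread.
url = build_send_message_url("TOKEN", -628934068, "Hello World")
print(url)
```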
Arguments go in the config file like this:
If you are seeing crashes after some hours, try the arguments here. If the problem persists, maybe check that you have enough memory free (around 1GB for the browser and python etc.). But happy that you are receiving messages now :)
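A minimal sketch of the memory check, assuming a Linux host where `/proc/meminfo` reports a `MemAvailable` line in kB. The helper names and the sample text are illustrative only; the 1024 MiB default mirrors the "around 1GB" figure above.

```python
def available_memory_mib(meminfo_text):
    """Return MemAvailable from /proc/meminfo-style text in MiB, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # The value is reported in kB, e.g. "MemAvailable:  1536000 kB".
            return int(line.split()[1]) // 1024
    return None


def has_enough_memory(meminfo_text, required_mib=1024):
    """True if at least required_mib MiB are reported as available."""
    available = available_memory_mib(meminfo_text)
    return available is not None and available >= required_mib


# Made-up sample in /proc/meminfo format, so the example runs anywhere.
SAMPLE = (
    "MemTotal:        4028440 kB\n"
    "MemFree:          215000 kB\n"
    "MemAvailable:    1536000 kB\n"
)
print(available_memory_mib(SAMPLE), has_enough_memory(SAMPLE))
```

On the actual machine, the text would come from reading `/proc/meminfo` instead of the sample string.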
@codders cool will try the arguments! thanks
just wondering ... is that something that has to be analyzed further or thats okay?
yes will try to add more ram to that machine
The CAPCHA_NOT_READY message is very normal. That happens every time a captcha is solved.
CaptchaUnsolvableError also happens from time to time. Sometimes, 2captcha just can't solve the captcha. The IndexError: list index out of range happens with ImmoScout when the captcha solving fails.
When you get these errors, best is just to restart. I want to change the code soon so that it retries if it gets a CaptchaUnsolvableError. But if you see that message every time, it is probably a problem with 2captcha (or something with the ImmoScout website has changed).
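A sketch of what such a retry could look like. This is not the project's code: the `CaptchaUnsolvableError` class below is a local stand-in for the error named above, and `retry_on_captcha_failure` and `flaky_crawl` are hypothetical names.

```python
import time


class CaptchaUnsolvableError(Exception):
    """Local stand-in for the error named in the thread."""


def retry_on_captcha_failure(func, attempts=3, delay_seconds=0.0):
    """Call func(); retry if the captcha is unsolvable, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except CaptchaUnsolvableError:
            if attempt == attempts:
                raise  # Give up after the final attempt.
            time.sleep(delay_seconds)


# Demo: a crawl that fails twice before succeeding.
calls = {"count": 0}


def flaky_crawl():
    calls["count"] += 1
    if calls["count"] < 3:
        raise CaptchaUnsolvableError("captcha could not be solved")
    return ["listing"]


result = retry_on_captcha_failure(flaky_crawl)
print(result, calls["count"])
```

Capping the number of attempts matters here, since each attempt would presumably cost a paid captcha solve.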
For now, you can either run the code as a cron job - set it to run every 10 minutes then quit (by disabling the 'loop' option), or you can run it as a systemd service (there is some documentation around that). Systemd will restart it when it exits.
TimeoutException is also possible. It's not a bad thing if that happens - the system will retry.
@codders, sounds cool. I was thinking of running it every 4 minutes, is that also okay?
okay disabling loop and execute via cron makes sense; in that case i can prevent the issue with the captcha and it should be Safe.
Where do you set that TimeoutException handling? Or is it planned to be developed? Thanks!
edit2 @codders I also assume there is no functionality to automate a scenario such as sending a message to a new ad?
Running more quickly is also okay for Immoscout. With ebay Kleinanzeigen you can get an IP block if you crawl too quickly. Just be aware there is no locking / concurrency control, so if the previous run didn't finish after 4 mins, you will have two flathunters at once, which will have weird effects.
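Since there is no built-in locking, one way to avoid overlapping cron runs is a lock file. A minimal sketch using `fcntl.flock` from the standard library (Linux/Unix only); the lock path and `try_acquire_lock` are assumptions for illustration, not something flathunter provides.

```python
import fcntl
import os
import tempfile


def try_acquire_lock(path):
    """Return an open file holding an exclusive lock, or None if already locked."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


# Hypothetical lock file location, chosen for this example only.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "flathunter-example.lock")

lock = try_acquire_lock(LOCK_PATH)
if lock is None:
    print("Previous run still active, exiting")
else:
    print("Lock acquired, safe to crawl")
    # Run the crawl here; the lock is released when the file is closed
    # or the process exits.
    lock.close()
```

With this in place, a run started while the previous one is still crawling would exit early instead of producing two flathunters at once.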
For the timeouts and other errors, there is no plan right now. People who want it to be different make pull requests :)
@codders, it stopped working? :(
[2022/09/11 21:39:20|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/b023f19cc09c2dbf_chromedriver
Traceback (most recent call last):
File "/home/flathunter/flathunter/flathunt.py", line 110, in <module>
main()
File "/home/flathunter/flathunter/flathunt.py", line 76, in main
config.init_searchers()
File "/home/flathunter/flathunter/flathunter/config.py", line 96, in init_searchers
CrawlImmobilienscout(self),
File "/home/flathunter/flathunter/flathunter/crawl_immobilienscout.py", line 39, in __init__
self.driver = self.configure_driver(driver_arguments)
File "/home/flathunter/flathunter/flathunter/abstract_crawler.py", line 59, in configure_driver
driver = uc.Chrome(options=chrome_options)
File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 401, in __init__
super(Chrome, self).__init__(
File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
super().__init__(DesiredCapabilities.CHROME['browserName'], "goog",
File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/chromium/webdriver.py", line 92, in __init__
super().__init__(
File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 270, in __init__
self.start_session(capabilities, browser_profile)
File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 589, in start_session
super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 363, in start_session
response = self.execute(Command.NEW_SESSION, parameters)
File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 428, in execute
self.error_handler.check_response(response)
File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:49795
from chrome not reachable
Stacktrace:
any idea why? selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:49795 from chrome not reachable Stacktrace:
hm so i commented out
# - "--disable-dev-shm-usage"
and it works
@heapxor Can we mark this issue as closed? Do you want to add some information to the README about your tips for making it work successfully?
These driver arguments did the trick for me. Currently running on Ubuntu. I also had to install chrome as suggested by @heapxor
okay. I'll mark this as closed. If you want to make a PR to update the documentation about the chrome requirement for 2captcha support, that would be very welcome :)
Hello, I am using the following URL in the config: urls:
After execution I get the following error. Is that because 2captcha is missing from the config file?
flathunter@docker-base:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
warnings.warn(
[2022/09/05 15:41:58|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/05 15:41:58|crawl_immobilienscout.py|ERROR ]: Index error occurred
^CTraceback (most recent call last):
File "/home/flathunter/flathunter/flathunt.py", line 110, in
main()
File "/home/flathunter/flathunter/flathunt.py", line 106, in main
launch_flat_hunt(config, heartbeat)
File "/home/flathunter/flathunter/flathunt.py", line 36, in launch_flat_hunt
time.sleep(config.loop_period_seconds())
KeyboardInterrupt
thanks!