flathunters / flathunter

A bot to help people with their rental real-estate search. 🏠🤖
GNU Affero General Public License v3.0

www.immobilienscout24.de crawl_immobilienscout.py|ERROR ]: Index error occurred #214

Closed heapxor closed 1 year ago

heapxor commented 1 year ago

Hello, I'm using the following URL under urls: in the config.

After execution I get the following error. Is that because 2captcha is missing from the config file?

flathunter@docker-base:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
[2022/09/05 15:41:58|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/05 15:41:58|crawl_immobilienscout.py|ERROR ]: Index error occurred

^CTraceback (most recent call last):
  File "/home/flathunter/flathunter/flathunt.py", line 110, in <module>
    main()
  File "/home/flathunter/flathunter/flathunt.py", line 106, in main
    launch_flat_hunt(config, heartbeat)
  File "/home/flathunter/flathunter/flathunt.py", line 36, in launch_flat_hunt
    time.sleep(config.loop_period_seconds())
KeyboardInterrupt

thanks!

codders commented 1 year ago

Yes - that's very likely. Crawling immoscout without 2captcha / imagetyperz support is expected to fail. Does it work if you configure the captcha solving?

heapxor commented 1 year ago

@codders is there any difference between 2captcha and imagetyperz? Thanks.

codders commented 1 year ago

There isn't much difference. For a while we had problems with 2captcha, so we integrated Imagetyperz as a backup, but 2captcha has been working fine again for a while.

I would suggest 2captcha for now - that's what I'm using, so at least if it breaks there is someone trying to fix it for you :)

heapxor commented 1 year ago

@codders sure thanks.

still getting weird error :(

flathunter@docker-base:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
[2022/09/09 13:05:04|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/09 13:05:04|abstract_crawler.py |INFO ]: Initializing Chrome WebDriver for crawler "CrawlImmobilienscout"...
[2022/09/09 13:05:05|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/d9b6d2334d8b50fd_chromedriver
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/dprocess.py", line 59, in _start_detached
    p = Popen([executable, *args], stdin=PIPE, stdout=PIPE, stderr=PIPE, **kwargs)
  File "/usr/lib/python3.10/subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1717, in _execute_child
    and os.path.dirname(executable)
  File "/usr/lib/python3.10/posixpath.py", line 152, in dirname
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType

codders commented 1 year ago

Can you set verbose_logging to true in your config and try again? Also might be good to clear out the webdriver-manager cache (/home/flathunter/.wdm).

heapxor commented 1 year ago

@codders I turned verbose logging on, but I can't see the folder for the webdriver-manager cache. Do I have to install webdriver-manager?

flathunter@docker-base:~/flathunter$ ls /home/flathunter/.wdm
ls: cannot access '/home/flathunter/.wdm': No such file or directory

The error is here:

flathunter@docker-base:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
[2022/09/09 16:16:07|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/09 16:16:07|logging.py |DEBUG ]: Settings from config: <flathunter.config.Config object at 0x7f820dc66a70>
[2022/09/09 16:16:07|abstract_crawler.py |INFO ]: Initializing Chrome WebDriver for crawler "CrawlImmobilienscout"...
[2022/09/09 16:16:08|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/c1484b1e513af397_chromedriver
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/dprocess.py", line 59, in _start_detached
    p = Popen([executable, *args], stdin=PIPE, stdout=PIPE, stderr=PIPE, **kwargs)
  File "/usr/lib/python3.10/subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.10/subprocess.py", line 1717, in _execute_child
    and os.path.dirname(executable)
  File "/usr/lib/python3.10/posixpath.py", line 152, in dirname
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType

codders commented 1 year ago

Seems like you're not the only person with this issue: https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/285 https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/787 https://github.com/ultrafunkamsterdam/undetected-chromedriver/issues/193

What can you tell me about your execution environment? Is Google Chrome / Chromium definitely installed? Are you running inside any kind of container or virtualisation?
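The `TypeError ... not NoneType` in the tracebacks above is consistent with no browser executable being found to launch. A quick way to check that from Python is sketched below. The binary names in `CANDIDATE_BROWSERS` and the helper `find_browser` are assumptions for illustration, not something flathunter defines.

```python
import shutil
from typing import Optional, Sequence

# Assumed binary names; these vary by distribution and install method.
CANDIDATE_BROWSERS = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
    "chrome",
)


def find_browser(candidates: Sequence[str] = CANDIDATE_BROWSERS) -> Optional[str]:
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    browser = find_browser()
    if browser is None:
        print("No Chrome/Chromium binary found on PATH.")
    else:
        print(f"Found browser at {browser}")
```

If this prints that nothing was found, installing Chrome or Chromium is the first thing to try.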

heapxor commented 1 year ago

@codders I'm running it as a Linux user on Ubuntu Server. Is the Google Chrome / Chromium package required? I can't see it in the prerequisites.

thanks

heapxor commented 1 year ago

@codders okay i installed chrome driver and chromium browser, executed code as follows

flathunter@heap-virtual-machine:~/flathunter$ /home/flathunter/.local/bin/pipenv run python flathunt.py
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
  warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
[2022/09/10 00:22:41|config.py |INFO ]: Using config path /home/flathunter/flathunter/config.yaml
[2022/09/10 00:22:41|logging.py |DEBUG ]: Settings from config: <flathunter.config.Config object at 0x7f2d146ef310>
[2022/09/10 00:22:41|abstract_crawler.py |INFO ]: Initializing Chrome WebDriver for crawler "CrawlImmobilienscout"...
[2022/09/10 00:22:41|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/8f45eca73f1bdb25_chromedriver

And that's the error:

`Traceback (most recent call last):
  File "/home/flathunter/flathunter/flathunt.py", line 110, in <module>
    main()
  File "/home/flathunter/flathunter/flathunt.py", line 76, in main
    config.init_searchers()
  File "/home/flathunter/flathunter/flathunter/config.py", line 96, in init_searchers
    CrawlImmobilienscout(self),
  File "/home/flathunter/flathunter/flathunter/crawl_immobilienscout.py", line 39, in __init__
    self.driver = self.configure_driver(driver_arguments)
  File "/home/flathunter/flathunter/flathunter/abstract_crawler.py", line 59, in configure_driver
    driver = uc.Chrome(options=chrome_options)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 401, in __init__
    super(Chrome, self).__init__(
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
    super().__init__(DesiredCapabilities.CHROME['browserName'], "goog",
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/chromium/webdriver.py", line 92, in __init__
    super().__init__(
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 270, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 589, in start_session
    super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 363, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 428, in execute
    self.error_handler.check_response(response)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:44791
from chrome not reachable
Stacktrace:
0 0x556e5caed693
1 0x556e5c8e69db
2 0x556e5c8d681e
3 0x556e5c90f677
4 0x556e5c906e9f
5 0x556e5c942953
6 0x556e5c93c743
7 0x556e5c912533
8 0x556e5c913715
9 0x556e5cb3d7bd
10 0x556e5cb40bf9
11 0x556e5cb22f2e
12 0x556e5cb419b3
13 0x556e5cb16e4f
14 0x556e5cb60ea8
15 0x556e5cb61052
16 0x556e5cb7b71f
17 0x7fe397279b43
`

heapxor commented 1 year ago

Maybe the issue is that Ubuntu doesn't have Chrome but only Chromium?

flathunter@heap-virtual-machine:~/flathunter$ chrome --version
Command 'chrome' not found, did you mean:
  command 'chroma' from deb chroma (1.19-1ubuntu1)
  command 'chroma' from deb golang-chroma (0.9.4-1)
Try: apt install <deb name>
flathunter@heap-virtual-machine:~/flathunter$ chromium --version
Chromium 105.0.5195.52 snap
flathunter@heap-virtual-machine:~/flathunter$

edit2: I tried to install Chrome following the how-to posted here: https://linuxize.com/post/how-to-install-google-chrome-web-browser-on-ubuntu-20-04/

edit3: still crashing with the same error.

Similar issue? https://stackoverflow.com/questions/73115181/message-unknown-error-cannot-connect-to-chrome-at-127-0-0-150276

selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:44729
from chrome not reachable

heapxor commented 1 year ago

I run GNOME; when I executed it from within the GNOME session, the browser opened.

`[2022/09/10 00:42:04|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/8c382c85cf8c97b3_chromedriver
[2022/09/10 00:42:25|_common.py |INFO ]: Backing off get_soup_from_url(...) for 0.8s (selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
0 0x564be599f693
1 0x564be5798b0a
2 0x564be57d15f7
3 0x564be57d17c1
4 0x564be5804804
5 0x564be57ee94d
6 0x564be58024b0
7 0x564be57ee743
8 0x564be57c4533
9 0x564be57c5715
10 0x564be59ef7bd
11 0x564be59f2bf9
12 0x564be59d4f2e
13 0x564be59f39b3
14 0x564be59c8e4f
15 0x564be5a12ea8
16 0x564be5a13052
17 0x564be5a2d71f
18 0x7f1d4a8a7b43
)
`

The configuration for Telegram might be confusing. My receiver ID is a negative number, so I assume it should be set in the configuration as follows:

receiver_ids:

Correct?

Also, when I run the script now, I get logs like the ones below. Is that the correct behavior?

[2022/09/10 00:57:18|_common.py |INFO ]: Backing off get_soup_from_url(...) for 0.4s (selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
0 0x5571f5b8b693
1 0x5571f5984b0a
2 0x5571f59bd5f7
3 0x5571f59bd7c1
4 0x5571f59f0804
5 0x5571f59da94d
6 0x5571f59ee4b0
7 0x5571f59da743
8 0x5571f59b0533
9 0x5571f59b1715
10 0x5571f5bdb7bd
11 0x5571f5bdebf9
12 0x5571f5bc0f2e
13 0x5571f5bdf9b3
14 0x5571f5bb4e4f
15 0x5571f5bfeea8
16 0x5571f5bff052
17 0x5571f5c1971f
18 0x7f5fb82b8b43
)

codders commented 1 year ago

Hey @heapxor ,

Sorry you're having some troubles here. It would certainly make sense for us to update the documentation based on the spots that didn't make sense for you.

Incidentally, if you are looking for a flat in Berlin, you might also have success just using the hosted version at https://flathunter.codders.io - you can just log in there with Telegram and set a (basic) filter.

But otherwise, I hope to have some time in the next few days to look at your issues, or else maybe someone else can support you.

codders commented 1 year ago

@heapxor ,

I don't think your Telegram ID should be a negative number. How did you get that?

To make chrome work, maybe try these driver-arguments:

                - --no-sandbox
                - --headless
                - --disable-gpu
                - --remote-debugging-port=9222
                - --disable-dev-shm-usage
                - window-size=1024,768

heapxor commented 1 year ago

@codders, why do you think it shouldn't be a negative number? When I call this, my bot receives the message:

curl "https://api.telegram.org/botTOKEN/sendMEssage?chat_id=-628934068&text=Hello+World"

and you can see the ID is a negative number.

Where can I put these driver arguments? Thanks!
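For reference, the curl call above can be built in Python with the standard library. This is a minimal sketch: `build_send_message_url` is a hypothetical helper, `TOKEN` is the same placeholder as in the curl command, and as far as I know negative chat IDs are what Telegram uses for group chats.

```python
from urllib.parse import urlencode


def build_send_message_url(token: str, chat_id: int, text: str) -> str:
    """Build a Telegram Bot API sendMessage URL; chat_id may be negative."""
    query = urlencode({"chat_id": chat_id, "text": text})
    return f"https://api.telegram.org/bot{token}/sendMessage?{query}"


url = build_send_message_url("TOKEN", -628934068, "Hello World")
print(url)
# https://api.telegram.org/botTOKEN/sendMessage?chat_id=-628934068&text=Hello+World
```

The minus sign passes through URL encoding unchanged, so a negative ID needs no special handling here.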

heapxor commented 1 year ago

Also wondering, is this the proper behavior?

[screenshot]

And here I can see the crash after 7 hours of running the code:

[two screenshots]

codders commented 1 year ago

Ah. okay. then that must be your ID :)

Arguments go in the config file like this: [screenshot: 2022-09-10-120234_556x303_scrot]

If you are seeing crashes after some hours, try the arguments here. If the problem persists, maybe check that you have enough memory free (around 1GB for the browser and python etc.). But happy that you are receiving messages now :)
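To check the memory point on Linux, one option is to read `MemAvailable` from `/proc/meminfo`. The sketch below assumes a Linux host; `mem_available_mib` is a hypothetical helper, and the 1024 MiB threshold simply mirrors the "around 1GB" figure above.

```python
from typing import Optional


def mem_available_mib(meminfo_text: str) -> Optional[int]:
    """Parse /proc/meminfo text and return MemAvailable in MiB, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib // 1024
    return None


if __name__ == "__main__":
    try:
        with open("/proc/meminfo", encoding="ascii") as handle:
            available = mem_available_mib(handle.read())
    except OSError:
        available = None  # not Linux, or /proc is not readable

    if available is None:
        print("Could not determine available memory.")
    elif available < 1024:
        print(f"Only {available} MiB available; the browser may run out of memory.")
    else:
        print(f"{available} MiB available.")
```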

heapxor commented 1 year ago

@codders cool will try the arguments! thanks

Just wondering, is this something that has to be analyzed further, or is it okay?

[screenshot]

Yes, I will try to add more RAM to that machine.

codders commented 1 year ago

The CAPCHA_NOT_READY message is very normal. That happens every time a captcha is solved.

CaptchaUnsolvableError also happens from time to time. Sometimes, 2captcha just can't solve the captcha. The IndexError: list index out of range happens with ImmoScout when the captcha solving fails.

When you get these errors, best is just to restart. I want to change the code soon so that it retries if it gets a CaptchaUnsolvableError. But if you see that message every time, it is probably a problem with 2captcha (or something with the ImmoScout website has changed).

For now, you can either run the code as a cron job - set it to run every 10 minutes then quit (by disabling the 'loop' option), or you can run it as a systemd service (there is some documentation around that). Systemd will restart it when it exits.

TimeoutException is also possible. It's not a bad thing if that happens - the system will retry.
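The retry idea mentioned above could look roughly like this. It is only a sketch, not flathunter's code: `CaptchaUnsolvableError` is redefined locally as a stand-in for the project's exception, and `retry_on_captcha_failure` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CaptchaUnsolvableError(Exception):
    """Local stand-in for the project's exception of the same name."""


def retry_on_captcha_failure(
    action: Callable[[], T], attempts: int = 3, delay_seconds: float = 0.0
) -> T:
    """Call action(); retry when the captcha could not be solved."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[CaptchaUnsolvableError] = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except CaptchaUnsolvableError as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)  # short pause before the next try
    assert last_error is not None
    raise last_error  # every attempt failed; surface the last error
```

Other exceptions propagate immediately, so only the captcha failure is retried.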

heapxor commented 1 year ago

@codders, sounds cool. I was thinking of running it every 4 minutes, is that also okay?

Okay, disabling the loop and executing via cron makes sense; that way I can work around the captcha issue and it should be safe.

Where do you set that TimeoutException? Or is it planned to be developed? Thanks!

edit2: @codders I'm also assuming there is no functionality to automate a scenario such as sending a message to a new ad.

codders commented 1 year ago

Running more quickly is also okay for Immoscout. With ebay Kleinanzeigen you can get an IP block if you crawl too quickly. Just be aware there is no locking / concurrency control, so if the previous run didn't finish after 4 mins, you will have two flathunters at once, which will have weird effects.

For the timeouts and other errors, there is no plan right now. People who want it to be different make pull requests :)
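The overlap problem with cron described above can be avoided with a lock file. The sketch below uses `fcntl.flock`, so it assumes Linux or another Unix; `single_instance` and the lock path are hypothetical and not part of flathunter.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def single_instance(lock_path: str) -> Iterator[bool]:
    """Yield True if this process holds the lock, False if another run has it."""
    handle = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        handle.close()  # closing the file releases the lock if we held it


if __name__ == "__main__":
    lock_file = os.path.join(tempfile.gettempdir(), "flathunter-example.lock")
    with single_instance(lock_file) as acquired:
        if acquired:
            print("Lock acquired; safe to start a crawl run.")
        else:
            print("Previous run still active; exiting.")
```

A cron job wrapped this way would skip a run instead of starting a second flathunter alongside the first.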

heapxor commented 1 year ago

@codders, it stopped working? :(

[2022/09/11 21:39:20|patcher.py              |INFO    ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/b023f19cc09c2dbf_chromedriver
Traceback (most recent call last):
  File "/home/flathunter/flathunter/flathunt.py", line 110, in <module>
    main()
  File "/home/flathunter/flathunter/flathunt.py", line 76, in main
    config.init_searchers()
  File "/home/flathunter/flathunter/flathunter/config.py", line 96, in init_searchers
    CrawlImmobilienscout(self),
  File "/home/flathunter/flathunter/flathunter/crawl_immobilienscout.py", line 39, in __init__
    self.driver = self.configure_driver(driver_arguments)
  File "/home/flathunter/flathunter/flathunter/abstract_crawler.py", line 59, in configure_driver
    driver = uc.Chrome(options=chrome_options)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 401, in __init__
    super(Chrome, self).__init__(
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
    super().__init__(DesiredCapabilities.CHROME['browserName'], "goog",
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/chromium/webdriver.py", line 92, in __init__
    super().__init__(
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 270, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/undetected_chromedriver/__init__.py", line 589, in start_session
    super(selenium.webdriver.chrome.webdriver.WebDriver, self).start_session(
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 363, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/webdriver.py", line 428, in execute
    self.error_handler.check_response(response)
  File "/home/flathunter/.local/share/virtualenvs/flathunter-2Wmg4yEE/lib/python3.10/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:49795
from chrome not reachable
Stacktrace:

Any idea why?

selenium.common.exceptions.WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:49795
from chrome not reachable

heapxor commented 1 year ago

Hm, so I commented out

       #         - "--remote-debugging-port=9222"
       #         - "--disable-dev-shm-usage"

and it works.

codders commented 1 year ago

@heapxor Can we mark this issue as closed? Do you want to add some information to the README about your tips for making it work successfully?

mourraille commented 1 year ago

These driver arguments did the trick for me. Currently running on Ubuntu. I also had to install chrome as suggested by @heapxor

codders commented 1 year ago

okay. I'll mark this as closed. If you want to make a PR to update the documentation about the chrome requirement for 2captcha support, that would be very welcome :)