Error/crash on json data with immoscout

step21 commented 2 years ago

After some time, it crashed with the following error. I will try to narrow it down further or maybe just restart via supervisor.

Traceback (most recent call last):
  File "flathunt.py", line 95, in <module>
    main()
  File "flathunt.py", line 92, in main
    launch_flat_hunt(config)
  File "flathunt.py", line 51, in launch_flat_hunt
    hunter.hunt_flats()
  File "/home/user/flathunter/flathunter/hunter.py", line 42, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
  File "/home/user/flathunter/flathunter/hunter.py", line 22, in crawl_for_exposes
    for searcher in self.config.searchers()
  File "/home/user/flathunter/flathunter/hunter.py", line 23, in <listcomp>
    for url in self.config.get('urls', list())])
  File "/home/user/flathunter/flathunter/abstract_crawler.py", line 142, in crawl
    return self.get_results(url, max_pages)
  File "/home/user/flathunter/flathunter/crawl_immobilienscout.py", line 60, in get_results
    soup = self.get_page(search_url, self.driver, page_no)
  File "/home/user/flathunter/flathunter/crawl_immobilienscout.py", line 120, in get_page
    return self.get_soup_from_url(search_url.format(page_no), driver=driver, captcha_api_key=self.captcha_api_key, checkbox=self.checkbox, afterlogin_string=self.afterlogin_string)
  File "/home/user/flathunter/flathunter/abstract_crawler.py", line 79, in get_soup_from_url
    self.resolvegeetest(driver, captcha_api_key)
  File "/home/user/flathunter/flathunter/abstract_crawler.py", line 186, in resolvegeetest
    recaptcha_answer = json.loads(recaptcha_answer)
  File "/home/user/miniforge3/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/home/user/miniforge3/lib/python3.7/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 2 (char 1)

flys1ck commented 2 years ago

I ran into the same issue. This happens when 2captcha is under high load, which results in:

[2022/01/20 16:08:29|abstract_crawler.py|DEBUG      ]: Captcha promise: 503 Service Unavailable

Just wait until https://2captcha.com/public_statistics/service-load shows normal load again, then restart the script.

step21 commented 2 years ago

Cool, thanks for providing more input. I am running the script with supervisor anyway, so it restarts automatically. Ideally, the error would be caught and a wait added so that 2captcha is not retried too often (though I am not sure how often it is queried at the moment).
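A minimal sketch of that idea, not the actual flathunter code: parse the captcha answer and, if it is not valid JSON, wait with exponential backoff before asking again. The names `parse_captcha_answer` and `fetch_answer`, the attempt count, and the delays are all assumptions made for illustration.

```python
import json
import time


def parse_captcha_answer(fetch_answer, max_attempts=5, base_delay=5.0, sleep=time.sleep):
    """Call fetch_answer() until it returns valid JSON, backing off between tries.

    fetch_answer is a hypothetical zero-argument callable that returns the raw
    response text from the captcha service.
    """
    for attempt in range(max_attempts):
        raw = fetch_answer()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Not JSON (for example "503 Service Unavailable").
            # Give up on the last attempt, otherwise wait and retry.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is only there so the wait can be swapped out in tests. With the defaults, the waits are 5, 10, 20 and 40 seconds before the final attempt.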

iwasherefirst2 commented 2 years ago

@step21 how did you set up supervisor with pip? I have installed this package on a VPS and I can run the job with this script:

#!/bin/bash

pipenv shell
/home/adam/.local/share/virtualenvs/flathunter-SBM5Nxub/bin/python /home/adam/flathunter/flathunt.py 2>> /home/adam/flathunter/job.log

When I try to run this script using supervisor I get this issue:


Traceback (most recent call last):
  File "/home/adam/flathunter/flathunt.py", line 95, in <module>
    main()
  File "/home/adam/flathunter/flathunt.py", line 68, in main
    config = Config(config_handle.name)
  File "/home/adam/flathunter/flathunter/config.py", line 29, in __init__
    self.__searchers__ = [CrawlImmobilienscout(self),
  File "/home/adam/flathunter/flathunter/crawl_immobilienscout.py", line 43, in __init__
    self.driver = self.configure_driver(self.driver_executable_path, self.driver_arguments)
  File "/home/adam/flathunter/flathunter/abstract_crawler.py", line 53, in configure_driver
    driver = webdriver.Chrome(executable_path=driver_path, options=chrome_options)
  File "/home/adam/.local/share/virtualenvs/flathunter-SBM5Nxub/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/home/adam/.local/share/virtualenvs/flathunter-SBM5Nxub/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/home/adam/.local/share/virtualenvs/flathunter-SBM5Nxub/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/adam/.local/share/virtualenvs/flathunter-SBM5Nxub/lib/python3.6/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/home/adam/.local/share/virtualenvs/flathunter-SBM5Nxub/lib/python3.6/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

Any hints?

step21 commented 2 years ago

My supervisor file:

[program:flathunter]
command=/home/<user>/.local/bin/pipenv run python /home/<user>/flathunter/flathunt.py
directory=/home/<user>/flathunter/
user=<user>
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile=/home/<user>/flathunter/supervisor_flathunter.log

It uses pipenv rather than plain pip. pipenv does not replace pip, but it is the best venv tool I know. However, since your problem is with google-chrome/chromedriver, I suspect the cause is a different one. Quick googling suggests that this happens, for example, when you run as the wrong user or with the wrong permissions. If I had to guess, you didn't set a user, so supervisor is running the process as its own user, or at least not as the one you want. It also helps to set a working directory for the process if you haven't done that yet.

iwasherefirst2 commented 2 years ago

@step21 okay, thank you! What is the working directory? Is it where directory points to? And should it only contain the .conf file? This is all I found when searching for "working directory": http://supervisord.org/running.html?highlight=working%20directory#running-supervisord

step21 commented 2 years ago

You have to distinguish between the main supervisor config file (which is what the link refers to) and the config file for an individual service. Any program that you or supervisor run has a "current working directory". In a terminal this is the directory you are currently in; in Python it is available via os.getcwd(). In many cases it does not matter, but it does if, for example, you open a file or specify a log file without an absolute path, because the path is then resolved relative to that directory. In my config example above, I just set it (the directory= line) to the flathunter directory.
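To see the effect, here is a small sketch that shows where a relative path ends up depending on the working directory. The helper `resolve_from` is made up for illustration, and job.log is just the log file name from the script earlier in this thread.

```python
import os


def resolve_from(directory, relative_path):
    """Show where relative_path points when the process runs in directory."""
    previous = os.getcwd()
    os.chdir(directory)  # comparable to what supervisor's directory= setting does
    try:
        # abspath resolves relative paths against the current working directory
        return os.path.abspath(relative_path)
    finally:
        os.chdir(previous)  # restore the original working directory


if __name__ == "__main__":
    print("current working directory:", os.getcwd())
    print("job.log resolves to:", os.path.abspath("job.log"))
```

Running this once from a terminal and once as a supervisor program should make any difference in the working directory visible.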

iwasherefirst2 commented 2 years ago

@step21 sorry for asking once again. I tried to set up supervisor as you said, so I installed pipenv. Now I get:


Traceback (most recent call last):
  File "/home/adam/flathunter/flathunt.py", line 15, in <module>
    from flathunter.hunter import Hunter
  File "/home/adam/flathunter/flathunter/hunter.py", line 5, in <module>
    from flathunter.config import Config
  File "/home/adam/flathunter/flathunter/config.py", line 4, in <module>
    import yaml
ModuleNotFoundError: No module named 'yaml'

codders commented 2 years ago

@iwasherefirst2 Are you running flathunter using the pipenv run command? If you run pipenv shell on the command line and then try typing import yaml into the python interpreter, does that work for you?

iwasherefirst2 commented 2 years ago

@codders yes, I have command=/usr/local/bin/pipenv run python /home/adam/flathunter/flathunt.py in supervisor. If I run this on the console it works on my VPS, but under supervisor it fails. Also, import yaml works in the Python interpreter after calling pipenv shell.

step21 commented 2 years ago

Sorry for only commenting now. This suggests that your environment under supervisor is not the same as when you run from a shell. Something similar happened to me, for example, when re-deploying for someone else where the username was different, or when using su without the switches that make it use a login shell. In my case, supervisor explicitly runs flathunter as a normal user. If you run it as root, or mixed things up during testing and ran it sometimes as root and sometimes as a user, this might be the problem.
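One way to compare the two environments is to run a small diagnostic script both from the shell and as a supervisor program, then compare the output. This is only a sketch: `environment_report` is a made-up helper, and the list of things to check is my guess at what usually differs. As far as I know, supervisor does not adjust variables such as HOME when it switches users, which can change which virtualenv pipenv picks up.

```python
import importlib.util
import os
import sys


def environment_report(modules=("yaml",)):
    """Collect facts that often differ between a shell and supervisor."""
    report = {
        "uid": os.geteuid(),                  # effective user id (Unix only)
        "cwd": os.getcwd(),                   # current working directory
        "python": sys.executable,             # interpreter actually in use
        "HOME": os.environ.get("HOME", ""),
        "PATH": os.environ.get("PATH", ""),
    }
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        report["has_" + name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

If python points to a different interpreter under supervisor, or has_yaml is False there, the process is not running inside the virtualenv that pipenv created for your user.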

step21 commented 2 years ago

Anyway, closing this as the original problem is solved IIRC. If you still experience this issue @iwasherefirst2 please open a new one.

flathunters / flathunter

Error/crash on json data with immoscout #138