immobilienscout24: add missing cookies to the header

mordax7 commented 4 years ago

immobilienscout24 now sets cookies in its headers. We have to fetch them before proceeding with the crawl.

choeffer commented 4 years ago

You are right. I started Chromium via chromium-browser --user-data-dir=test_dir --password-store=basic to get a fresh profile in the test_dir folder and to avoid using the GNOME keyring (otherwise decrypting the cookies fails).

Then I visited the desired immoscout24 URL, chose to disable all cookies in the pop-up, and closed Chromium again.

Then I copied the test_dir folder into the flathunter folder and added the following code:

import browser_cookie3

...

class Crawler:
    """Defines the Crawler interface"""

...

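    # Load the immoscout24 cookies from the copied Chromium profile.
    # The path is relative, so it depends on the current working directory.
    cj = browser_cookie3.chrome(cookie_file='./test_dir/Default/Cookies', domain_name='.immobilienscout24.de')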

    def get_soup_from_url(self, url):
        """Creates a Soup object from the HTML at the provided URL"""
        resp = requests.get(url, headers=self.HEADERS, cookies=self.cj)
        # if "immobilienscout24" in url:
        #     print(url)
        #     print(resp.content)
        if resp.status_code != 200:
            self.__log__.error("Got response (%i): %s", resp.status_code, resp.content)
        return BeautifulSoup(resp.content, 'html.parser')

Afterwards, I was able to crawl immoscout24 properly again. But I am not sure how long this will last before the bot protection kicks in again, since that also happens from time to time in Firefox when refreshing the website too often.

mordax7 commented 4 years ago

Thanks for the find. Here are the expiry dates that I got from the cookies.
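For anyone who wants to check the expiry dates themselves, here is a minimal sketch. It assumes the cookies are held in a standard http.cookiejar.CookieJar, which is the type browser_cookie3 returns. The helper names cookie_expiry_dates and make_cookie, and the two demo cookies, are made up for illustration and are not part of flathunter.

```python
from datetime import datetime, timezone
from http.cookiejar import Cookie, CookieJar


def cookie_expiry_dates(jar):
    """Map each cookie name to its expiry as a UTC datetime (None for session cookies)."""
    result = {}
    for cookie in jar:
        if cookie.expires is None:
            result[cookie.name] = None
        else:
            result[cookie.name] = datetime.fromtimestamp(cookie.expires, tz=timezone.utc)
    return result


def make_cookie(name, value, domain, expires):
    """Build a demo cookie; real cookies would come from the browser profile."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=True, expires=expires, discard=expires is None,
        comment=None, comment_url=None, rest={},
    )


# Hypothetical demo data: one persistent cookie and one session cookie.
jar = CookieJar()
jar.set_cookie(make_cookie("demo_persistent", "x", ".immobilienscout24.de", 1609459200))
jar.set_cookie(make_cookie("demo_session", "y", ".immobilienscout24.de", None))

dates = cookie_expiry_dates(jar)
for name, expiry in dates.items():
    print(name, expiry)
```

With a real profile, the jar returned by browser_cookie3.chrome(...) could be passed to cookie_expiry_dates in place of the demo jar.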

choeffer commented 4 years ago

    def get_soup_from_url(self, url):
        """Creates a Soup object from the HTML at the provided URL"""
        if "immobilienscout24" in url:
            # Only immoscout24 requests get the cookies from the browser profile
            resp = requests.get(url, headers=self.HEADERS, cookies=self.cj)
            print(url)
            # The bot-protection page contains the word "Roboter"
            if "Roboter" in resp.text:
                print('Bot Protection kicked in again')
        else:
            resp = requests.get(url, headers=self.HEADERS)
        if resp.status_code != 200:
            self.__log__.error("Got response (%i): %s", resp.status_code, resp.content)
        return BeautifulSoup(resp.content, 'html.parser')

Modified further to apply the cookies only when connecting to immoscout24.
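One caveat with the substring check: "immobilienscout24" in url also matches URLs that merely mention the name in a path or query string. A stricter variant could compare the hostname instead. This is only a sketch; the names needs_cookies, looks_blocked, IMMOSCOUT_DOMAIN and BLOCK_MARKER are hypothetical, and the "Roboter" marker is simply the one used in the snippet above.

```python
from urllib.parse import urlparse

IMMOSCOUT_DOMAIN = "immobilienscout24.de"
BLOCK_MARKER = "Roboter"  # word checked for in the snippet above


def needs_cookies(url):
    """Return True only if the URL's host is immobilienscout24.de or a subdomain of it."""
    host = urlparse(url).hostname or ""
    return host == IMMOSCOUT_DOMAIN or host.endswith("." + IMMOSCOUT_DOMAIN)


def looks_blocked(html):
    """Return True if the response text looks like the bot-protection page."""
    return BLOCK_MARKER in html


print(needs_cookies("https://www.immobilienscout24.de/Suche/shape/wohnung-mieten"))
print(needs_cookies("https://www.ebay-kleinanzeigen.de/?ref=immobilienscout24"))
```

In get_soup_from_url, needs_cookies(url) would replace the substring test and looks_blocked(resp.text) the "Roboter" check.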

choeffer commented 4 years ago

But after a few minutes, the bot protection kicks in again... So this is also not a permanent solution.

vitormalencar commented 3 years ago

@choeffer I'm quite new to Python. Can you tell me how to set up your solution?

choeffer commented 3 years ago

@vitormalencar I switched back to the latest version from this repo and immoscout seems to work fine (at least for one run). So I have dropped my changes.

vitormalencar commented 3 years ago

Hmm, strange. I'm getting the index error, and I'm not sure what to do about it =/

choeffer commented 3 years ago

@vitormalencar Then you should add your logs etc. to #53, as this seems to be the issue you are facing.

pcace commented 3 years ago

Hi, I am facing the same problem. Here is the log:

[2020/09/26 13:07:52|config.py         |INFO    ]: Using config /home/user/flathunter/config.yaml
[2020/09/26 13:07:52|flathunt.py       |DEBUG   ]: Settings from config: <flathunter.config.Config object at 0x7f1e778fc2b0>
[2020/09/26 13:07:52|crawl_immobilienscout.py|DEBUG   ]: Got search URL https://www.immobilienscout24.de/Suche/shape/wohnung-mieten?shape=dX1sX0lxe2BwQXpqRWlmRGZmQG97QnBBfXFUX3BBb3dEZVl7S3loQHJjQXtkQmdxRXFyQT93ZEZ6ZU1xcEBiZ0hsZUFocUVyQXBuQnZoQHpLdGpBbnlD&numberofrooms=3.0-&price=-1800.0&livingspace=80.0-&enteredFrom=result_list#/&pagenumber={0}
[2020/09/26 13:07:52|crawl_immobilienscout.py|DEBUG   ]: Index Error occurred
[2020/09/26 13:07:52|crawl_immobilienscout.py|DEBUG   ]: []
[2020/09/26 13:07:52|crawl_immobilienscout.py|DEBUG   ]: extracted: 0

It works fine on other sites like ebay-kleinanzeigen.de.

The strange thing is that it worked once (for this test search at least) and posted all the results. Then I killed the hunt (Ctrl+C). After that I could not get any further results...

Any idea? Are there some cookies left lying around, or something like that?

Thanks a lot for the help!

mordax7 commented 3 years ago

The problem will be solved once https://github.com/flathunters/flathunter/pull/61 is merged, but the fix will require https://2captcha.com/.

mordax7 commented 3 years ago

https://github.com/flathunters/flathunter/pull/61 is merged, closing the ticket.

immobilienscout24: add missing cookies to the header #51