mordax7 closed this issue 3 years ago
You are right. I started Chromium via
chromium-browser --user-data-dir=test_dir --password-store=basic
to get a new profile in the test_dir folder and to avoid the use of the GNOME keyring (otherwise the decryption of the cookies fails).
Then I visited the desired URL on immoscout24, chose to disable all cookies in the pop-up, and closed Chromium again.
Then I copied the test_dir folder into the flathunter folder and added the following code:
import browser_cookie3
...
class Crawler:
    """Defines the Crawler interface"""
    ...
    cj = browser_cookie3.chrome(cookie_file='./test_dir/Default/Cookies', domain_name='.immobilienscout24.de')

    def get_soup_from_url(self, url):
        """Creates a Soup object from the HTML at the provided URL"""
        resp = requests.get(url, headers=self.HEADERS, cookies=self.cj)
        # if "immobilienscout24" in url:
        #     print(url)
        #     print(resp.content)
        if resp.status_code != 200:
            self.__log__.error("Got response (%i): %s", resp.status_code, resp.content)
        return BeautifulSoup(resp.content, 'html.parser')
Afterwards, I was able to crawl immoscout24 properly again. I am not sure how long this will last before the bot protection kicks in again, though, as that also happens from time to time when I use Firefox and refresh the website too often.
Thanks for the find. Here are the expiry dates that I got from the cookies.
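For anyone who wants to check the expiry dates themselves, here is a minimal sketch. It assumes the cookies are held in a standard http.cookiejar.CookieJar (which, as far as I know, is what browser_cookie3.chrome returns). The helper names cookie_expiry_report and make_cookie, and the cookie names and timestamp, are made up for illustration so that the example runs without a browser profile.

```python
from datetime import datetime, timezone
from http.cookiejar import Cookie, CookieJar


def cookie_expiry_report(jar):
    """Return (domain, name, expiry) tuples; expiry is None for session cookies."""
    report = []
    for cookie in jar:
        expires = None
        if cookie.expires is not None:
            # Cookie.expires holds seconds since the Unix epoch.
            expires = datetime.fromtimestamp(cookie.expires, tz=timezone.utc)
        report.append((cookie.domain, cookie.name, expires))
    return report


def make_cookie(name, value, domain, expires):
    """Hypothetical helper: build a cookie by hand for the demo below."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=expires, discard=expires is None,
        comment=None, comment_url=None, rest={},
    )


# Demo jar with invented cookies; replace it with the jar loaded from the profile.
jar = CookieJar()
jar.set_cookie(make_cookie("example_a", "1", ".immobilienscout24.de", 1609459200))
jar.set_cookie(make_cookie("example_b", "2", ".immobilienscout24.de", None))

for domain, name, expires in cookie_expiry_report(jar):
    print(domain, name, expires.isoformat() if expires else "session cookie")
```

With the real profile, the jar from the earlier snippet (cj) could be passed to cookie_expiry_report instead of the demo jar.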
def get_soup_from_url(self, url):
    """Creates a Soup object from the HTML at the provided URL"""
    if "immobilienscout24" in url:
        resp = requests.get(url, headers=self.HEADERS, cookies=self.cj)
        print(url)
        #import pdb; pdb.set_trace()
        if "Roboter" in resp.text:
            print('Bot Protection kicked in again')
    else:
        resp = requests.get(url, headers=self.HEADERS)
    if resp.status_code != 200:
        self.__log__.error("Got response (%i): %s", resp.status_code, resp.content)
    return BeautifulSoup(resp.content, 'html.parser')
I modified the code further so that the cookies are only applied when connecting to immoscout24.
But after a few minutes, the bot protection kicks in again... So this is not a permanent solution either.
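The per-site cookie handling and the bot-page check can also be pulled out into small helpers, which makes them easier to test without hitting the site. This is only a sketch: the function names are hypothetical, the "Roboter" marker is simply the string used in the snippet in this thread, and matching on the parsed hostname instead of a substring of the URL is my own choice.

```python
from urllib.parse import urlparse

# Marker string taken from the snippet in this thread; the real page may change.
BOT_MARKER = "Roboter"


def is_immoscout_url(url):
    """True if the URL's hostname is immobilienscout24.de or a subdomain of it."""
    host = urlparse(url).hostname or ""
    return host == "immobilienscout24.de" or host.endswith(".immobilienscout24.de")


def request_kwargs(url, headers, cookie_jar):
    """Build keyword arguments for requests.get, adding cookies only for immoscout24."""
    kwargs = {"headers": headers}
    if is_immoscout_url(url):
        kwargs["cookies"] = cookie_jar
    return kwargs


def looks_like_bot_page(html_text):
    """Heuristic check for the bot-protection page."""
    return BOT_MARKER in html_text


print(is_immoscout_url("https://www.immobilienscout24.de/Suche/"))
print(is_immoscout_url("https://example.com/?ref=immobilienscout24"))
```

In get_soup_from_url this would be used roughly as requests.get(url, **request_kwargs(url, self.HEADERS, self.cj)), followed by a looks_like_bot_page(resp.text) check.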
@choeffer I'm quite new to Python. Can you tell me how to set up your solution?
@vitormalencar I switched back to the latest version from this repo and immoscout seems to work fine (at least one run). So I have dropped my changes.
Hmm, strange. I'm getting the index error and I'm not sure what I should do about it =/
@vitormalencar Then you should post your logs etc. in #53, as this seems to be the error/issue you are facing.
Hi, I am facing the same problem. Here is the log:
[2020/09/26 13:07:52|config.py |INFO ]: Using config /home/user/flathunter/config.yaml
[2020/09/26 13:07:52|flathunt.py |DEBUG ]: Settings from config: <flathunter.config.Config object at 0x7f1e778fc2b0>
[2020/09/26 13:07:52|crawl_immobilienscout.py|DEBUG ]: Got search URL https://www.immobilienscout24.de/Suche/shape/wohnung-mieten?shape=dX1sX0lxe2BwQXpqRWlmRGZmQG97QnBBfXFUX3BBb3dEZVl7S3loQHJjQXtkQmdxRXFyQT93ZEZ6ZU1xcEBiZ0hsZUFocUVyQXBuQnZoQHpLdGpBbnlD&numberofrooms=3.0-&price=-1800.0&livingspace=80.0-&enteredFrom=result_list#/&pagenumber={0}
[2020/09/26 13:07:52|crawl_immobilienscout.py|DEBUG ]: Index Error occurred
[2020/09/26 13:07:52|crawl_immobilienscout.py|DEBUG ]: []
[2020/09/26 13:07:52|crawl_immobilienscout.py|DEBUG ]: extracted: 0
It totally works on other sites like ebay-kleinanzeigen.de.
The strange thing is that it worked once (for this test search at least) and posted all the results. Then I killed the hunt (Ctrl+C). After that I could not get any further results...
Any idea? Are there some cookies lying around or something?
Thanks a lot for help!!
The problem will be solved once https://github.com/flathunters/flathunter/pull/61 is merged, but the fix will require https://2captcha.com/.
https://github.com/flathunters/flathunter/pull/61 is merged, closing the ticket.
immobilienscout24 added cookies to their headers. We have to get them before proceeding with the crawling.
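One way to "get the cookies first" is to visit a page with a cookie-persisting session before requesting the search URL. The sketch below is an assumption about how that could look, not what the project ended up doing (the thread indicates the merged fix relies on 2captcha, so a warm-up request alone may not get past the bot protection). fetch_with_warmup and RecordingSession are hypothetical names; RecordingSession only stands in for requests.Session so the example runs without network access.

```python
def fetch_with_warmup(session, warmup_url, target_url, headers=None):
    """Visit warmup_url first so the session can collect cookies, then fetch target_url."""
    session.get(warmup_url, headers=headers)
    return session.get(target_url, headers=headers)


class RecordingSession:
    """Stand-in for requests.Session: records URLs instead of using the network."""

    def __init__(self):
        self.visited = []

    def get(self, url, headers=None):
        self.visited.append(url)
        return url  # a real session would return a Response object


demo = RecordingSession()
result = fetch_with_warmup(
    demo,
    "https://www.immobilienscout24.de/",
    "https://www.immobilienscout24.de/Suche/",
)
print(demo.visited)
```

With the real library, a requests.Session() instance would be passed instead of RecordingSession; a Session keeps cookies set by earlier responses and sends them on later requests to the same site.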