flathunters / flathunter

A bot to help people with their rental real-estate search. 🏠🤖
GNU Affero General Public License v3.0

Issue with crawling ImmoScout24: window.IS24 property resultList is missing #458

Open acidassassin opened 1 year ago

acidassassin commented 1 year ago

Hey there,

I debugged the code for the ImmoScout24 crawler, and it seems the following script returns null:

result_json = self.get_driver_force().execute_script('return window.IS24.resultList;')

I've checked in the browser, and window.IS24.resultList no longer seems to exist.

Does anyone have a working solution? Thanks

acidassassin commented 1 year ago

Okay, it seems that for "gewerbe-flaechen" searches there is no resultList...

codders commented 1 year ago

Yeah - there's no resultList, but there is a window.__INITIAL_STATE__ containing all the data you need. You should be able to parse it much like this StackOverflow answer does:

https://stackoverflow.com/questions/67203717/beautifulsoup-how-to-get-data-from-window-initial-state

It would probably be possible to extend the Immoscout crawler to check if __INITIAL_STATE__ is present in the result fetched from the server.

Are you a Python developer? Do you want to give that a go?
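
A minimal sketch of that fallback, assuming the Selenium driver exposes window.__INITIAL_STATE__ as a plain object the same way it used to expose window.IS24.resultList. The function name fetch_result_json is hypothetical and not part of flathunter; only execute_script is a real Selenium WebDriver call.

```python
from typing import Any, Optional


def fetch_result_json(driver: Any) -> Optional[dict]:
    """Return the search data from whichever global the page defines.

    `driver` is assumed to be a Selenium WebDriver (or anything with a
    compatible execute_script method) that has already loaded the page.
    """
    # Old location: guard against window.IS24 itself being undefined.
    result = driver.execute_script(
        "return (window.IS24 && window.IS24.resultList) || null;"
    )
    if result:
        return result
    # Newer location suggested in this thread; it may also be absent.
    state = driver.execute_script("return window.__INITIAL_STATE__ || null;")
    return state or None
```

Selenium converts returned JavaScript objects into Python dicts and lists, so no string parsing is needed on this path. Where the listings sit inside the returned state is not covered here and would need to be checked in the browser.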

acidassassin commented 1 year ago

Hey @codders

Thank you for your reply! I would call myself more of a scripting language developer, but Python is fun. :-)

I've managed to get the __INITIAL_STATE__ as a string, but somehow I'm not able to convert it into a usable dict/JSON. Do you have any advice?

codders commented 1 year ago

What kind of error do you get? How are you parsing it?
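
To answer that concretely, it can help to print where parsing fails. A small sketch, assuming the state was captured as a Python string; describe_json_error is a hypothetical helper name, while json.JSONDecodeError and its msg and pos attributes come from the standard library.

```python
import json
from typing import Optional


def describe_json_error(text: str, context: int = 30) -> Optional[str]:
    """Return None if text is valid JSON, else a message with nearby text."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # Show the characters around the failing position.
        start = max(err.pos - context, 0)
        snippet = text[start:err.pos + context]
        return f"{err.msg} at char {err.pos}: {snippet!r}"
    return None
```

Typical culprits are single-quoted keys, a trailing semicolon, or JavaScript-only values such as undefined, none of which are valid JSON.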

adobryn commented 10 months ago

I've tried this:

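import json
import re

logger.info("Trying to get __INITIAL_STATE__")
# Search the fetched HTML rather than the URL; page_source is assumed to
# hold the page body returned for search_url.
match = re.search(r"window\.__INITIAL_STATE__\s*=\s*", page_source)
if match is not None:
    # raw_decode parses one JSON value and ignores what follows it, so
    # semicolons inside string values do not truncate the match.
    data, _ = json.JSONDecoder().raw_decode(page_source, match.end())
    print(json.dumps(data, indent=4))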

but I'm still running into "IS24 bot detection has identified our script as a bot - we've been blocked", so I can't check whether it really works :D