asmaier / ImmoSpider

Immospider is a crawler for the Immoscout24 website.

scrapy 405 method can not handle #9

Open Gettlar opened 4 years ago

Gettlar commented 4 years ago

Hi everyone,

Until two days ago I used the Immoscout spider with Docker almost every day. Now I get a 405 method error. Scrapy can scrape the landing page with a 200 response, but not a search URL as described in the tutorial. Immobilienscout detects Scrapy as a bot and redirects to a reCAPTCHA page...

The question is: what is the simplest way to circumvent the 405 method error?

I am not the best Python expert yet, so I would be happy for any help 👍

best regards Gettlar

simonharlacher commented 4 years ago

Same thing happens to me. Looks like it requires JavaScript and cookies to be activated. Maybe Splash (https://splash.readthedocs.io/en/latest/) could help

rudobent commented 4 years ago

Immoscout now uses reCAPTCHA. That is the problem. Maybe Selenium with proxies, maybe a captcha solver. No good solutions so far, as far as I know.

levilevi10 commented 4 years ago

Getting the same problem, but I'm not running it in Docker. That doesn't matter anyway, given what you described. Any plans to resolve this issue? I guess other web scrapers are running into this as well. I desperately need automation for this apartment search.

Gettlar commented 3 years ago

Splash does not seem to work. Is there any other idea to solve the 405 issue? Come on, we live almost in 2021, there must be a solution...

rudobent commented 3 years ago

I did it with Selenium and geckodriver. At first I let Selenium scroll down a lot and do random stuff, so that Google's algorithm thinks the bot is a human. Unfortunately, sometimes (mostly at the beginning) Google will still make me solve a reCAPTCHA. Here you have two options.

Either you make the thing semi-supervised and solve the captcha yourself (which seems to be okay because it doesn't appear too often). In that case you just tell the bot to stop crawling until you have solved the captcha (with a tkinter button, for example).

The other option, because yes, we are nearly in 2021, is to get a convolutional network to solve the captcha. Then you would have an unsupervised crawler, which is anything but lightweight anymore. There are a lot of models here on GitHub, even a lot of pretrained ones, like fastseg or keras-segmentation.

Unfortunately I do not work with Immoscout scrapers anymore, so I don't have a script for the unsupervised crawler. The other one is pretty easy to write.
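For anyone who wants to try the semi-supervised variant, here is a minimal sketch of the "stop crawling until the captcha is solved" idea. It is not a script from this repo. The helper names (`looks_like_captcha`, `fetch_with_manual_captcha`) and the marker strings are assumptions made for illustration, and a console prompt stands in for the tkinter button. The driver is passed in, so any Selenium driver with `get()` and `page_source` should work.

```python
def looks_like_captcha(page_source):
    """Crude heuristic for 'this is a captcha page'.

    The marker strings are assumptions; inspect the real captcha page
    and adjust them, since a normal page may also mention 'captcha'.
    """
    markers = ("captcha", "ich bin kein roboter")
    text = page_source.lower()
    return any(marker in text for marker in markers)


def fetch_with_manual_captcha(driver, url, wait_for_human=input, max_attempts=3):
    """Load url and pause the crawl while a captcha page is shown.

    wait_for_human is called with a prompt and must block until a person
    has solved the captcha in the visible browser window. By default this
    is input(), i.e. "press Enter when done".
    """
    driver.get(url)
    attempts = 0
    while looks_like_captcha(driver.page_source):
        if attempts >= max_attempts:
            raise RuntimeError("still on a captcha page, giving up")
        wait_for_human("Captcha detected. Solve it in the browser, then press Enter: ")
        attempts += 1
    return driver.page_source


# Possible usage (needs selenium and geckodriver installed, not run here):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   html = fetch_with_manual_captcha(driver, "<your search URL>")
```

The browser has to run with a visible window for this to work, because a person needs to see and click the captcha.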

asmaier commented 3 years ago

@Gettlar Yes, we do live in 2021. Here is a possible solution:

https://incolumitas.com/2021/01/02/breaking-audio-recaptcha-with-googles-own-speech-to-text-api/

lsch0lz commented 1 year ago

@asmaier is there still no workaround/fix for this problem? Still getting the 405 error

flo-wolf commented 1 year ago

I don't think so. Immoscout24 now uses Geecaptcha. I get past the 405 with Selenium, but that doesn't get me any further, since I have to solve the captcha manually. I noticed, though, that captchas never appear in my personal browser when I am logged in, so one option would be to use Selenium and add your Immoscout24 cookies before accessing the site, as if you were already logged in. Also, making the user agent of the Selenium driver believable should be key.

In short, there might be a way, but I am still experimenting and haven't found a good solution yet that doesn't involve solving captchas manually.

Update: I managed to get it working. I solved a captcha once within an open Selenium browser and exported my cookies. Now, when I run the browser in headless mode, I can just navigate to Immoscout24, load the cookies, and then go to any sub-page (e.g. listings) without a captcha stopping me.
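A rough sketch of that export/load flow, in case it helps. The function names, the JSON file and the `base_url` parameter are assumptions for illustration, not the exact script used above. Only `get()`, `get_cookies()` and `add_cookie()` from Selenium's Python driver are used.

```python
import json
from pathlib import Path


def save_cookies(driver, path):
    """Write the cookies of the current browser session to a JSON file.

    Call this in the visible browser right after solving the captcha.
    """
    Path(path).write_text(json.dumps(driver.get_cookies()), encoding="utf-8")


def load_cookies(driver, path, base_url):
    """Restore saved cookies into a fresh (e.g. headless) session.

    Selenium only accepts cookies for the domain that is currently
    loaded, so the order matters: open the site, add the cookies,
    then reload so the site sees them.
    """
    driver.get(base_url)
    cookies = json.loads(Path(path).read_text(encoding="utf-8"))
    for cookie in cookies:
        driver.add_cookie(cookie)
    driver.get(base_url)
    return len(cookies)


# Possible usage (needs selenium and a browser driver, not run here):
#   save_cookies(visible_driver, "cookies.json")    # after the captcha
#   load_cookies(headless_driver, "cookies.json", "<Immoscout24 start page>")
#   headless_driver.get("<your search URL>")
```

Calling `add_cookie()` before the site's domain is loaded is a common reason for this approach to fail, which is why `load_cookies` opens `base_url` first. The cookies will also expire at some point, so the captcha has to be solved again from time to time.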

ginwodka commented 12 months ago

Do you have more insight into what you did? I tried to save the captcha cookies with Selenium and pickle, but no luck.