David-Carrasco / Scrapy-Idealista

Scraping data from the real estate site www.idealista.com
GNU General Public License v2.0

Issue on default scrape execution #11

Open JuanCarlosCamara opened 3 years ago

JuanCarlosCamara commented 3 years ago

Hello,

I'm trying to run the default scrape for Carabanchel and I'm getting errors about dead proxies.

I have tried increasing the DOWNLOAD_TIMEOUT parameter in settings.py from 10 to 20 seconds, as suggested in other issues, and it still returns the same error.
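For reference, this is the change I made (DOWNLOAD_TIMEOUT is a standard Scrapy setting; the values are the ones mentioned above):

```python
# settings.py
DOWNLOAD_TIMEOUT = 20  # raised from the project's original value of 10 seconds
```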

Do you have any idea what could be happening? I don't know whether Idealista has added some kind of check or security measure to prevent scraping since October.

Thanks a lot for your help. Best regards, Juan Carlos Cámara

VictorUceda commented 3 years ago

same error today

abidhsn commented 3 years ago

Facing the exact same issue...the proxies are dead

somefreestring commented 3 years ago

Yes, all the proxies are dead; removing the middleware seems to get you banned within minutes.

ewoks commented 3 years ago

@David-Carrasco is this expected or a temporary issue? Are we doing something wrong? Thanks

foodaka commented 3 years ago

Any idea on how to fix this?

David-Carrasco commented 3 years ago

I've tested the scraper again and, as @JuanCarlosCamara said, it seems Idealista has added some check to block requests from all the proxies that come from https://free-proxy-list.net/

The idea is to provide another list of proxies to be used by the scraper declared in:

https://github.com/David-Carrasco/Scrapy-Idealista/blob/afc34add09d6f9b191bcc23a033c89ab4b1b46b8/idealista/settings.py#L62

It should work: I've tested the scraper with my own IP and it works, so it's a proxies issue.
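As a minimal sketch of what swapping in another proxy list could look like, assuming a scrapy-proxies-style middleware that reads proxies from a local file (the actual setting names and middleware used in this repo may differ):

```python
# settings.py -- hypothetical sketch, not the repo's exact configuration.
PROXY_LIST = '/path/to/my_proxies.txt'  # one proxy per line, e.g. http://1.2.3.4:8080
PROXY_MODE = 0                          # 0 = pick a random proxy for every request

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
```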

agusriscos commented 3 years ago

I am still stuck with the same problem. Has anyone got a valid proxy list?

IsmaPons commented 3 years ago

hello, what about this?

https://github.com/clarketm/proxy-list/blob/master/proxy-list.txt

We'd still need to modify the code though
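For illustration, a hypothetical helper (not part of this repo) that downloads that list and rewrites it into the one-proxy-per-line format a file-based proxy middleware expects; the raw URL is assumed to be the raw counterpart of the linked file:

```python
# fetch_proxies.py -- hypothetical helper script.
import requests

URL = "https://raw.githubusercontent.com/clarketm/proxy-list/master/proxy-list.txt"

def fetch_proxy_file(out_path="proxies.txt"):
    lines = requests.get(URL, timeout=10).text.splitlines()
    proxies = [
        "http://" + line.split()[0]    # keep only the ip:port column
        for line in lines
        if line and line[0].isdigit()  # skip the header/comment lines
    ]
    with open(out_path, "w") as f:
        f.write("\n".join(proxies))
    return len(proxies)

if __name__ == "__main__":
    print(f"Saved {fetch_proxy_file()} proxies")
```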

mikemajara commented 3 years ago

@IsmaPons that list doesn't seem to be working either. I'm curious what result you got on your first attempt. I'm getting DEAD proxies all over the place; testing this one out right now.

mikemajara commented 3 years ago

Apparently the problem is not the proxies themselves (several other lists that are healthy don't work either). The issue seems to be a 403 returned when the bot is detected 🤖. The proxy middleware keeps retrying with no success, but it's really the HTTP 403 that's blocking you, even when all settings and variables are correctly set; Google Analytics apparently has an anti-bot feature 🧐.

Long story short, try the approaches in these useful links: SO answer, Splash, User-Agent.
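As a rough sketch of the User-Agent angle using standard Scrapy settings (not the repo's actual configuration), something along these lines can make requests look less bot-like:

```python
# settings.py additions -- hedged sketch of the anti-403 tweaks mentioned above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
)

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "es-ES,es;q=0.9,en;q=0.8",
}

# Slow down and randomize timing so the traffic pattern looks less automated.
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
AUTOTHROTTLE_ENABLED = True
```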

davidrdguezarias commented 2 years ago

@mikemajara @David-Carrasco did you test the solution proposed on 7th November? Is it working?

mikemajara commented 2 years ago

I did, to no avail. It did work once, but it's hard to get around it.