flathunters / flathunter

A bot to help people with their rental real-estate search. 🏠🤖
GNU Affero General Public License v3.0

Captcha failing repeatedly after successful first run #153

Closed namnoops closed 2 years ago

namnoops commented 2 years ago

I recently started using flathunter with 2Captcha again. I'm running it on a 1-hour cycle, and the first run usually works perfectly: the captcha takes 15-20 seconds to resolve and the script continues. Mostly on the second attempt (but not always), the captcha does not get solved, which results in what looks like multiple retries. When I check my 2Captcha account there are dozens of charges, along with the corresponding refunds.

Below is an example log from my last run. You can see that at 19:13 and at 20:14 everything worked flawlessly, but then on the next cycle (for some reason about half an hour late) - the captcha fails and is retried for about 50 minutes until the script fails.

I contacted the 2Captcha support in hope that they could reveal if the problem is on their side. This was their response:

Have you ever received a GeeTest token from our API? I suppose you are taking the challenge value of a rendered GeeTest widget. In that case you will always get ERROR_CAPTCHA_UNSOLVABLE, as a GeeTest widget cannot be rendered twice with the same challenge. https://github.com/flathunters/flathunter/blob/main/flathunter/abstract_crawler.py#L164 I'm not good at Python and Selenium, but I'm pretty sure that driver.page_source returns the source of the page already rendered in a browser. You can simply make a GET request with the requests library, parse the page source and find the challenge value that can be used to solve the GeeTest. So, just make sure you use a challenge value that was never used to render a GeeTest widget.
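As a rough illustration of the parsing step that the support reply describes, here is a minimal sketch that pulls a `gt` and `challenge` value out of raw page source. The regular expressions and the assumption that both values appear as 32-character hex strings are hypothetical; the real page markup would need to be inspected, and this is not the code in abstract_crawler.py.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns: they assume the page embeds something like
#   gt: "<32 hex chars>", challenge: "<32 hex chars>"
# The actual markup may look different and must be checked by hand.
GT_PATTERN = re.compile(r'gt["\']?\s*[:=]\s*["\']([0-9a-f]{32})["\']')
CHALLENGE_PATTERN = re.compile(r'challenge["\']?\s*[:=]\s*["\']([0-9a-f]{32,34})["\']')


def extract_geetest_params(html: str) -> Optional[Tuple[str, str]]:
    """Return (gt, challenge) found in the raw HTML, or None if either is missing."""
    gt_match = GT_PATTERN.search(html)
    challenge_match = CHALLENGE_PATTERN.search(html)
    if gt_match is None or challenge_match is None:
        return None
    return gt_match.group(1), challenge_match.group(1)
```

The `html` argument would be the body of a plain GET response (for example `requests.get(url).text`) rather than `driver.page_source`, so that the challenge has not yet been consumed by a rendered widget. Whether the site serves the captcha page to a plain GET at all is untested here.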

2022-02-08 flathunter.txt

Any help appreciated!

codders commented 2 years ago

Another user reported this as well and now says it is fixed for them (#158). Are you still having the problem?

markuswestphal commented 2 years ago

"Another user" here 🙋🏼‍♂️: the issue has been present again for the last 4 days. Other users now report the same issue, as in https://github.com/flathunters/flathunter/issues/160

markuswestphal commented 2 years ago

I tinkered with what the 2Captcha support suggested and tried to do simple GET requests instead of accessing driver.page_source. In the process I learnt a lot about the abstract_crawler.py code. Though I now understand that it won't be that easy, I think there might be something to what they suggest. Do you think this could be something, @codders?

codders commented 2 years ago

Sounds interesting. Can you be more concrete about what they suggest and what you've tried and any results you had?

Thanks!

markuswestphal commented 2 years ago

I will investigate a little further and post my thoughts and findings. In the meantime I checked and found that 2Captcha changed their API to incorporate Geetest V4 support on March 24, 2022. https://2captcha.com/de/2captcha-api#recent_changes I just checked my search history and found that this is exactly the date that the Immoscout crawler stopped working for me. This might be a hot clue. The old Geetest request seems to be constructed as before but the timing here is really odd.

iwasherefirst2 commented 2 years ago

Could it be that immobilienscout24 switched to geetest_4? In that case we would need to change the method name in https://github.com/flathunters/flathunter/blob/a3c948c762e2610a87a8c834a1c72beda5817dd3/flathunter/abstract_crawler.py#L169 to

f"http://2captcha.com/in.php?key={api_key}&method=geetest_4&gt={gt}&challenge={challenge}&api_server=api.geetest.com&pageurl={urllib.parse.quote_plus(driver.current_url)}"
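To make the proposed change easier to try, the request URL above can be built with the method name as a parameter. This is only a sketch: `build_geetest_submit_url` is a hypothetical helper, not a function in flathunter, and the default `method="geetest"` is an assumption about what the existing request uses.

```python
from urllib.parse import urlencode


def build_geetest_submit_url(api_key: str, gt: str, challenge: str, page_url: str,
                             method: str = "geetest",
                             api_server: str = "api.geetest.com") -> str:
    """Build the 2Captcha in.php submit URL with properly escaped query parameters."""
    params = {
        "key": api_key,
        "method": method,  # pass "geetest_4" to try the variant suggested above
        "gt": gt,
        "challenge": challenge,
        "api_server": api_server,
        "pageurl": page_url,  # urlencode escapes this the same way quote_plus does
    }
    return "http://2captcha.com/in.php?" + urlencode(params)
```

Using `urlencode` escapes every parameter, not only `pageurl`, so an unexpected character in any value cannot corrupt the query string.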

However, I can't test it at the moment, as I either get session timeouts (https://github.com/flathunters/flathunter/issues/145) or a new RemoteDisconnected error: "urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))". It really looks like setting up Python with Selenium on my VPS is just not working for me.

markuswestphal commented 2 years ago

IM24 did not switch to geetest_4, I checked.

iwasherefirst2 commented 2 years ago

The problem is on 2Captcha's side. Sometimes they can't handle the server load, and then they simply can't solve the GeeTests. We should think about setting a time limit for GeeTest attempts and definitely handle 500-level responses from their API. I would suggest closing this issue and continuing the discussion at https://github.com/flathunters/flathunter/issues/162
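One possible shape for such a time limit is sketched below. All names here (`solve_with_deadline`, `SolverUnavailable`, `CaptchaTimeout`) are hypothetical and not part of flathunter or the 2Captcha API. The sketch assumes the caller supplies an `attempt` function that returns a token, returns `None` when the captcha was reported unsolvable, or raises `SolverUnavailable` when the API answers with a server error.

```python
import time
from typing import Callable, Optional


class SolverUnavailable(Exception):
    """Hypothetical: raised by the caller's attempt function on a 500-level response."""


class CaptchaTimeout(Exception):
    """Raised when no token was obtained within the time limit."""


def solve_with_deadline(attempt: Callable[[], Optional[str]],
                        max_seconds: float,
                        pause_seconds: float = 5.0,
                        sleep: Callable[[float], None] = time.sleep) -> str:
    """Call attempt() until it returns a token or the time limit is reached."""
    deadline = time.monotonic() + max_seconds
    attempts = 0
    while True:
        attempts += 1
        try:
            token = attempt()
        except SolverUnavailable:
            token = None  # a server error counts as a failed attempt, not a crash
        if token is not None:
            return token
        # Stop if another pause would carry us past the deadline.
        if time.monotonic() + pause_seconds > deadline:
            raise CaptchaTimeout(f"no captcha token after {attempts} attempt(s)")
        sleep(pause_seconds)
```

With a limit well below the crawl interval, a bad period on the solver side would cost a handful of attempts per cycle instead of the roughly 50 minutes of retries described at the top of this issue.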

alexanderroidl commented 2 years ago

The problem is on 2Captcha's side. Sometimes they can't handle the server load, and then they simply can't solve the GeeTests. We should think about setting a time limit for GeeTest attempts and definitely handle 500-level responses from their API. I would suggest closing this issue and continuing the discussion at #162