berstend / puppeteer-extra

💯 Teach puppeteer new tricks through plugins.
https://extra.community
MIT License

[Bug] hCaptcha detect fails due to new hCaptcha URL format #881

Open vetinary opened 3 months ago

vetinary commented 3 months ago

The plugin works great with reCAPTCHA, but it throws an error on pages with hCaptcha.

After some investigation, I traced it to the following problem:

There is a block of code inside the plugin, where hCaptcha parameters are extracted:

_extractInfoFromIframes(iframes) {
        return iframes
            .map(el => el.src.replace('.html#', '.html?'))
            .map(url => {
            const { searchParams } = new URL(url);
            const result = {
                _vendor: 'hcaptcha',
                url: document.location.href,
                id: searchParams.get('id'),
                sitekey: searchParams.get('sitekey'),
                display: {
                    size: searchParams.get('size') || 'normal'
                }
            };
            return result;
        });
    }

The hCaptcha iframe URL has the following format:

https://newassets.hcaptcha.com/captcha/v1/c44fc00/static/hcaptcha.html?_v=h8ew9h1l07#frame=challenge&id=0t7tnh8gx2un&host=mysite.com&sentry=undefined&reportapi=https%3A%2F%2Faccounts.hcaptcha.com&recaptchacompat=true&custom=false&tplinks=on&pstissuer=https%3A%2F%2Fpst-issuer.hcaptcha.com&sitekey=cf0b9a27-82e3-42fb-bfec-562f8045e495&size=invisible&theme=light&origin=https%3A%2F%2Fmysite.com

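Since the URL contains no .html# substring (html is followed by ?_v=…), the replace call leaves it unmodified. The id, sitekey and size parameters therefore stay in the fragment (the part after #), and searchParams, which only reads the query string, returns null for them.

As a result, I get this message in the logs: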
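A minimal way to reproduce this outside the plugin, assuming Node.js or any browser console. It uses a shortened copy of the iframe URL above (several parameters are dropped for readability) and applies the same replace call as _extractInfoFromIframes:

```javascript
// Shortened copy of the iframe URL from this issue (some parameters omitted).
const src =
  'https://newassets.hcaptcha.com/captcha/v1/c44fc00/static/hcaptcha.html' +
  '?_v=h8ew9h1l07' +
  '#frame=challenge&id=0t7tnh8gx2un' +
  '&sitekey=cf0b9a27-82e3-42fb-bfec-562f8045e495&size=invisible';

// Same transformation the plugin applies. It is a no-op here,
// because '.html' is followed by '?', not '#'.
const rewritten = src.replace('.html#', '.html?');
const { searchParams, hash } = new URL(rewritten);

const unchanged = rewritten === src;           // true: nothing was replaced
const id = searchParams.get('id');             // null: 'id' is in the fragment
const sitekey = searchParams.get('sitekey');   // null: same reason
const version = searchParams.get('_v');        // the only real query parameter

console.log({ unchanged, id, sitekey, version, hash });
```

The missing id and sitekey are presumably what leads to the "Missing data in captcha" error further down the pipeline.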

PuppeteerExtraPluginRecaptcha: An error occured during "getRecaptchaSolutions": {
  _vendor: 'hcaptcha',
  provider: '2captcha',
  error: 'Error: Missing data in captcha'
}

I think a quick workaround could be something like this: if the iframe URL contains .html?, replace the '#' with '&', so the fragment parameters become ordinary GET parameters alongside _v; otherwise, replace .html# with .html? as before.
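A rough sketch of that workaround, not a tested patch for the plugin. The names normalizeHcaptchaSrc and extractInfo are hypothetical helpers made up for this example, and the old-format URL is a made-up placeholder; only the new-format URL values come from this issue:

```javascript
// Hypothetical helper: move the fragment parameters into the query string.
function normalizeHcaptchaSrc(src) {
  if (src.includes('.html?')) {
    // New format: a query string already exists, so append the fragment to it.
    // A string pattern replaces only the first '#', which is the fragment start.
    return src.replace('#', '&');
  }
  // Old format: the fragment directly follows '.html'.
  return src.replace('.html#', '.html?');
}

// Hypothetical helper mirroring the fields read in _extractInfoFromIframes.
function extractInfo(src) {
  const { searchParams } = new URL(normalizeHcaptchaSrc(src));
  return {
    id: searchParams.get('id'),
    sitekey: searchParams.get('sitekey'),
    size: searchParams.get('size') || 'normal'
  };
}

// New format (shortened copy of the URL from this issue).
const newInfo = extractInfo(
  'https://newassets.hcaptcha.com/captcha/v1/c44fc00/static/hcaptcha.html' +
  '?_v=h8ew9h1l07#frame=challenge&id=0t7tnh8gx2un' +
  '&sitekey=cf0b9a27-82e3-42fb-bfec-562f8045e495&size=invisible'
);

// Old format (placeholder host and values, for illustration only).
const oldInfo = extractInfo(
  'https://example.invalid/static/hcaptcha.html#id=abc123&sitekey=key-1'
);

console.log(newInfo, oldInfo);
```

An alternative that avoids string surgery would be to parse the fragment directly with new URLSearchParams(new URL(src).hash.slice(1)), which should handle both formats as long as the parameters always live in the fragment.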