HtmlUnit / htmlunit

HtmlUnit is a "GUI-Less browser for Java programs".
https://www.htmlunit.org
Apache License 2.0

JS Browserdetection fail and redirect #368

Open toniritter opened 3 years ago

toniritter commented 3 years ago

Based on the "JavaScript execution exception" question on Stack Overflow.

HtmlUnit Version: 2.50.0

During the getPage call for the website flashscore.com, I get the following exceptions:

2021-07-07 08:46:05.408  WARN 4828 --- [nio-8080-exec-1] c.g.htmlunit.IncorrectnessListenerImpl   : Obsolete content type encountered: 'text/javascript'.
2021-07-07 08:46:05.564 ERROR 4828 --- [nio-8080-exec-1] c.g.h.j.DefaultJavaScriptErrorListener   : Error during JavaScript execution

com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot find function entries in object function Object() { [native code] }. (script in https://www.flashscore.com/unsupported/ from (31, 9) to (53, 10)#35)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:949) ~[htmlunit-2.50.0.jar:2.50.0]
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:598) ~[htmlunit-core-js-2.50.0.jar:na]
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:487) ~[htmlunit-core-js-2.50.0.jar:na]
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.callSecured(HtmlUnitContextFactory.java:353) ~[htmlunit-2.50.0.jar:2.50.0]
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:829) ~[htmlunit-2.50.0.jar:2.50.0]
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:805) ~[htmlunit-2.50.0.jar:2.50.0]
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:796) ~[htmlunit-2.50.0.jar:2.50.0]
    at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScript(HtmlPage.java:942) ~[htmlunit-2.50.0.jar:2.50.0]
    at com.gargoylesoftware.htmlunit.html.ScriptElementSupport.executeInlineScriptIfNeeded(ScriptElementSupport.java:378) ~[htmlunit-2.50.0.jar:2.50.0]

I've tried two different implementations and the problem still occurs.

@PostMapping("/startScraping")
    public ResponseEntity<FlashScraper> startScraping(@NonNull @RequestBody FlashScraper flashScraper) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        logger.info("startScraping request incoming");
        logger.info("Call URL: " + flashScraper.getScrapeUrl());

        String url = "https://flashScore.com";

        try (final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(3_000);

            System.out.println();
            System.out.println();
            System.out.println("----------------");
            System.out.println(page.asNormalizedText());
            System.out.println("----------------");
        }

        return new ResponseEntity<>(flashScraper, HttpStatus.OK);
    }

@PostMapping("/startScraping")
    public ResponseEntity<FlashScraper> startScraping(@NonNull @RequestBody FlashScraper flashScraper) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        logger.info("startScraping request incoming");
        logger.info("Call URL: " + flashScraper.getScrapeUrl());

        final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        // Note: called before any page is loaded, so there are no background jobs to wait for yet.
        webClient.waitForBackgroundJavaScriptStartingBefore(1000);

        HtmlPage scrapePage = webClient.getPage(flashScraper.getScrapeUrl());
        webClient.waitForBackgroundJavaScript(3000);

        System.out.println(scrapePage.getByXPath("//*[@id=\"g_25_rwPxTVj1\"]"));

        return new ResponseEntity<>(flashScraper, HttpStatus.OK);
    }

toniritter commented 3 years ago

After switching the dependency to version 2.51.0, the exception is no longer thrown, but I still end up on the "Unsupported" page https://flashscore.com/unsupported/

rbri commented 3 years ago

The browser detection is done by the code at https://www.flashscore.com/x/js/browsercompatibility_4.js:

// !!! for update iterate manually `browser_compatibility_serial`
"use strict";
try {
    (function () {
        var cssRequirements = [["display", "flex"], ["display", "grid"], ["color", "red"]];
        for (var i in cssRequirements) {
            if (!CSS.supports(cssRequirements[i][0], cssRequirements[i][1])) {
                throw "no-" + cssRequirements[i][0] + "-" + cssRequirements[i][1];
            }
        }
        try {
            new XMLHttpRequest();
        }
        catch (pass) {
            throw "no-ajax";
        }
        try {
            eval("var foo = (x)=>x+1");
        }
        catch (pass) {
            throw "no-es6";
        }
        try {
            eval("var foo = {}; var bar = {...foo};")
        }
        catch (pass) {
            throw "no-spread";
        }
    })();
}
catch (e) {
    var utm = "";
    if (typeof e == "string" && /^[a-z0-9\-]+$/.test(e)) {
        utm = "?err=" + e;
    }
    window.location.replace("/unsupported/" + utm);
}

For the moment I can fix CSS.supports(), but because Rhino does not (yet) support the spread syntax (https://github.com/mozilla/rhino/issues/968), the check will still fail.
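
To make the control flow of the detection script easier to follow, here is a minimal Java model of it. This is only a sketch: the class `BrowserCheckModel`, the method `redirectFor`, and the idea of passing in the set of failing checks are assumptions made for illustration and are not part of HtmlUnit or of the site's code. The error codes and their order are taken from the script above.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical model of browsercompatibility_4.js; not HtmlUnit API.
final class BrowserCheckModel {

    // Error codes in the order the script performs its checks.
    private static final List<String> CHECKS = List.of(
            "no-display-flex", "no-display-grid", "no-color-red",
            "no-ajax", "no-es6", "no-spread");

    // Same pattern the script uses before appending the code to the URL.
    private static final Pattern SAFE = Pattern.compile("^[a-z0-9\\-]+$");

    /** Returns the redirect path for the first failing check, or empty if all checks pass. */
    static Optional<String> redirectFor(Set<String> failingChecks) {
        for (String code : CHECKS) {
            if (failingChecks.contains(code)) {
                String utm = SAFE.matcher(code).matches() ? "?err=" + code : "";
                return Optional.of("/unsupported/" + utm);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Only the spread check fails, as expected once CSS.supports() works.
        System.out.println(redirectFor(Set.of("no-spread")));
        // Every check passes: no redirect.
        System.out.println(redirectFor(Set.of()));
    }
}
```

The model only covers the string errors thrown by the script itself. If some other exception is raised inside the outer try block (anything that is not a string), the script leaves `utm` empty and redirects to the bare `/unsupported/` path.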

The only option you have is to 'patch' the script and replace it or comment out some parts (see https://htmlunit.sourceforge.io/faq.html#HowToModifyRequestOrResponse). At least it is worth a try.
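
One way such a patch could look is sketched below. It covers only the text transformation: the class `CompatibilityScriptPatcher` and the method `disableRedirect` are hypothetical names, and the approach (replacing the literal redirect statement with a comment) is an assumption, not something confirmed to make the page work. The patched text would still have to be handed back to HtmlUnit through a response wrapper as described in the FAQ linked above.

```java
// Hypothetical helper; plain string handling, no HtmlUnit classes involved.
final class CompatibilityScriptPatcher {

    // The redirect statement exactly as it appears in the script quoted above.
    private static final String REDIRECT =
            "window.location.replace(\"/unsupported/\" + utm);";

    /** Replaces the redirect statement with a comment and leaves everything else untouched. */
    static String disableRedirect(String script) {
        return script.replace(REDIRECT, "/* redirect disabled for HtmlUnit */");
    }

    public static void main(String[] args) {
        String original = "catch (e) { var utm = \"\"; " + REDIRECT + " }";
        System.out.println(disableRedirect(original));
    }
}
```

A literal match like this is brittle: if the site changes the statement even slightly, the script passes through unpatched, so it is worth checking that the returned text actually differs from the input.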

rbri commented 3 years ago

I have fixed CSS.supports() and will make a new snapshot available soon (check Twitter for updates).

toniritter commented 3 years ago

I did as suggested and tried to modify the response, but I now get the following exception (still on version 2.51.0):

2021-07-12 19:23:13.844 ERROR 2820 --- [nio-8080-exec-2] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is com.gargoylesoftware.htmlunit.ScriptException: syntax error (https://www.flashscore.com/x/js/browsercompatibility_4.js#1)] with root cause

net.sourceforge.htmlunit.corejs.javascript.EvaluatorException: syntax error (https://www.flashscore.com/x/js/browsercompatibility_4.js#1)
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory$HtmlUnitErrorReporter.error(HtmlUnitContextFactory.java:436) ~[htmlunit-2.51.0.jar:2.51.0]
    at net.sourceforge.htmlunit.corejs.javascript.Parser.addError(Parser.java:251) ~[htmlunit-core-js-2.51.0.jar:na]

rbri commented 3 years ago

It looks like there is a syntax error in your replacement script. Maybe you can replace it with an empty one?

toniritter commented 3 years ago

Hey rbri, in the meantime I've tried it with the following code, but it still fails:

    public void startScraper() throws FailingHttpStatusCodeException, MalformedURLException, IOException {

        String url = "https://www.flashscore.com/basketball/";

        try (final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED)) {

            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setUseInsecureSSL(true);
            webClient.getOptions().setCssEnabled(true);
            webClient.getOptions().setJavaScriptEnabled(true);
            // Note: called before any page is loaded, so there are no background jobs to wait for yet.
            webClient.waitForBackgroundJavaScriptStartingBefore(1000);

            new WebConnectionWrapper(webClient) {

                @Override
                public WebResponse getResponse(WebRequest request) throws IOException {
                    WebResponse response = super.getResponse(request);
                    if (request.getUrl().toExternalForm().contains("browsercompatibility")) {
                        String content = "";
                        // intercept and/or change content

                        WebResponseData data = new WebResponseData(content.getBytes(),response.getStatusCode(), response.getStatusMessage(), response.getResponseHeaders());
                        response = new WebResponse(data, request, response.getLoadTime());
                    }
                    return response;
                }
            };

            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(3_000);

            System.out.println();
            System.out.println();
            System.out.println("----------------");
            System.out.println(page.asNormalizedText());
            System.out.println("----------------");
        }

    }

2021-07-16 15:22:45.844  WARN 1524 --- [           main] c.g.htmlunit.DefaultCssErrorHandler      : CSS error: 'https://www.flashscore.com/res/_fs/build/livetableresponsive.c7059bf.css' [1:8910] Error in pseudo class or element. (Invalid token ".". Was expecting one of: <S>, <NUMBER>, <IDENT>, <STRING>, "-", <PLUS>, <DIMENSION>.)
2021-07-16 15:22:45.844  WARN 1524 --- [           main] c.g.htmlunit.DefaultCssErrorHandler      : CSS warning: 'https://www.flashscore.com/res/_fs/build/livetableresponsive.c7059bf.css' [1:8910] Ignoring the whole rule.
2021-07-16 15:22:46.305  WARN 1524 --- [           main] c.g.htmlunit.IncorrectnessListenerImpl   : Obsolete content type encountered: 'text/javascript'.
2021-07-16 15:22:46.487 ERROR 1524 --- [           main] c.g.h.j.DefaultJavaScriptErrorListener   : Error during JavaScript execution

com.gargoylesoftware.htmlunit.ScriptException: invalid property id (https://www.flashscore.com/res/_fs/build/loader.5714507.js#1)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:954) ~[htmlunit-2.51.0.jar:2.51.0]
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:580) ~[htmlunit-core-js-2.51.0.jar:na]
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:481) ~[htmlunit-core-js-2.51.0.jar:na]
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.callSecured(HtmlUnitContextFactory.java:352) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.compile(JavaScriptEngine.java:785) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.compile(JavaScriptEngine.java:751) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.compile(JavaScriptEngine.java:112) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadJavaScriptFromUrl(HtmlPage.java:1122) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1002) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.html.ScriptElementSupport.executeScriptIfNeeded(ScriptElementSupport.java:196) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.html.ScriptElementSupport$1.execute(ScriptElementSupport.java:120) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.html.ScriptElementSupport.onAllChildrenAddedToPage(ScriptElementSupport.java:143) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:191) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.endElement(HtmlUnitNekoDOMBuilder.java:551) ~[htmlunit-2.51.0.jar:2.51.0]
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source) ~[xercesImpl-2.12.0.jar:na]
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.endElement(HtmlUnitNekoDOMBuilder.java:503) ~[htmlunit-2.51.0.jar:2.51.0]
    at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1216) ~[neko-htmlunit-2.51.0.jar:2.51.0]
    at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1156) ~[neko-htmlunit-2.51.0.jar:2.51.0]
    at net.sourceforge.htmlunit.cyberneko.filters.DefaultFilter.endElement(DefaultFilter.java:219) ~[neko-htmlunit-2.51.0.jar:2.51.0]
    at net.sourceforge.htmlunit.cyberneko.filters.NamespaceBinder.endElement(NamespaceBinder.java:312) ~[neko-htmlunit-2.51.0.jar:2.51.0]
    at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3189) ~[neko-htmlunit-2.51.0.jar:2.51.0]
    at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2114) ~[neko-htmlunit-2.51.0.jar:2.51.0]
    at net.sourceforge.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:937) ~[neko-htmlunit-2.51.0.jar:2.51.0]
    at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:443) ~[neko-htmlunit-2.51.0.jar:2.51.0]
    at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:394) ~[neko-htmlunit-2.51.0.jar:2.51.0]
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source) ~[xercesImpl-2.12.0.jar:na]
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:751) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:208) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:297) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:217) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:684) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:586) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:501) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:413) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:548) ~[htmlunit-2.51.0.jar:2.51.0]
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:529) ~[htmlunit-2.51.0.jar:2.51.0]
Caused by: net.sourceforge.htmlunit.corejs.javascript.EvaluatorException: invalid property id (https://www.flashscore.com/res/_fs/build/loader.5714507.js#1)

rbri commented 3 years ago

This looks like a different error this time:

invalid property id (https://www.flashscore.com/res/_fs/build/loader.5714507.js#1)

This JS file is one huge minified script. Among other things, it uses syntax that is not yet supported, for example:

function(...e){let t=this._configData;

I fear you have to wait until this is fixed in Rhino.
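
Until then, a rough check like the following might help to spot which downloaded scripts contain rest parameters or spread syntax before they reach the JavaScript engine. It is a heuristic sketch only: `RestSpreadScanner` and `usesRestOrSpread` are hypothetical names, and the regular expression is an assumption that can report false positives (for example when three dots appear inside a string literal).

```java
import java.util.regex.Pattern;

// Hypothetical diagnostic helper; a text heuristic, not a JavaScript parser.
final class RestSpreadScanner {

    // Three dots followed by the start of an identifier, '[' or '{',
    // as in function(...e) or {...foo}.
    private static final Pattern REST_OR_SPREAD =
            Pattern.compile("\\.\\.\\.\\s*[A-Za-z_$\\[{]");

    /** Returns true if the source text appears to use rest parameters or spread syntax. */
    static boolean usesRestOrSpread(String script) {
        return REST_OR_SPREAD.matcher(script).find();
    }

    public static void main(String[] args) {
        System.out.println(usesRestOrSpread("function(...e){let t=this._configData;"));
        System.out.println(usesRestOrSpread("var a = b.c;"));
    }
}
```

A check of this kind could be called from a response wrapper to log the URLs of affected scripts. It does not make them runnable.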

rbri commented 6 months ago

see #755