HtmlUnit / htmlunit

HtmlUnit is a "GUI-Less browser for Java programs".
https://www.htmlunit.org
Apache License 2.0

HTML content doesn't load to show <div data-module-*> on scraping Udemy website. #577

Open abhayjohri23 opened 1 year ago

abhayjohri23 commented 1 year ago

I am trying to scrape content (for example, a course's thumbnail picture, price, etc.) from the educational website Udemy, using a general search URL (given in the code snippet). The page's source has a div with the class name "ud-app-loader ud-component--search--search", and the courses shown on screen sit in sub-divs with class="popper-module--popper--2BpLn".


Code used to get the HTML content from the website:

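import org.htmlunit.BrowserVersion;
import org.htmlunit.NicelyResynchronizingAjaxController;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

public static void getData(String courseName, String sortType) throws Exception {
    String url = "https://www.udemy.com/courses/search/?lang=en&price=price-paid&q=" + courseName +
            "&ratings=4.5&sort=relevance&sort=" + sortType + "&src=ukw";

    // try-with-resources closes the client (and its background JavaScript threads) when done
    try (WebClient client = new WebClient(BrowserVersion.FIREFOX)) {
        client.getOptions().setJavaScriptEnabled(true);
        client.getOptions().setCssEnabled(true);
        // Do not abort the page load when a script throws
        client.getOptions().setThrowExceptionOnScriptError(false);
        client.setAjaxController(new NicelyResynchronizingAjaxController());

        HtmlPage page = client.getPage(url);
        // Wait up to 500,000 ms for background JavaScript to finish
        client.waitForBackgroundJavaScript(500000);
        System.out.println(page.asXml());
    }
}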

With the above code, the page's JavaScript does not run properly, so the additional markup that is visible in the browser's Inspect panel (but not in the page source) never appears in the output.
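Separately from the script errors, the search URL in the snippet is built by plain string concatenation, so a course name containing spaces or `&` would produce a malformed query. A minimal sketch of encoding the user-supplied parts with the JDK's `URLEncoder`; the class and method names (`SearchUrlBuilder`, `buildSearchUrl`) are hypothetical, and dropping the duplicated `sort=relevance` parameter is an assumption, not something the site's behaviour has been checked against.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlBuilder {
    // Hypothetical helper: builds the search URL from the snippet,
    // percent-encoding the two user-supplied values.
    public static String buildSearchUrl(String courseName, String sortType)
            throws UnsupportedEncodingException {
        // The two-argument String form is used because it is available on Java 8.
        String q = URLEncoder.encode(courseName, "UTF-8");
        String sort = URLEncoder.encode(sortType, "UTF-8");
        // Assumption: a single sort parameter is enough.
        return "https://www.udemy.com/courses/search/?lang=en&price=price-paid&q=" + q
                + "&ratings=4.5&sort=" + sort + "&src=ukw";
    }

    public static void main(String[] args) throws Exception {
        // URLEncoder encodes a space as '+', which is valid in a query string.
        System.out.println(buildSearchUrl("python for beginners", "relevance"));
    }
}
```

The returned string can be passed to client.getPage(...) in place of the concatenated URL.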

I am also getting many EvaluatorException exceptions at various places. One such exception looks like this:

======= EXCEPTION START ========
Exception class=[org.htmlunit.corejs.javascript.EvaluatorException]
org.htmlunit.ScriptException: An invalid or illegal selector was specified (selector: '[data-css-toggle-id' error: Invalid selectors: [data-css-toggle-id). (script in https://www.udemy.com/courses/search/?lang=en&price=price-paid&q=python&ratings=4.5&sort=relevance&sort=relevance&src=ukw from (557, 62) to (582, 10)#577)
    at org.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:989)
    at org.htmlunit.corejs.javascript.Context.call(Context.java:590)
    at org.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:484)
    at org.htmlunit.javascript.HtmlUnitContextFactory.callSecured(HtmlUnitContextFactory.java:349)
    at org.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:867)
    at org.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:843)
    at org.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:834)
    at org.htmlunit.html.HtmlPage.executeJavaScript(HtmlPage.java:966)
    at org.htmlunit.html.ScriptElementSupport.executeInlineScriptIfNeeded(ScriptElementSupport.java:380)
    at org.htmlunit.html.ScriptElementSupport.executeScriptIfNeeded(ScriptElementSupport.java:230)
    at org.htmlunit.html.ScriptElementSupport$1.execute(ScriptElementSupport.java:120)
    at org.htmlunit.html.ScriptElementSupport.onAllChildrenAddedToPage(ScriptElementSupport.java:143)
    at org.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:191)
    at org.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.endElement(HtmlUnitNekoDOMBuilder.java:601)
    at org.htmlunit.cyberneko.xerces.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:412)
    at org.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.endElement(HtmlUnitNekoDOMBuilder.java:548)
    at org.htmlunit.cyberneko.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1273)
    at org.htmlunit.cyberneko.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1200)
    at org.htmlunit.cyberneko.filters.DefaultFilter.endElement(DefaultFilter.java:204)
    at org.htmlunit.cyberneko.filters.NamespaceBinder.endElement(NamespaceBinder.java:274)
    at org.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:2969)
    at org.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1953)
    at org.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:834)
    at org.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:346)
    at org.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:297)
    at org.htmlunit.cyberneko.xerces.parsers.XMLParser.parse(XMLParser.java:76)
    at org.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:838)
    at org.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:203)
    at org.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:300)
    at org.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:220)
    at org.htmlunit.WebClient.loadWebResponseInto(WebClient.java:672)
    at org.htmlunit.WebClient.loadWebResponseInto(WebClient.java:574)
    at org.htmlunit.WebClient.getPage(WebClient.java:492)
    at org.htmlunit.WebClient.getPage(WebClient.java:399)
    at org.htmlunit.WebClient.getPage(WebClient.java:537)
    at org.htmlunit.WebClient.getPage(WebClient.java:519)
    at org.example.Scraper.getData(Scraper.java:20)
    at org.example.App.main(App.java:16)
Caused by: org.htmlunit.corejs.javascript.EvaluatorException: An invalid or illegal selector was specified (selector: '[data-css-toggle-id' error: Invalid selectors: [data-css-toggle-id). (script in https://www.udemy.com/courses/search/?lang=en&price=price-paid&q=python&ratings=4.5&sort=relevance&sort=relevance&src=ukw from (557, 62) to (582, 10)#577)
    at org.htmlunit.javascript.HtmlUnitContextFactory$HtmlUnitErrorReporter.runtimeError(HtmlUnitContextFactory.java:454)
    at org.htmlunit.corejs.javascript.Context.reportRuntimeError(Context.java:986)
    at org.htmlunit.corejs.javascript.Context.reportRuntimeError(Context.java:1042)
    at org.htmlunit.javascript.host.dom.Document.querySelectorAll(Document.java:1044)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:222)
    at org.htmlunit.corejs.javascript.FunctionObject.call(FunctionObject.java:423)
    at org.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1874)
    at org.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:1051)
    at org.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:89)
    at org.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:392)
    at org.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:335)
    at org.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3914)
    at org.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:102)
    at org.htmlunit.javascript.JavaScriptEngine$2.doRun(JavaScriptEngine.java:858)
    at org.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:972)
    ... 37 more
Enclosed exception:
org.htmlunit.corejs.javascript.EvaluatorException: An invalid or illegal selector was specified (selector: '[data-css-toggle-id' error: Invalid selectors: [data-css-toggle-id). (script in https://www.udemy.com/courses/search/?lang=en&price=price-paid&q=python&ratings=4.5&sort=relevance&sort=relevance&src=ukw from (557, 62) to (582, 10)#577)
    at org.htmlunit.javascript.HtmlUnitContextFactory$HtmlUnitErrorReporter.runtimeError(HtmlUnitContextFactory.java:454)
    at org.htmlunit.corejs.javascript.Context.reportRuntimeError(Context.java:986)
    at org.htmlunit.corejs.javascript.Context.reportRuntimeError(Context.java:1042)
    at org.htmlunit.javascript.host.dom.Document.querySelectorAll(Document.java:1044)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:222)
    at org.htmlunit.corejs.javascript.FunctionObject.call(FunctionObject.java:423)
    at org.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1874)
    at script(script in https://www.udemy.com/courses/search/?lang=en&price=price-paid&q=python&ratings=4.5&sort=relevance&sort=relevance&src=ukw from (557, 62) to (582, 10):577)
    at script(script in https://www.udemy.com/courses/search/?lang=en&price=price-paid&q=python&ratings=4.5&sort=relevance&sort=relevance&src=ukw from (557, 62) to (582, 10):576)
    at org.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:1051)
    at org.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:89)
    at org.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:392)
    at org.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:335)
    at org.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3914)
    at org.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:102)
    at org.htmlunit.javascript.JavaScriptEngine$2.doRun(JavaScriptEngine.java:858)
    at org.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:972)
    at org.htmlunit.corejs.javascript.Context.call(Context.java:590)
    at org.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:484)
    at org.htmlunit.javascript.HtmlUnitContextFactory.callSecured(HtmlUnitContextFactory.java:349)
    at org.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:867)
    at org.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:843)
    at org.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:834)
    at org.htmlunit.html.HtmlPage.executeJavaScript(HtmlPage.java:966)
    at org.htmlunit.html.ScriptElementSupport.executeInlineScriptIfNeeded(ScriptElementSupport.java:380)
    at org.htmlunit.html.ScriptElementSupport.executeScriptIfNeeded(ScriptElementSupport.java:230)
    at org.htmlunit.html.ScriptElementSupport$1.execute(ScriptElementSupport.java:120)
    at org.htmlunit.html.ScriptElementSupport.onAllChildrenAddedToPage(ScriptElementSupport.java:143)
    at org.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:191)
    at org.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.endElement(HtmlUnitNekoDOMBuilder.java:601)
    at org.htmlunit.cyberneko.xerces.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:412)
    at org.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.endElement(HtmlUnitNekoDOMBuilder.java:548)
    at org.htmlunit.cyberneko.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1273)
    at org.htmlunit.cyberneko.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1200)
    at org.htmlunit.cyberneko.filters.DefaultFilter.endElement(DefaultFilter.java:204)
    at org.htmlunit.cyberneko.filters.NamespaceBinder.endElement(NamespaceBinder.java:274)
    at org.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:2969)
    at org.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1953)
    at org.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:834)
    at org.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:346)
    at org.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:297)
    at org.htmlunit.cyberneko.xerces.parsers.XMLParser.parse(XMLParser.java:76)
    at org.htmlunit.html.parser.neko.HtmlUnitNekoDOMBuilder.parse(HtmlUnitNekoDOMBuilder.java:838)
    at org.htmlunit.html.parser.neko.HtmlUnitNekoHtmlParser.parse(HtmlUnitNekoHtmlParser.java:203)
    at org.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:300)
    at org.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:220)
    at org.htmlunit.WebClient.loadWebResponseInto(WebClient.java:672)
    at org.htmlunit.WebClient.loadWebResponseInto(WebClient.java:574)
    at org.htmlunit.WebClient.getPage(WebClient.java:492)
    at org.htmlunit.WebClient.getPage(WebClient.java:399)
    at org.htmlunit.WebClient.getPage(WebClient.java:537)
    at org.htmlunit.WebClient.getPage(WebClient.java:519)
    at org.example.Scraper.getData(Scraper.java:20)
    at org.example.App.main(App.java:16)
======= EXCEPTION END ========

Stackoverflow thread of this question (for complete context): How to extract the HTML elements inside <div data-module-*> from a website source code using HTMLUnit?
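When many exceptions of this kind are logged, it can help to pull just the offending selector out of each message and see how many distinct selectors are involved. A minimal sketch using only the JDK, assuming every message follows the "(selector: '...' error:" shape seen in the trace; `SelectorErrorParser` and `extractSelector` are hypothetical names and are not part of HtmlUnit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SelectorErrorParser {
    // Matches the "(selector: '...' error:" fragment of the exception message.
    // Assumption: the selector itself does not contain the text "' error:".
    private static final Pattern SELECTOR =
            Pattern.compile("\\(selector: '(.*?)' error:");

    /** Returns the quoted selector, or null if the message has a different shape. */
    public static String extractSelector(String message) {
        Matcher m = SELECTOR.matcher(message);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Message text copied from the exception reported in this issue.
        String message = "An invalid or illegal selector was specified "
                + "(selector: '[data-css-toggle-id' error: "
                + "Invalid selectors: [data-css-toggle-id).";
        System.out.println(extractSelector(message));
    }
}
```

For the message in this issue the extracted selector is [data-css-toggle-id, an attribute selector without its closing bracket.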

abhayjohri23 commented 1 year ago

@rbri Can you please help me find a way out of this?

rbri commented 1 year ago

Thanks for all the details - I will have a deeper look and get back to you in the next few days.

rbri commented 5 months ago

@abhayjohri23 sorry this got lost - still interested in this?