HtmlUnit / htmlunit

HtmlUnit is a "GUI-Less browser for Java programs".
https://www.htmlunit.org
Apache License 2.0

Render HTML from String #331

Closed christian-draeger closed 3 years ago

christian-draeger commented 3 years ago

Is there a convenient way to pass an HTML string to HtmlUnit and get back the rendered page source? I am currently working around this by passing the HTML string as a data URI in place of a URL, which works but is not really convenient.

Here is a small example test that illustrates the described workaround:

    @Test
    fun `can render html string`() {
        val someHtmlIncludingEs6Script = """
            <!DOCTYPE html>
            <html lang="en">
                <head>
                    <title>i'm the title</title>
                </head>
                <body>
                    i'm the body
                    <h1>i'm the headline</h1>
                    <p>i'm a paragraph</p>
                    <p>i'm a second paragraph</p>
                </body>
                <script>
                    const getNodesOf = (selector) => document.querySelectorAll(selector);
                    getNodesOf("p").forEach(p => p.innerHTML = "<span>dynamically added</span>")
                </script>
            </html>
        """.trimIndent()

        val dataUriMimeType = "data:text/html;charset=UTF-8;"
        val base64encoded = Base64.getEncoder().encodeToString(someHtmlIncludingEs6Script.toByteArray())
        val dataUri = "${dataUriMimeType}base64,$base64encoded"

        val client = WebClient(BrowserVersion.BEST_SUPPORTED)
        val page: Page = client.getPage(dataUri)
        val httpResponse = page.webResponse
        val document = when {
            page.isHtmlPage -> (page as HtmlPage).asXml()
            else -> httpResponse.contentAsString
        }

        expectThat(document).isEqualTo("""
            |<?xml version="1.0" encoding="UTF-8"?>
            |<html lang="en">
            |  <head>
            |    <title>
            |      i'm the title
            |    </title>
            |  </head>
            |  <body>
            |    
            |        i'm the body
            |        
            |    <h1>
            |      i'm the headline
            |    </h1>
            |    <p>
            |      <span>
            |        dynamically added
            |      </span>
            |    </p>
            |    <p>
            |      <span>
            |        dynamically added
            |      </span>
            |    </p>
            |    <script>
            |//<![CDATA[
            |
            |        const getNodesOf = (selector) => document.querySelectorAll(selector);
            |        getNodesOf("p").forEach(p => p.innerHTML = "<span>dynamically added</span>")
            |    
            |//]]>
            |    </script>
            |  </body>
            |</html>
            |
        """.trimMargin())
    }

As you can see, the text of all p tags has been overwritten by JavaScript. Great, exactly what I want.

❓ So what's my issue with this? A URL has a maximum length, and a more complex HTML document converted to a Base64 data URI can easily exceed that limit, so this solution only works for "simple" websites.
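To put a number on the growth, here is a minimal sketch in plain Java (no HtmlUnit involved). It relies only on the fact that padded Base64 emits 4 characters for every 3 input bytes. The class name `DataUriLength` and the constant `ASSUMED_MAX_URL_LENGTH` are made up for illustration; the real limit depends on whatever component handles the URL.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DataUriLength {

    // Same prefix as in the workaround above.
    static final String PREFIX = "data:text/html;charset=UTF-8;base64,";

    // Hypothetical limit, chosen only for illustration; not an HtmlUnit constant.
    static final int ASSUMED_MAX_URL_LENGTH = 2048;

    /** Builds the data URI the same way as the test above. */
    static String toDataUri(String html) {
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        return PREFIX + Base64.getEncoder().encodeToString(bytes);
    }

    /** Length of padded Base64 text for the given number of input bytes. */
    static int base64Length(int byteCount) {
        return 4 * ((byteCount + 2) / 3);
    }

    public static void main(String[] args) {
        // A moderately sized page: the same paragraph repeated 100 times.
        String html = "<p>i'm a paragraph</p>".repeat(100);
        String uri = toDataUri(html);

        System.out.println("html length:     " + html.length());
        System.out.println("data URI length: " + uri.length());
        System.out.println("exceeds assumed limit: " + (uri.length() > ASSUMED_MAX_URL_LENGTH));
    }
}
```

The data URI is always roughly a third longer than the HTML itself, so any fixed URL limit is reached sooner than the raw page size suggests.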

💡 Would you mind adding a feature that allows passing an HTML string to HtmlUnit and getting it rendered? Maybe it is already there and I just didn't find it?

twendelmuth commented 3 years ago

Hi Christian,

You should be able to get that behavior by using the WebConnectionWrapper and overriding the response (copy it from a file, from a string, whatever).

Just a quick example to illustrate how to get there:

    @Test
    public void simpleShowCase() throws Exception {
        final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED);
        new WebConnectionWrapper(webClient) {
            @Override
            public WebResponse getResponse(final WebRequest request) throws IOException {
                final String body = "<html><body><div id='example'></div><script>document.getElementById('example').innerText = 'works';</script></body></html>";

                List<NameValuePair> responseHeaders = new ArrayList<>();
                final WebResponseData responseData = new WebResponseData(body.getBytes(), 200, "OK", responseHeaders);
                final WebResponse response = new WebResponse(responseData, request, 0);

                return response;
            }
        };

        HtmlPage page = webClient.getPage("http://doesntreallymatter.com");
        Assert.assertEquals(2, StringUtils.countMatches(page.asXml(), "works"));
    }

christian-draeger commented 3 years ago

Nice, this looks really promising. I'll give it a try as soon as I am back at the keyboard 🙂 Awesome.

rbri commented 3 years ago

Maybe these methods (new in 2.48.0) are your friends:

WebClient.loadHtmlCodeIntoCurrentWindow(String) and WebClient.loadXHtmlCodeIntoCurrentWindow(String).

christian-draeger commented 3 years ago

@rbri It's working like a charm.

Again, a Kotlin example:

    @Test
    fun `can render html string`() {
        val someHtmlIncludingEs6Script = """
            <!DOCTYPE html>
            <html lang="en">
                <head>
                    <title>i'm the title</title>
                </head>
                <body>
                    i'm the body
                    <h1>i'm the headline</h1>
                    <p>i'm a paragraph</p>
                    <p>i'm a second paragraph</p>
                </body>
                <script>
                    const getNodesOf = (selector) => document.querySelectorAll(selector);
                    getNodesOf("p").forEach(p => p.innerHTML = "<span>dynamically added</span>")
                </script>
            </html>
        """.trimIndent()

        val page = WebClient(BrowserVersion.BEST_SUPPORTED).loadHtmlCodeIntoCurrentWindow(someHtmlIncludingEs6Script)
        val renderedHtmlString = page.asXml()
        println(renderedHtmlString)

    }

/* prints :

        <?xml version="1.0" encoding="UTF-8"?>
        <html lang="en">
          <head>
            <title>
              i'm the title
            </title>
          </head>
          <body>

                            i'm the body

            <h1>
              i'm the headline
            </h1>
            <p>
              <span>
                dynamically added
              </span>
            </p>
            <p>
              <span>
                dynamically added
              </span>
            </p>
            <script>
        //<![CDATA[

                            const getNodesOf = (selector) => document.querySelectorAll(selector);
                            getNodesOf("p").forEach(p => p.innerHTML = "<span>dynamically added</span>")

        //]]>
            </script>
          </body>
        </html>
*/

Which leads me to my last question :D What's the "most correct way" to consume the rendered HTML as a string? What I am currently doing is calling asXml() on an HtmlPage, which is more or less correct since the XML output is, in theory, also usable as HTML. On the other hand, I end up with things like <?xml version="1.0" encoding="UTF-8"?> and the CDATA wrapper in my "renderedHtmlString".

rbri commented 3 years ago

From my point of view, asXml() is the normalized, complete view of the current page; asNormalizedText() is the text-only view.
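If the XML declaration and the commented CDATA markers are unwanted in the resulting string, one option is to strip them afterwards. The following is only a sketch of that idea in plain Java; it is string post-processing, not an HtmlUnit feature, and the names `AsXmlCleanup` and `stripXmlArtifacts` are hypothetical. It assumes the markers appear on their own lines, as in the output printed above.

```java
import java.util.regex.Pattern;

public class AsXmlCleanup {

    // A leading XML declaration such as <?xml version="1.0" encoding="UTF-8"?>
    private static final Pattern XML_DECLARATION =
            Pattern.compile("^\\s*<\\?xml[^>]*\\?>\\s*");

    // Lines consisting only of //<![CDATA[ or //]]> (plus optional indentation).
    private static final Pattern CDATA_MARKER_LINES =
            Pattern.compile("(?m)^[ \\t]*//(<!\\[CDATA\\[|\\]\\]>)[ \\t]*\\R?");

    /** Removes the XML declaration and the commented CDATA marker lines. */
    static String stripXmlArtifacts(String asXmlOutput) {
        String withoutDeclaration = XML_DECLARATION.matcher(asXmlOutput).replaceFirst("");
        return CDATA_MARKER_LINES.matcher(withoutDeclaration).replaceAll("");
    }

    public static void main(String[] args) {
        // Shortened stand-in for the asXml() output shown earlier in the thread.
        String sample = String.join("\n",
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
                "<html lang=\"en\">",
                "  <body>",
                "    <script>",
                "//<![CDATA[",
                "      const x = 1;",
                "//]]>",
                "    </script>",
                "  </body>",
                "</html>",
                "");
        System.out.print(stripXmlArtifacts(sample));
    }
}
```

The rest of the string is left exactly as asXml() produced it, including its indentation. A script that itself contains a line consisting only of one of these markers would lose that line too, so this is best treated as a convenience for display or comparison.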

christian-draeger commented 3 years ago

Great, then I will continue that way. I already changed to your approach and everything works smoothly: https://github.com/skrapeit/skrape.it/commit/d9bd8ce3576b19a23f83f3e046599d5382274b1f

rbri commented 3 years ago

Thanks, will close this.

pk604437000 commented 2 years ago

Thanks, this solved my problem :)