Closed christian-draeger closed 3 years ago
Hi Christian,
you should be able to get that behavior by using the WebConnectionWrapper
and overriding getResponse() to return your own response (copied from a file, from a string, whatever).
Just a quick example to illustrate how to get there:
@Test
public void simpleShowCase() throws Exception {
    final WebClient webClient = new WebClient(BrowserVersion.BEST_SUPPORTED);

    // The wrapper installs itself on the web client and intercepts every request.
    new WebConnectionWrapper(webClient) {
        @Override
        public WebResponse getResponse(final WebRequest request) throws IOException {
            // Serve a fixed HTML string instead of performing a real HTTP call.
            final String body = "<html><body><div id='example'></div><script>document.getElementById('example').innerText = 'works';</script></body></html>";
            List<NameValuePair> responseHeaders = new ArrayList<>();
            final WebResponseData responseData = new WebResponseData(body.getBytes(), 200, "OK", responseHeaders);
            final WebResponse response = new WebResponse(responseData, request, 0);
            return response;
        }
    };

    // The URL is irrelevant because the wrapper answers every request.
    HtmlPage page = webClient.getPage("http://doesntreallymatter.com");
    // "works" appears twice: once in the script source, once in the rendered div.
    Assert.assertEquals(2, StringUtils.countMatches(page.asXml(), "works"));
}
Nice, this looks really promising. I'll give it a try as soon as I am back at the keyboard 🙂 awesome
Maybe these methods (new in 2.48.0) are your friends
WebClient.loadHtmlCodeIntoCurrentWindow(String)
WebClient.loadXHtmlCodeIntoCurrentWindow(String).
@rbri it's working like a charm.
Again a Kotlin example:
@Test
fun `can render html string`() {
    val someHtmlIncludingEs6Script = """
        <!DOCTYPE html>
        <html lang="en">
            <head>
                <title>i'm the title</title>
            </head>
            <body>
                i'm the body
                <h1>i'm the headline</h1>
                <p>i'm a paragraph</p>
                <p>i'm a second paragraph</p>
            </body>
            <script>
                const getNodesOf = (selector) => document.querySelectorAll(selector);
                getNodesOf("p").forEach(p => p.innerHTML = "<span>dynamically added</span>")
            </script>
        </html>
    """.trimIndent()
    val page = WebClient(BrowserVersion.BEST_SUPPORTED).loadHtmlCodeIntoCurrentWindow(someHtmlIncludingEs6Script)
    val renderedHtmlString = page.asXml()
    println(renderedHtmlString)
}
/* prints :
<?xml version="1.0" encoding="UTF-8"?>
<html lang="en">
<head>
<title>
i'm the title
</title>
</head>
<body>
i'm the body
<h1>
i'm the headline
</h1>
<p>
<span>
dynamically added
</span>
</p>
<p>
<span>
dynamically added
</span>
</p>
<script>
//<![CDATA[
const getNodesOf = (selector) => document.querySelectorAll(selector);
getNodesOf("p").forEach(p => p.innerHTML = "<span>dynamically added</span>")
//]]>
</script>
</body>
</html>
*/
Which leads me to my last question :D
What's the "most correct way" to consume the rendered HTML as a string?
What I am currently doing is calling asXml() on an HtmlPage, which is more or less correct, since the XML serialization of the page is still usable as (X)HTML. On the other hand, I end up with things like <?xml version="1.0" encoding="UTF-8"?> and a CDATA wrapper in my "renderedHtmlString".
From my point of view asXml() is the normalized complete view of the current page; asNormalizedText() is the text only view.
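If the XML declaration and the CDATA marker lines get in the way, one option is to post-process the asXml() output as plain text. The sketch below uses only the Java standard library. The class and method names (RenderedHtmlSketch, stripXmlArtifacts) are made up for illustration and are not part of HtmlUnit, and it assumes the markers look exactly like the ones in the output printed above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of HtmlUnit: cleans up a string produced by asXml().
final class RenderedHtmlSketch {

    // An XML declaration at the very start of the text, plus surrounding whitespace.
    private static final Pattern XML_DECLARATION =
            Pattern.compile("\\A\\s*<\\?xml[^>]*\\?>\\s*");

    // Whole lines consisting only of "//<![CDATA[" or "//]]>" (assumed marker format).
    private static final Pattern CDATA_MARKER_LINE =
            Pattern.compile("(?m)^[ \\t]*//(<!\\[CDATA\\[|\\]\\]>)[ \\t]*\\R?");

    static String stripXmlArtifacts(String xml) {
        String withoutDeclaration = XML_DECLARATION.matcher(xml).replaceFirst("");
        return CDATA_MARKER_LINE.matcher(withoutDeclaration).replaceAll("");
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<html>\n<script>\n//<![CDATA[\nconst a = 1;\n//]]>\n</script>\n</html>\n";
        System.out.println(stripXmlArtifacts(xml));
    }
}
```

This is purely textual, so a script that legitimately contains such a marker line on its own would lose it as well. If the declaration and CDATA wrapper do no harm, using asXml() unchanged stays the simpler choice.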
Great, then I will continue that way. Already changed to your approach and everything works smoothly: https://github.com/skrapeit/skrape.it/commit/d9bd8ce3576b19a23f83f3e046599d5382274b1f
Thanks, will close this.
Thanks, this solved my problem :)
Is there a convenient way to pass an HTML string to HtmlUnit and get back the rendered page source? I am currently working around this by passing the HTML string as a URL using the data URI scheme, which works but is not really convenient.
here is a little example test that on the one hand illustrates the described workaround:
As you can see, the text of all p tags has been overwritten by JavaScript. Great, exactly what I want. ❓ So what's my issue with this? A URL has a maximum length, and a more complex HTML document converted to a base64 data URI string can easily exceed that limit, so this solution only works for "simple" websites.
💡 Would you mind adding a feature that allows passing an HTML string to HtmlUnit and getting it rendered? Maybe it is even already there and I just didn't find it?
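For reference, the data URI workaround mentioned in the question can be sketched roughly like this. It is not the original test from this issue, only a standard-library illustration of how the URL is built and why it grows quickly. The class and method names (DataUriSketch, toDataUri) are hypothetical, and the UTF-8 charset parameter is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper showing how an HTML string becomes a base64 data URI.
final class DataUriSketch {

    static String toDataUri(String html) {
        // Base64 turns every 3 bytes into 4 characters, so the URL is
        // roughly a third longer than the HTML it carries.
        String encoded = Base64.getEncoder()
                .encodeToString(html.getBytes(StandardCharsets.UTF_8));
        return "data:text/html;charset=utf-8;base64," + encoded;
    }

    public static void main(String[] args) {
        String html = "<html><body><p>hello</p></body></html>";
        String uri = toDataUri(html);
        // The resulting string is what would be passed as the URL, e.g. to webClient.getPage(...).
        System.out.println(uri.length() + " URL characters for "
                + html.length() + " characters of HTML");
    }
}
```

Because the whole document ends up inside the URL, larger pages run into length limits sooner than their raw size suggests, which is the limitation described above.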