Open Jin9628 opened 11 months ago
Do you have a minimal sample that lets me reproduce this?
This is the demo that I used:
public static String getTextByHtmlUrl(String htmlUrl) {
    if (StringUtils.isBlank(htmlUrl)) {
        return StringUtils.EMPTY;
    }
    WebClient webClient = createWebClient();
    HtmlPage page = null;
    String text = "";
    try {
        page = webClient.getPage(htmlUrl);
        webClient.waitForBackgroundJavaScript(LOAD_BACKGROUND_JAVASCRIPT_TIME);
        String pageXml = page.asXml();
        Document document = Jsoup.parse(pageXml);
        Elements body = document.select(BODY);
        text = body.get(0).text();
        Elements iframes = document.select(IFRAME);
        if (CollectionUtils.isEmpty(iframes)) {
            return text;
        }
        return getTextByInnerInframe(text, htmlUrl, iframes, webClient);
    } catch (Throwable e) {
        LoggerUtil.warn(LOGGER, "HtmlAnalysisUtil->getTextByHtmlUrl error");
        return StringUtils.EMPTY;
    } finally {
        webClient.close();
    }
}

private static String getTextByInnerInframe(String sourceText, String sourceUrl, Elements iframes, WebClient webClient) {
    StringBuilder stringBuilder = new StringBuilder();
    stringBuilder.append(sourceText);
    iframes.stream().forEach(iframe -> {
        String iframeUrl = iframe.attr(SRC);
        if (StringUtils.isNotBlank(iframeUrl)) {
            iframeUrl = buildValidUrl(sourceUrl, iframeUrl);
            try {
                HtmlPage innerPage = webClient.getPage(iframeUrl);
                webClient.waitForBackgroundJavaScript(LOAD_BACKGROUND_JAVASCRIPT_TIME);
                String innerPageXml = innerPage.asXml();
                Document innerDocument = Jsoup.parse(innerPageXml);
                Elements innerBody = innerDocument.select(BODY);
                stringBuilder.append(innerBody.get(0).data());
            } catch (Throwable e) {
                LoggerUtil.error(LOGGER, e, "HtmlAnalysisUtil->getTextByInnerInframe error");
            }
        }
    });
    return stringBuilder.toString();
}

private static WebClient createWebClient() {
    WebClient webClient = null;
    webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setRedirectEnabled(false);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setActiveXNative(false);
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    return webClient;
}
@Jin9628 and are you still seeing this memory leak after webClient.close()?
@Jin9628
String pageXml = page.asXml();
Document document = Jsoup.parse(pageXml);
Seeing code like this makes me sad ;-) I think everything you get from Jsoup is also available in HtmlUnit.
Maybe you can improve your code (or give me a hint about what is missing). Serializing the page back to XML and then parsing it again is not very efficient.
Yeah, I will improve my code, but I don't think that will solve the problem.
@Jin9628 do you also have a URL for your sample so that I can debug this here?
This is an example I found that you can use for debugging: https://webs.csjywlkj.cn/privacy-tcyx?a=1
@rbri Sorry, have you made any progress?
@Jin9628 sorry, I wrote a small test program that fetches the page and ran it in a loop for 20 minutes, but I can't see any memory leak; the profiler shows no growth in memory.
Maybe you can provide a small test program?
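For anyone who wants to repeat this kind of looped check without a profiler, a rough sketch could look like the following. It uses only the standard java.lang.management API; the class name HeapGrowthCheck, its two helper methods, and the placeholder workload are hypothetical and are not the test program mentioned above. To test the reported scenario, the placeholder would be replaced by a call such as getTextByHtmlUrl(url).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: runs a task in a loop and samples used heap after each run.
class HeapGrowthCheck {

    /** Runs the task repeatedly and records the used heap (in bytes) after each iteration. */
    public static List<Long> sampleUsedHeap(Runnable task, int iterations) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        List<Long> samples = new ArrayList<>(iterations);
        for (int i = 0; i < iterations; i++) {
            task.run();
            // Ask for a GC so each sample is closer to retained memory;
            // this is only a hint and the JVM may ignore it.
            System.gc();
            samples.add(memory.getHeapMemoryUsage().getUsed());
        }
        return samples;
    }

    /** Difference between the last and the first sample, in bytes (0 if fewer than two samples). */
    public static long growth(List<Long> samples) {
        if (samples.size() < 2) {
            return 0L;
        }
        return samples.get(samples.size() - 1) - samples.get(0);
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption); swap in the real page fetch here.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
        };
        List<Long> samples = sampleUsedHeap(task, 20);
        System.out.printf("first=%d last=%d growth=%d bytes%n",
                samples.get(0), samples.get(samples.size() - 1), growth(samples));
    }
}
```

A steadily increasing series of samples over many iterations would hint at a leak, while a flat or oscillating series would not. Heap numbers taken this way are noisy, so a heap dump or profiler is still the more reliable tool.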
@Jin9628 can you please try the latest release and report your results?
I ran into a problem. With version 2.70.0, I found that the following classes take up almost 5 GB of memory:
com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory$TimeoutContext
net.sourceforge.htmlunit.corejs.javascript.Interpreter$CallFrame
net.sourceforge.htmlunit.corejs.javascript.ConsString
When I set webClient.getOptions().setJavaScriptEnabled(false), I don't have this problem. I also tried the latest version, 3.9.0, but the problem persists.