HtmlUnit / htmlunit

HtmlUnit is a "GUI-Less browser for Java programs".
https://www.htmlunit.org
Apache License 2.0

Memory leak - version 3.9.0 #695

Open Jin9628 opened 11 months ago

Jin9628 commented 11 months ago

I ran into a problem. Using version 2.70.0, I found that instances of "com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory$TimeoutContext", "net.sourceforge.htmlunit.corejs.javascript.Interpreter$CallFrame", and "net.sourceforge.htmlunit.corejs.javascript.ConsString" take up almost 5 GB of memory. When I set webClient.getOptions().setJavaScriptEnabled(false), I don't have this problem. I also tried the latest version, 3.9.0, but the problem persists.

rbri commented 11 months ago

Do you have a minimal sample so I can reproduce this?

Jin9628 commented 11 months ago

> Do you have a minimal sample so I can reproduce this?

This is the demo that I used:

public static String getTextByHtmlUrl(String htmlUrl) {
    if (StringUtils.isBlank(htmlUrl)) {
        return StringUtils.EMPTY;
    }
    WebClient webClient = createWebClient();
    HtmlPage page = null;
    String text = "";
    try {
        page = webClient.getPage(htmlUrl);
        webClient.waitForBackgroundJavaScript(LOAD_BACKGROUND_JAVASCRIPT_TIME);
        String pageXml = page.asXml();
        Document document = Jsoup.parse(pageXml);
        Elements body = document.select(BODY);
        text = body.get(0).text();
        Elements iframes = document.select(IFRAME);
        if (CollectionUtils.isEmpty(iframes)) {
            return text;
        }
        return getTextByInnerInframe(text, htmlUrl, iframes, webClient);
    } catch (Throwable e) {
        LoggerUtil.warn(LOGGER, "HtmlAnalysisUtil->getTextByHtmlUrl error");
        return StringUtils.EMPTY;
    } finally {
        webClient.close();
    }
}

private static String getTextByInnerInframe(String sourceText, String sourceUrl, Elements iframes, WebClient webClient) {
    StringBuilder stringBuilder = new StringBuilder();
    stringBuilder.append(sourceText);
    iframes.stream().forEach(iframe -> {
        String iframeUrl = iframe.attr(SRC);
        if (StringUtils.isNotBlank(iframeUrl)) {
            iframeUrl = buildValidUrl(sourceUrl, iframeUrl);
            try {
                HtmlPage innerPage = webClient.getPage(iframeUrl);
                webClient.waitForBackgroundJavaScript(LOAD_BACKGROUND_JAVASCRIPT_TIME);
                String innerPageXml = innerPage.asXml();
                Document innerDocument = Jsoup.parse(innerPageXml);
                Elements innerBody = innerDocument.select(BODY);
                stringBuilder.append(innerBody.get(0).data());
            } catch (Throwable e) {
                LoggerUtil.error(LOGGER, e, "HtmlAnalysisUtil->getTextByInnerInframe error");
            }
        }
    });
    return stringBuilder.toString();
}

private static WebClient createWebClient() {
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setRedirectEnabled(false);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setActiveXNative(false);
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    return webClient;
}

rbri commented 11 months ago

@Jin9628 And you still see this memory leak after calling webClient.close()?

rbri commented 11 months ago

@Jin9628

String pageXml = page.asXml();
Document document = Jsoup.parse(pageXml);

Seeing code like this makes me sad ;-) I think everything you get from Jsoup is also available directly in HtmlUnit.

Maybe you can improve your code (or give me a hint about what is missing). Serializing the page back to XML and then parsing it again is not very efficient.

Jin9628 commented 11 months ago

> @Jin9628 And you still see this memory leak after calling webClient.close()?

Yes. I will improve my code, but I don't think that will solve the problem.

rbri commented 11 months ago

@Jin9628 Do you also have a URL for your sample so I can debug this here?

Jin9628 commented 10 months ago

> @Jin9628 Do you also have a URL for your sample so I can debug this here?

Here is an example URL I found that you can use for debugging: https://webs.csjywlkj.cn/privacy-tcyx?a=1

Jin9628 commented 10 months ago

@rbri Sorry, have you made any progress?

rbri commented 10 months ago

@Jin9628 Sorry, I wrote a small test program that fetches the page in a loop for 20 minutes, but I can't see any memory leak; the profiler shows no growth in memory usage.

Maybe you can provide a small test program?
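
For anyone who wants to put together such a test program, here is a minimal sketch of a loop-based heap check. It is not the program rbri used; the class name HeapGrowthCheck, the method names, and the iteration counts are all made up for illustration, and the workload is a placeholder that should be replaced with a call such as getTextByHtmlUrl(url). It uses only the JDK, so the HtmlUnit call has to be plugged in by the reader. Note that System.gc() is only a hint to the JVM, so the numbers are approximate; a profiler or a heap dump remains the more reliable tool.

```java
import java.util.ArrayList;
import java.util.List;

public class HeapGrowthCheck {

    /** Returns the heap currently in use (bytes) after requesting a GC. */
    public static long usedHeapAfterGc() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // only a hint; the JVM may ignore it
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Runs the task repeatedly and records the used heap every
     * sampleEvery iterations, so a steady upward trend becomes visible.
     */
    public static List<Long> sample(Runnable task, int iterations, int sampleEvery) {
        if (sampleEvery <= 0) {
            throw new IllegalArgumentException("sampleEvery must be positive");
        }
        List<Long> samples = new ArrayList<>();
        for (int i = 1; i <= iterations; i++) {
            task.run();
            if (i % sampleEvery == 0) {
                samples.add(usedHeapAfterGc());
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption): replace with the code under
        // suspicion, e.g. a call that creates a WebClient, loads the page,
        // and closes the client again.
        Runnable task = () -> {
            byte[] scratch = new byte[1024 * 1024];
            scratch[0] = 1;
        };

        List<Long> samples = sample(task, 50, 10);
        for (int i = 0; i < samples.size(); i++) {
            System.out.printf("sample %d: %d KiB used%n", i + 1, samples.get(i) / 1024);
        }
    }
}
```

If the sampled values keep climbing over many iterations even though every client is closed, that points to retained objects worth inspecting in a heap dump; if they level off, the memory is being reclaimed and the problem likely depends on something outside this loop.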

rbri commented 7 months ago

@Jin9628 can you please try the latest release and report your results