HtmlUnit / htmlunit

HtmlUnit is a "GUI-Less browser for Java programs".
https://www.htmlunit.org
Apache License 2.0

htmlunit 2.35 and 2.36 OSGi release (htmlunit-2.36.0-OSGi.jar) leaks threads and hangs the applications #120

Open waynexin opened 4 years ago

waynexin commented 4 years ago

I recently upgraded to 2.35 and then 2.36, using HtmlUnit as a crawler. This problem did not occur with 2.33. After crawling a large number of pages, I start to see a growing number of threads like the following in the thread dump; eventually they exhaust system resources and hang the container (Docker).

    "WebSocketClient@1404341633-126276" #126276 daemon prio=5 os_prio=0 tid=0x00007faba4228800 nid=0xbe7 runnable [0x00007fa1ec102000]
       java.lang.Thread.State: RUNNABLE
            at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
            at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
            at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
            at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)

    "WebSocketClient@1404341633-126275" #126275 daemon prio=5 os_prio=0 tid=0x00007faba403b000 nid=0xbdb waiting on condition [0x00007fa28901c000]
       java.lang.Thread.State: TIMED_WAITING (parking)
            at sun.misc.Unsafe.park(Native Method)
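A thread leak like the one in this dump can also be tracked from inside the JVM, without taking repeated thread dumps. The following is a minimal sketch using only the JDK; the class name `ThreadLeakProbe` and the demo thread are hypothetical, and the `WebSocketClient@` prefix is taken from the thread names shown above.

```java
public class ThreadLeakProbe {

    /** Counts live threads whose name starts with the given prefix. */
    public static long countThreadsWithPrefix(String prefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith(prefix))
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a leaked pool thread, named like the ones in the dump.
        Thread leaked = new Thread(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException ignored) {
                // interrupted at the end of the demo; nothing to clean up
            }
        }, "WebSocketClient@demo-1");
        leaked.setDaemon(true);
        leaked.start();

        System.out.println("WebSocketClient threads: "
                + countThreadsWithPrefix("WebSocketClient@"));
        leaked.interrupt();
    }
}
```

Logging this count after each crawl should show whether the number keeps growing after `webClient.close()` or levels off.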

The code that launches the crawl looks like the following. Because some page loads can be very slow, I submit the crawl as a future task and cancel it in a finally block.

    .....
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future future = executor.submit(new HtmlunitCrawl(urlWithProto, timeout, useProxy));
    .....
    } finally {
        future.cancel(true);
        executor.shutdownNow();
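For reference, the submit-then-cancel pattern described above can be sketched with the JDK alone. This is an illustration under assumptions, not the reporter's actual code: `BoundedTask` and `runWithTimeout` are hypothetical names, and the lambdas stand in for the `HtmlunitCrawl` task. Note that `Future.cancel(true)` only interrupts the thread running the submitted task; threads that the task itself started are not affected by it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedTask {

    /** Runs the task with a deadline; returns the fallback if it fails or does not finish in time. */
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } catch (TimeoutException | ExecutionException e) {
            return fallback;
        } finally {
            future.cancel(true);    // interrupts the worker thread if the task is still running
            executor.shutdownNow(); // releases the single worker thread
        }
    }

    public static void main(String[] args) {
        String fast = runWithTimeout(() -> "done", 1000, "timed out");
        String slow = runWithTimeout(() -> {
            Thread.sleep(5000); // simulates a slow page load
            return "done";
        }, 100, "timed out");
        System.out.println(fast + " / " + slow); // prints "done / timed out"
    }
}
```

With this shape, any resource the task opens (such as a `WebClient`) still has to be closed by the task itself, since cancellation only delivers an interrupt.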

The crawling code:

        java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
        java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);

        WebClient webClient = new WebClient(BROWSER_VERSION);
        webClient.getOptions().setTimeout(timeout*1000); 
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setPopupBlockerEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setRedirectEnabled(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.getOptions().setPrintContentOnFailingStatusCode(false);

        webClient.setJavaScriptTimeout(sJavascriptTimeout*1000);        
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.waitForBackgroundJavaScript(sJavascriptTimeout*1000);
        webClient.setScriptPreProcessor(new PandoraHtmlunitScriptPreprocessor());
        // webClient.setRefreshHandler(new ThreadedRefreshHandler());
        webClient.setRefreshHandler(new WaitingRefreshHandler(timeout));

.....

                PandoraWebConnection conn = new PandoraWebConnection(webClient);
                webClient.setWebConnection(conn);
                RedirectChain rc = new RedirectChain();
                rc.entryUrl = url;
                PandoraWebConnection.REDIRECT_TABLE.put(Thread.currentThread().getName(), rc);
                aPage = webClient.getPage(urlWithProto);
                Thread.sleep(sCrawlerWaitTime*1000);
                crawlResp = conn.getLastResponse();
                // System.out.println("Sleeping 6 seconds for page to fully load");
                // Thread.sleep(6000);
                PandoraHtmlunitScriptPreprocessor.RUNNING_SCRIPTS.remove(Thread.currentThread().getName());

        } finally {
            try {
                webClient.close();
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }

waynexin commented 4 years ago

For some reason, the formatting didn't turn out well. Basically, I use webClient.getPage() to crawl and call webClient.close() in the finally block.

rbri commented 4 years ago

Can you please check with version 2.37.0? I made some changes to make the closing of WebSockets more robust.

waynexin commented 4 years ago

Sure. I'll give it a try.

-Wayne



RuralHunter commented 3 years ago

I still see the same behavior with 2.44. Not sure whether this is related: https://stackoverflow.com/questions/46450721/how-do-you-close-websocketcontainer-websocketclient-jetty-client-in-java