HtmlUnit / htmlunit

HtmlUnit is a "GUI-Less browser for Java programs".
https://www.htmlunit.org
Apache License 2.0

Memory leak (even after webclient.close()) #729

Open fleboulch opened 7 months ago

fleboulch commented 7 months ago

Hello,

I want to thank you for your amazing work. I've been using your library for almost a year now and it's really nice.

I'm having an issue with heap memory.

Showcase 1: I start my app without doing any scraping

Heap: 74 MB

Showcase 2: I start my app and do one scrape, then close the client

Heap: 256 MB

import org.htmlunit.SilentCssErrorHandler
import org.htmlunit.WebClient
import org.htmlunit.html.HtmlElement
import org.htmlunit.html.HtmlPage
import org.htmlunit.javascript.SilentJavaScriptErrorListener

class ArkeaArenaFetcher {

    fun fetch(): List<EventJpa> {
        val webClient = WebClient().apply {
            options.isCssEnabled = true
            options.isJavaScriptEnabled = true
            cssErrorHandler = SilentCssErrorHandler()
            javaScriptErrorListener = SilentJavaScriptErrorListener()
            options.isThrowExceptionOnFailingStatusCode = false
        }

        return try {
            val page = webClient.getPage<HtmlPage>("https://www.arkeaarena.com/fr/programmation/tous-les-evenements/#")
            // Give background JavaScript up to 4 seconds to load the event list
            webClient.waitForBackgroundJavaScript(4000)
            val container: HtmlElement = page.getFirstByXPath("//div[@class='events-list ajaxed']/div[@class='container']")
            val rawEvents = container.getByXPath<HtmlElement>("a")
            rawEvents.map(::htmlToInfra)
        } catch (e: Exception) {
            // Any failure (network error, missing container, parsing) yields an empty result
            emptyList()
        } finally {
            webClient.close()
        }
    }

    private fun htmlToInfra(html: HtmlElement): EventJpa {
        // convert html to Kotlin object
        ...
    }

}

Showcase 3: I start my app and do one scrape with close + extra cleanup + GC

Heap: 166 MB

The code is the same as in showcase 2; only the finally clause changes, as shown below.

        finally {
            webClient.cache.clear()
            webClient.topLevelWindows.forEach { it.close(false) }
            webClient.topLevelWindows.forEach { it.jobManager.removeAllJobs() }
            webClient.cookieManager.clearCookies()
            webClient.close()
            System.gc()
        }

The issue is that even after I close the WebClient instance, some memory is still not released. My example code deals with a single source, but in production I deal with multiple sources.

I also tried

Other info

Article read about the memory subject:

Similar issues

rbri commented 7 months ago

@fleboulch, first of all, great to see that this is of some use to you; thanks for the feedback.

can you please add

webClient.cookieManager.clearCookies()

to your second case, because this is not part of the close process.

And can you please try HtmlUnit 3.11.0....

rbri commented 7 months ago

> The issue is that even after I close the WebClient instance, some memory is still not released. My example code deals with a single source, but in production I deal with multiple sources.

I think there are a lot of things that are created and stored, but I think the point is: if you create a webClient several times and do some scraping, then after closing the client the memory should go back to the level it reached after the first round...
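The check described above can be sketched without HtmlUnit: run the same scrape several times, sample the used heap after each round, and compare the later rounds with the first one. This is only a rough sketch. `usedHeapBytes`, `heapAfterEachRound`, and `looksLikeLeak` are hypothetical helper names, the tolerance is an arbitrary assumption, and `System.gc()` is only a hint to the JVM, so the numbers are approximate.

```kotlin
// Hypothetical helper: used heap in bytes after asking the JVM to collect garbage.
fun usedHeapBytes(): Long {
    val runtime = Runtime.getRuntime()
    // System.gc() is a request, not a guarantee; repeating it with short
    // pauses tends to give a more stable reading.
    repeat(3) {
        System.gc()
        Thread.sleep(100)
    }
    return runtime.totalMemory() - runtime.freeMemory()
}

// Runs `round` the given number of times and records the used heap after each run.
fun heapAfterEachRound(rounds: Int, round: () -> Unit): List<Long> =
    (1..rounds).map {
        round()
        usedHeapBytes()
    }

// A leak is suspected when the last round ends well above the level
// reached after the first round (which includes one-time warm-up costs).
fun looksLikeLeak(samples: List<Long>, toleranceBytes: Long): Boolean {
    require(samples.size >= 2) { "need at least two rounds to compare" }
    return samples.last() - samples.first() > toleranceBytes
}

fun main() {
    // Stand-in workload. In the real app this would be something like
    // ArkeaArenaFetcher().fetch(), so each round creates and closes a WebClient.
    val samples = heapAfterEachRound(rounds = 5) {
        val scratch = ByteArray(10 * 1024 * 1024)
        scratch[0] = 1
    }
    samples.forEachIndexed { i, bytes ->
        println("round ${i + 1}: ${bytes / (1024 * 1024)} MB")
    }
    println("leak suspected: ${looksLikeLeak(samples, toleranceBytes = 20L * 1024 * 1024)}")
}
```

Comparing against the first round rather than against the idle application is the point of the exercise: the first scrape pays one-time costs such as class loading and warm-up, which do not indicate a leak on their own.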

fleboulch commented 7 months ago

I would like to use version 3.11.0, but my test suite has been failing since 3.10.0. I added a comment here

fleboulch commented 7 months ago

Yes, you are correct! Even with a single WebClient instance the memory rises quite fast, and in production I don't have a large setup (1 GB of memory).

fleboulch commented 7 months ago

Second issue found when trying to migrate from 3.9.0 to 3.11.0 (comment). The issue was introduced in 3.10.0.

fleboulch commented 7 months ago

Hello @rbri,

I see you are preparing version 4.0.0. That's great news! Did you have time to check the regressions I mentioned in my comments here?

fleboulch commented 6 months ago

I tried v4.0.0 and the regressions I mentioned earlier have disappeared!
Thanks for the amazing work @rbri :tada:
Nevertheless, I still have my original memory leak issue (I tried some of the things you suggested above, but they did not help).

rbri commented 4 months ago

@fleboulch sorry for the long pause

> The issue is that even after I close the WebClient instance, some memory is still not released. My example code deals with a single source, but in production I deal with multiple sources.

There are some internal (class-based) caches that might be the reason.

I think a valid test scenario looks like this

So much for the theory; I will try to find some time to check the code again.
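To illustrate why a class-based cache would behave this way, here is a minimal sketch with made-up classes. It is not HtmlUnit code, and `Client`, `sharedCache`, and `instanceCache` are hypothetical names. State held by an instance becomes collectable once the instance is closed and dropped, while state held at class level stays reachable for as long as the class is loaded.

```kotlin
import java.io.Closeable

// Hypothetical class for illustration only; this is not how HtmlUnit is implemented.
class Client : Closeable {
    // Per-instance state: released by close() and unreachable once the client is dropped.
    private val instanceCache = HashMap<String, ByteArray>()

    fun load(key: String) {
        val data = ByteArray(1024)
        instanceCache[key] = data
        // Class-level state: shared by every client and not touched by close().
        sharedCache[key] = data
    }

    fun instanceCacheSize(): Int = instanceCache.size

    override fun close() {
        instanceCache.clear()
    }

    companion object {
        // Lives as long as the class is loaded, independent of any instance.
        val sharedCache = HashMap<String, ByteArray>()
    }
}

fun main() {
    Client().use { it.load("https://example.org/a") }
    Client().use { it.load("https://example.org/b") }
    // Both clients are closed, yet the class-level entries are still reachable.
    println(Client.sharedCache.size) // prints 2
}
```

If retention of this kind is what happens, the heap would be expected to stay above the idle level after the first scrape but to stop growing once the same work is repeated, which is different from a leak that grows with every round.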

fleboulch commented 4 months ago

Thanks for your reply, @rbri! I really appreciate your deep investigation. I will try your scenario on my code to check whether your assumptions hold. I'm using different WebClient instances because, at the beginning, I was parallelizing the calls.

fleboulch commented 4 months ago

I checked your comment and it seems correct!
In my app I need to scrape multiple sites/external sources, and I don't need any caching mechanism (especially after a close). I scrape these websites once a day, and currently the memory used stays high.
What are your recommendations for my use case?

fleboulch commented 1 week ago

Hello @rbri, do you have any news about this issue?