adbar / htmldate

Fast and robust date extraction from web pages, with Python or on the command-line
https://htmldate.readthedocs.io
Apache License 2.0

Memory leak #56

Closed: adbar closed this issue 2 years ago

adbar commented 2 years ago

See issue https://github.com/adbar/trafilatura/issues/216.

Extracting the date from the same web page multiple times shows that the module is leaking memory; this doesn't appear to be related to extensive_search:

import os
import psutil
from htmldate import find_date

with open('test.html', 'rb') as inputf:
    html = inputf.read()

for i in range(10):
    result = find_date(html, extensive_search=False)
    process = psutil.Process(os.getpid())
    print(i, ":", process.memory_info().rss / 1024 ** 2)  # resident set size in MiB

tracemalloc doesn't give any clue.
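For reference, the check that came up empty looked roughly like this (a minimal tracemalloc sketch reusing the html variable from the snippet above, not the exact code used):

import tracemalloc
from htmldate import find_date

tracemalloc.start()
for _ in range(10):
    find_date(html, extensive_search=False)
# top Python-level allocation sites; allocations made inside C extensions
# such as lxml do not show up here
for stat in tracemalloc.take_snapshot().statistics("lineno")[:10]:
    print(stat)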

kinoute commented 2 years ago

Running your example with memray, it also seems to be related to lxml:

0 : 43.53125
1 : 44.6875
2 : 45.953125
3 : 47.3515625
4 : 48.1015625
5 : 49.23828125
6 : 51.14453125
7 : 51.765625
8 : 53.2890625
9 : 54.8046875
(screenshot: memray output, 2022-06-15)

adbar commented 2 years ago

I found it: it wasn't lxml as such, but rather lxml in combination with a function cache.
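Schematically, the anti-pattern looked like this (hypothetical names, only meant to illustrate the mechanism; the point is that the cache keeps a strong reference to every tree passed in):

from functools import lru_cache
from lxml import html

@lru_cache(maxsize=8192)
def extract_candidate(tree):  # an lxml tree used as the cache key...
    return tree.findtext(".//title")

tree = html.fromstring("<html><head><title>demo</title></head></html>")
print(extract_candidate(tree))
# ...the cache now holds a strong reference to the tree, so up to 8192
# full document trees can stay pinned in memory even though the same
# tree object is rarely, if ever, seen twice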

kinoute commented 2 years ago

Using master, the increase is acceptable now; good job. Can't wait for the new release.

adbar commented 2 years ago

It's out :heavy_check_mark:

dmoklaf commented 2 years ago

Have hit this bug as well tonight through trafilatura. It was making the Python process very difficult to scale in the number of pages crawled. Are you sure about the remaining lru_cache calls, i.e., that their function arguments do not involve a lot of memory (e.g. non-small HTML strings or lxml trees as arguments)? Thx

adbar commented 2 years ago

@dmoklaf Yes, the main problem has been addressed: there are no lxml trees as arguments anymore. Besides, a few other lru_cache calls have been removed.

You may have to update htmldate manually in order to benefit from the fix; a new trafilatura version should be out soon.

dmoklaf commented 2 years ago

Thanks. I did that last night but still see regular memory increases (with a few thousand docs I quickly reach 1 GB, and from there it keeps growing).

The use of global variables to hold data (in this case, the dictionaries hidden behind @lru_cache in both htmldate and trafilatura) is causing this, and it prevents me from telling the process "get rid of all your caches" once the crawling is done.

If you refactor htmldate and trafilatura's code in the next major version, it might be beneficial to encapsulate all the cache data structures (in fact, everything) in an object. This way, when clients of your library are done, they just get rid of this object and have a 100% guarantee that all the memory is freed up (no global variables holding user data staying in the background).

adbar commented 2 years ago

Thanks for your input, the idea seems interesting, could you please point me to examples of the solution you describe?

In the meantime, you can try to set the CACHE_SIZE value in settings.py much lower (say 2 instead of 8192).

dmoklaf commented 2 years ago

I have forced CACHE_SIZE=2 in both htmldate and trafilatura, and I can confirm that the memory growth is gone. It's definitely the various caches, which grow very large and can't be cleared afterward.

A clean encapsulation into an object would look like this (illustrated here for Trafilatura, but the same principle applies to both libs). Please read the very last item, because it simplifies the work to be done a lot (to something that I think can be done in an hour); a consolidated code sketch follows the list:

  1. Externally, clients of the library don't call global functions anymore. They first create an Extractor object with the proper cache configuration. This object provides public methods to do the job:

    extractor = trafilatura.Extractor(caches_size=1024)
    for document in documents:
        extraction = extractor.extract(document)
        ...
  2. When clients are done with the extraction, they just get rid of this Extractor object. As this object carries all the caching structures (see below), they are guaranteed that all the memory will be freed upon the next garbage collection

  3. You thus define this class with a tiny skeleton that does nothing:

    class Extractor:
    
        def __init__(self, caches_size=4096):
            pass
  4. You remove all @lru_cache decorators

  5. You move all public and private functions to methods of this new class, prefixing all the private methods (those not meant to be called by your clients, like trim) with a single underscore _:

    class Extractor:
        ...
        def _some_private_method(self, ...):
            ...
            trimmed_string = self._trim(string)
            ...
    
        def _trim(self, text):
            ...
  6. You verify that it works as before (but may be slow as no caching is going on)

  7. You add back the caches with the following trick: in the constructor of Extractor, you wrap the methods you want to cache using the good old lru_cache:

    class Extractor:
        ...
        def __init__(self, caches_size=4096):
            self._cached_trim = functools.lru_cache(maxsize=caches_size)(self._trim)
  8. In all places that used to call self._trim(), you decide whether you want to call the cached version self._cached_trim instead. I believe in some cases you may not want to (I take trim as an example because it is called from many places, and I suspect some callers would waste cache resources on very long strings that will never reappear):

    class Extractor:
        ...
        def _some_private_method(self, ...):
            ...
            trimmed_string = self._cached_trim(string)
            ...
  9. The advantage of this approach is thus that, beyond guaranteeing a zero-memory footprint once the extraction is done, you give yourself the capability to decide selectively, on a per-call basis, whether to use caching

  10. Your class may grow very large if you migrate all functions to methods, but in fact you only need to migrate the functions that need to be cached; the others can stay where they are, as private global functions
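Putting steps 3 to 8 together, a minimal self-contained sketch of the idea (Extractor, extract and _trim are illustrative stand-ins, not the actual trafilatura API):

import functools

class Extractor:

    def __init__(self, caches_size=4096):
        # the cache lives on the instance, so dropping the Extractor
        # drops the cached data with it
        self._cached_trim = functools.lru_cache(maxsize=caches_size)(self._trim)

    def extract(self, document):
        # each call site chooses the cached or the uncached variant
        return self._cached_trim(document)

    def _trim(self, text):
        # core logic, no caching
        return " ".join(text.split())

extractor = Extractor(caches_size=1024)
print(extractor.extract("  some   raw   text  "))
del extractor  # all cached data becomes garbage-collectable here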

adbar commented 2 years ago

Hi @dmoklaf, thanks for the detailed explanations!

The extractor class makes perfect sense but the trick with lru_cache looks a bit convoluted. Would you be interested in drafting a pull request with just the extractor class and the trimming function? I could go on from there.

dmoklaf commented 2 years ago

As a workaround, and maybe a simpler alternative, I have just cleared the caches myself every 1000 documents using the @lru_cache cache_clear() method.

I currently hit all the cached functions directly, which makes my code fragile (as these might change):

import htmldate.core
import htmldate.extractors
import htmldate.validators
import trafilatura.utils

def clear_htmldate_caches() -> None:
    htmldate.core.compare_reference.cache_clear()
    htmldate.extractors.try_date_expr.cache_clear()
    htmldate.validators.date_validator.cache_clear()
    htmldate.validators.filter_ymd_candidate.cache_clear()

def clear_trafilatura_caches() -> None:
    trafilatura.utils.line_processing.cache_clear()
    trafilatura.utils.trim.cache_clear()
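
Called like this in the crawling loop (a sketch; documents stands for whatever iterable yields the HTML payloads):

for i, document in enumerate(documents):
    date = htmldate.find_date(document)
    if i and i % 1000 == 0:  # clear the caches every 1000 documents
        clear_htmldate_caches()
        clear_trafilatura_caches()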

This could become the target solution if each library provided such a public clear_caches() function: no need for an object wrapper, and it does the job.

The advantage of this approach is that you (as the designer of the library) keep control over the cache sizes - it's not one-size-fits-all, some caches may need to be larger or smaller depending on how often each function is called and the frequency of its specific arguments.

What do you think?

kinoute commented 2 years ago

Maybe we could just change/set a custom cache size easily when both libraries are imported?

dmoklaf commented 2 years ago

Ah, that's a good solution. Unfortunately, @lru_cache is applied at import time, so once the library is imported it's too late: the caches are already built and can't be adjusted. The solution would be to rebuild them through a utility function.

We could have for each library a utility function that resets (including clearing) the caches to the appropriate size.

  1. The cached function code is slightly adjusted in each code file:
def trim(...):
    ... (core code of trim, no caching)

cached_trim = lru_cache(maxsize=utils.MAXSIZE)(trim)
  2. A global function is added to allow clients of the library to reset the caches:
def reset_caches(maxsize=utils.MAXSIZE):
    utils.cached_trim = functools.lru_cache(maxsize=maxsize)(utils.trim)
    ...
  3. Individual calls within the library to cached functions are renamed accordingly:
utils.trim(mystring)
becomes
utils.cached_trim(mystring)

The advantage of this approach is that, like the object approach, the library developer keeps the option, in specific cases, of calling trim instead of cached_trim, for example in places where the string is known to be very large and would waste process memory for nothing. I suspect this was one of the reasons for the very large memory consumption (but I may be wrong). That gives a lot of flexibility; a consolidated sketch follows.
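
A self-contained, single-module version of the trick (MAXSIZE and trim are stand-ins for the library's real settings and utility functions):

import functools

MAXSIZE = 8192

def trim(text):
    # core logic, no caching
    return " ".join(text.split())

# built once at import time
cached_trim = functools.lru_cache(maxsize=MAXSIZE)(trim)

def reset_caches(maxsize=MAXSIZE):
    # rebuilding the wrapper both empties the cache and resizes it
    global cached_trim
    cached_trim = functools.lru_cache(maxsize=maxsize)(trim)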

adbar commented 2 years ago

Yes, either a reset_caches() function, or two functions trim()/cached_trim() and an additional use_cache argument for the extraction functions?

dmoklaf commented 2 years ago

Yes both solutions fit the bill.

Regarding the second solution, having the user provide an additional use_cache argument has a development drawback for you: you would have to propagate this argument through each and every function call, plus perform if...else statements each time to determine whether to use the cached or the non-cached version. A simpler approach is to give the user an optional maxsize argument in a reset_caches function (like the first solution, but with this optional argument):

This function just rebuilds the cached version of the utility functions:

def reset_caches(maxsize=utils.MAXSIZE):
    utils.cached_trim = functools.lru_cache(maxsize=maxsize)(utils.trim)
    ...

That way, a single function answers all the needs and you don't have to propagate a flag everywhere.

The first solution is more limited functionally but also sufficient (for my needs).

PS: I tried to track the memory usage using the Python tracemalloc module and couldn't find the culprit. This is because tracemalloc tracks Python memory usage, not the memory used by C modules like lxml. So I think you are still caching lxml arguments somewhere. I am not sure this is useful (i.e., will the cache ever return a cached value? what is your use case where this would occur?), but you know your code much better than I do.

adbar commented 2 years ago

Yes, this solution would be easier: bundling all functions using lru_cache into a reset_caches() function should do the trick. Besides, function_name.cache_clear() is enough.
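
i.e. roughly (a sketch reusing the cached function names from the workaround above, not necessarily the final implementation):

def reset_caches() -> None:
    # inside htmldate these would be imported from core, extractors and validators
    compare_reference.cache_clear()
    try_date_expr.cache_clear()
    date_validator.cache_clear()
    filter_ymd_candidate.cache_clear()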

Since caching lxml trees was a problem, I already checked, and I think it's not happening anymore. So the memory use must come from somewhere else. I found that the URL manipulation library (courlan) was using a bit of memory and trimmed it down in the last version. Beyond that I don't know, but please share your results if you find something!

kinoute commented 2 years ago

@adbar So with the new release 1.3.0 and the addition of clear_caches(), what is the best way to avoid memory leaks? Use this after every find_date() call?

Edit: By the way, in the changelog it is called clear_caches() but I can only find reset_caches() in recent commits.

dmoklaf commented 2 years ago

I have not used the new code yet, only my own workaround (while waiting for this fix). The workaround works similarly, clearing all the caches, but in both libraries (trafilatura and htmldate), while this fix seems to focus only on htmldate and its charset_normalizer dependency (which is not in my workaround).

I call my workaround function to clear the caches every 1000 documents parsed. It has been running quite a lot, and the outcome is that clearing the caches works perfectly: I have processed a couple hundred thousand documents without any unbounded memory growth.

So my current understanding is that these were not leaks, just over-use of caches (a "default" configuration that can consume several gigabytes of memory if the caches are never cleared).

kinoute commented 2 years ago

Using latest version:

from htmldate.meta import reset_caches
reset_caches()

I'm getting the following error:

AttributeError                            Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 reset_caches()

Input In [8], in reset_caches()
     23 try:
     24     encoding_languages.cache_clear()
---> 25     is_suspiciously_successive_range.cache_clear()
     26     is_accentuated.cache_clear()
     27 # prevent possible changes in function names

AttributeError: 'function' object has no attribute 'cache_clear'

adbar commented 2 years ago

@kinoute Do you have the latest version of charset_normalizer? It could be that you have an older version where the function doesn't use an LRU cache yet.

As @dmoklaf says, you're free to call the function whenever you see fit, for example every n URLs. A similar function for Trafilatura will follow (see https://github.com/adbar/trafilatura/issues/219).

kinoute commented 2 years ago

@adbar Indeed, upgrading to the latest version of charset_normalizer fixed the issue. Thanks!

Can't wait for the same feature in Trafilatura!

Edit: In case somebody has the same problem: there was a conflict between requests and charset_normalizer. requests required charset_normalizer ~2.0.0; you need to upgrade requests to 2.28.1, which is compatible with charset_normalizer 2.1.0.