Closed. adbar closed this issue 2 years ago.
Running your example with memray, it also seems related to lxml:
```
0 : 43.53125
1 : 44.6875
2 : 45.953125
3 : 47.3515625
4 : 48.1015625
5 : 49.23828125
6 : 51.14453125
7 : 51.765625
8 : 53.2890625
9 : 54.8046875
```
I found it: it wasn't lxml as such, but rather a combination with a function cache.
Using master, the increase is acceptable now, good job. Can't wait for the new release.
It's out ✔️
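For readers hitting similar issues: the failure mode fixed here (a function cache retaining large arguments) is easy to reproduce in isolation. The toy sketch below is not htmldate's actual code; it just shows how an lru_cache pins every distinct argument it has seen:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def process(doc: str) -> int:
    # stand-in for an expensive parsing step
    return len(doc)

# feed 100 distinct ~1 MB documents: the cache keeps them all alive
for i in range(100):
    process("x" * 1_000_000 + str(i))

info = process.cache_info()
print(info.currsize)  # → 100 entries, each pinning ~1 MB of string data

process.cache_clear()  # the only way to release them
print(process.cache_info().currsize)  # → 0
```

Tools like tracemalloc only see the Python-side allocations, so when the cached arguments are C-level objects (e.g. lxml trees) the retained memory is easy to miss.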
I ran into this bug as well tonight through trafilatura. It was making the Python process very difficult to scale in the number of pages crawled. Are you sure about the remaining lru_cache calls, i.e., that their function arguments do not involve a lot of memory (e.g. non-small HTML strings or lxml trees as arguments)? Thanks
@dmoklaf Yes, the main problem has been addressed: there are no lxml trees as arguments anymore. Besides, a few other lru_cache calls have been removed.
You may have to update htmldate manually in order to benefit from the fix; a new trafilatura version should be out soon.
Thanks - I did that last night but still have regular memory increases (with a few thousand docs I reach 1 GB quickly, and from there it keeps growing).
The use of global variables to hold data (in this case, the dictionaries hidden behind @lru_cache in both htmldate and trafilatura) is causing this and preventing me from telling the process "get rid of all your caches" once the crawling is done.
If in the next major version you refactor htmldate's and trafilatura's code, it might be beneficial to encapsulate all the cache data structures (in fact, everything) in an object. This way, when the client of your library is done, they just get rid of this object and have a 100% guarantee that all the memory is freed up (no global variables holding user data staying in the background).
Thanks for your input, the idea seems interesting. Could you please point me to examples of the solution you describe?
In the meantime, you can try to set the CACHE_SIZE value in settings.py much lower (say 2 instead of 8192).
I have forced CACHE_SIZE=2 in both htmldate and trafilatura, and I can confirm that the memory growth is gone. It's definitely the various caches which are growing very large and can't be cleared afterward.
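The reason a tiny CACHE_SIZE bounds the growth can be seen with cache_info(): however many distinct inputs pass through, currsize never exceeds maxsize. A generic sketch (not htmldate's actual settings mechanism):

```python
from functools import lru_cache

@lru_cache(maxsize=2)
def trim(text: str) -> str:
    # stand-in for a cached utility function
    return " ".join(text.split())

for i in range(10_000):
    trim(f"document {i}   with   extra   spaces")

print(trim.cache_info())  # currsize is capped at 2 despite 10,000 distinct inputs
```

The trade-off is that with nearly all inputs being distinct, the cache hit rate is close to zero anyway, which is why shrinking it costs little here.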
A clean encapsulation into an object would look like this (illustrated here for Trafilatura, but same principle for both libs). Please read the very last sentence because it simplifies A LOT the work to be done (to something that I think can be done in 1 hour of work):
Externally, clients of the library don't call global functions anymore. They first create an Extractor object with the proper cache configuration. This object provides public methods to do the job:
```python
extractor = trafilatura.Extractor(caches_size=1024)
for document in documents:
    extraction = extractor.extract(document)
    ...
```
When clients are done with the extraction, they just get rid of this Extractor object. As this object carries all the caching structure (see below), it guarantees them all the memory will be freed up upon the next garbage collection
You thus define this class with a tiny skeleton that does nothing:
```python
class Extractor:
    def __init__(self, caches_size=4096):
        pass
```
You remove all @lru_cache tags
You move all public and private functions to methods of this new class, prefixing all the private methods (those not to be called by your clients, like trim) with a single underscore _:
```python
class Extractor:
    ...
    def _some_private_method(self, ...):
        ...
        trimmed_string = self._trim(string)
        ...

    def _trim(self, text):
        ...
```
You verify that it works as before (though it may be slow, as no caching is going on)
You add back the caches with the following trick: in the constructor of Extractor, you wrap the methods you want to cache using the good old lru_cache:
```python
class Extractor:
    ...
    def __init__(self, caches_size=4096):
        self._cached_trim = functools.lru_cache(maxsize=caches_size)(self._trim)
```
In all places that used to call self._trim(), you decide whether to call the cached version self._cached_trim instead. I believe in some cases you may not want to do that (I take trim as an example because it is called from many places, and I am not sure some callers aren't wasting cache resources on very long strings that will never reappear):
```python
class Extractor:
    ...
    def _some_private_method(self, ...):
        ...
        trimmed_string = self._cached_trim(string)
        ...
```
The advantage of this approach is thus that, beyond guaranteeing a zero-memory footprint once the extraction is done, you give yourself the capability to decide selectively, on a per-call basis, whether to use caching
Your class may grow very large if you migrate all functions to methods, but in fact you only need to migrate the functions that need to be cached; the others can stay where they are, as private global functions
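Assembled into one runnable sketch (toy code, not trafilatura's actual API; _trim here is just whitespace normalization, and the 100,000-character threshold is an arbitrary illustration), the pattern gives each Extractor instance its own cache that dies with the object:

```python
from functools import lru_cache

class Extractor:
    def __init__(self, caches_size: int = 4096):
        # per-instance cache: it is freed together with the Extractor object
        self._cached_trim = lru_cache(maxsize=caches_size)(self._trim)

    def _trim(self, text: str) -> str:
        # toy stand-in for a private utility method
        return " ".join(text.split())

    def extract(self, document: str) -> str:
        # per-call choice: skip the cache for very large one-off inputs
        if len(document) > 100_000:
            return self._trim(document)
        return self._cached_trim(document)

extractor = Extractor(caches_size=1024)
print(extractor.extract("  hello   world  "))  # → hello world
del extractor  # the whole cache becomes garbage-collectable with the object
```

Because the lru_cache wrapper lives on the instance, dropping the Extractor makes the whole cache collectable; the reference cycle it creates (instance → wrapper → bound method → instance) is handled by Python's garbage collector.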
Hi @dmoklaf, thanks for the detailed explanations!
The extractor class makes perfect sense, but the trick with lru_cache looks a bit convoluted.
Would you be interested in drafting a pull request with just the extractor class and the trimming function? I could go on from there.
As a workaround, and maybe a simpler alternative, I have just cleared the caches myself every 1000 documents using the lru_cache cache_clear() method.
I currently hit all the cached functions directly, which makes my code fragile (as these might change):
```python
def clear_htmldate_caches() -> None:
    htmldate.core.compare_reference.cache_clear()
    htmldate.extractors.try_date_expr.cache_clear()
    htmldate.validators.date_validator.cache_clear()
    htmldate.validators.filter_ymd_candidate.cache_clear()

def clear_trafilatura_caches() -> None:
    trafilatura.utils.line_processing.cache_clear()
    trafilatura.utils.trim.cache_clear()
```
This could become a target solution if each library were to provide such a public clear_caches() function. No need for an object wrapper, and it does the job.
The advantage of this approach is that you (as the designer of the library) keep control over the cache sizes - it's not one-size-fits-all, some caches may need to be larger or smaller depending on how often each function is called and the frequency of its specific arguments.
What do you think?
Maybe we could just change/set a custom cache size easily when both libraries are imported?
Ah, that's a good solution. Unfortunately, @lru_cache is applied at import time, so once the module is imported it's too late: the caches are already set and can't be adjusted. The solution would be to rebuild them through a utility function.
We could have for each library a utility function that resets (including clearing) the caches to the appropriate size.
```python
def trim(...):
    ...  # core code of trim, no caching

cached_trim = lru_cache(maxsize=utils.MAXSIZE)(trim)

def reset_caches(maxsize=utils.MAXSIZE):
    utils.cached_trim = functools.lru_cache(maxsize=maxsize)(utils.trim)
    ...
```

Each call site then changes from utils.trim(mystring) to utils.cached_trim(mystring).
The advantage of this approach is that, like the object approach, the library developer keeps the possibility of calling trim instead of cached_trim in specific cases, for example in places where they know that the string is very large and would waste process memory for nothing. I suspect this was one of the reasons for the very large memory consumption (but I may be wrong). That gives a lot of flexibility.
Yes, either a reset_caches() function, or two functions trim()/cached_trim() and an additional use_cache argument for the extraction functions?
Yes both solutions fit the bill.
Regarding the second solution, having the user provide an additional use_cache argument has a development drawback for you: you would have to propagate this argument in each and every function call, plus perform if...else statements each time to determine which version of the cached/non-cached functions to use. A simpler approach is to provide the user with an optional maxsize argument in a reset_caches function (like the first solution, but with this optional argument):
- reset_caches(maxsize=0) to disable the caches entirely
- reset_caches with the appropriate maxsize to resize them
- reset_caches() to simply clear them
This function just rebuilds the cached version of the utility functions:
```python
def reset_caches(maxsize=utils.MAXSIZE):
    utils.cached_trim = functools.lru_cache(maxsize=maxsize)(utils.trim)
    ...
```
That way, a single function answers all the needs and you don't have to propagate a flag everywhere.
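The rebuilding trick can be sketched in isolation (the library's utils module is simulated here with a SimpleNamespace; MAXSIZE and the function names are placeholders, not real htmldate attributes):

```python
from functools import lru_cache
from types import SimpleNamespace

def trim(text: str) -> str:
    return " ".join(text.split())

# stand-in for the library's utils module
utils = SimpleNamespace(MAXSIZE=4096, trim=trim)
utils.cached_trim = lru_cache(maxsize=utils.MAXSIZE)(utils.trim)

def reset_caches(maxsize: int = 4096) -> None:
    # rebuilding the wrapper drops the old cache and applies the new size
    utils.cached_trim = lru_cache(maxsize=maxsize)(utils.trim)

utils.cached_trim("a   b")
reset_caches(maxsize=0)  # maxsize=0 disables caching entirely
print(utils.cached_trim.cache_info().maxsize)  # → 0
```

Rebinding the module attribute drops the last reference to the old wrapper, so its whole cache becomes garbage-collectable, which covers clearing, resizing, and disabling in one function.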
The first solution is more limited functionally but sufficient too (for my needs).
PS: I tried to track the memory usage using the Python tracemalloc module and couldn't find the culprit. This is because tracemalloc tracks Python memory usage, not the memory usage of C modules like lxml. So I think you are still caching lxml arguments somewhere. I am not sure this is useful (i.e. will the cache ever bring back a cached value? what is your use case where this would occur?), but you know your code much better than me
Yes, this solution would be easier, and bundling all functions using lru_cache in a reset_caches() function should do the trick. Besides, function_name.cache_clear() is enough.
Since caching lxml trees was a problem, I already checked and I think it's not happening anymore. So the memory use should come from somewhere else. I found that the URL manipulation library (courlan) was using a bit of memory and trimmed it down in the last version.
For the rest I don't know but please share your results if you find something!
@adbar So with the new release 1.3.0 and the addition of clear_caches(), what is the best way to avoid memory leaks? Use this after every find_date() call?
Edit: By the way, in the changelog it is called clear_caches(), but I can only find reset_caches() in recent commits.
I have not used the new code yet, only my own workaround (waiting for this fix). The workaround works similarly, clearing all the caches, but in both libraries (trafilatura and htmldate), while this fix seems to focus on htmldate and its charset_normalizer dependency only (which is not in my workaround).
I call my workaround function to clear the caches every 1000 documents parsed. The outcome of my workaround, which has been running quite a lot, is that clearing the caches works perfectly. I have been using it for a couple hundred thousand documents without any unbounded growth in memory anymore.
So my current understanding is that these were not leaks, but just over-use of caches (a "default" configuration of the caches that would use several gigabytes of memory if never cleared).
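The every-N-documents clearing scheme is just a counter wrapped around cache_clear(); a self-contained sketch with a stand-in parse function (not the real extraction code):

```python
from functools import lru_cache

@lru_cache(maxsize=8192)
def parse(doc: str) -> int:
    # stand-in for the real per-document extraction work
    return len(doc)

CLEAR_EVERY = 1000

for i, doc in enumerate(f"document {n}" for n in range(3500)):
    parse(doc)
    if (i + 1) % CLEAR_EVERY == 0:
        parse.cache_clear()  # bound the memory: drop everything cached so far

# 3500 docs, cleared after 1000/2000/3000 -> 500 entries remain
print(parse.cache_info().currsize)  # → 500
```

Clearing costs at most a few redundant recomputations right after each reset, in exchange for a hard upper bound on the cache's footprint.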
Using the latest version:

```python
from htmldate.meta import reset_caches
reset_caches()
```
I'm getting the following error:
```
AttributeError                            Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 reset_caches()

Input In [8], in reset_caches()
     23 try:
     24     encoding_languages.cache_clear()
---> 25     is_suspiciously_successive_range.cache_clear()
     26     is_accentuated.cache_clear()
     27     # prevent possible changes in function names

AttributeError: 'function' object has no attribute 'cache_clear'
```
@kinoute Do you have the latest version of charset_normalizer? It could be that you have an older version where the function doesn't use an LRU cache yet.
As @dmoklaf says you're free to use the function whenever you see fit, for example every n URLs. A similar function for Trafilatura will follow (see https://github.com/adbar/trafilatura/issues/219).
@adbar Indeed, upgrading to the latest version of charset_normalizer fixed the issue. Thanks!
Can't wait for the same feature in Trafilatura!
Edit: Just in case somebody has the same problem, there was a conflict between requests and charset_normalizer: requests needed a charset_normalizer version of ~2.0.0. You need to upgrade requests to 2.28.1, which is compatible with charset_normalizer 2.1.0.
See issue https://github.com/adbar/trafilatura/issues/216.
Extracting the date from the same web page multiple times shows that the module is leaking memory; this doesn't appear to be related to extensive_search. tracemalloc doesn't give any clue.