amaltaro opened this pull request 2 months ago
I finally managed to do memory measurements with the code currently provided in this PR. It compares the current RucioConMon implementation in the WMCore stack under 2 scenarios: 1) fetching compressed data from RucioConMon (`format=raw`, yielding the dump line by line through a generator); 2) fetching uncompressed data from RucioConMon (`format=json`, loading the whole dump into memory).
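For reference, a minimal sketch of how the two scenarios are exercised from the client side; the `getRSEUnmerged(rseName, zipped=...)` call is the one shown in the profiler output below, while the constructor argument and the RSE name are just illustrative assumptions:

```python
from WMCore.Services.RucioConMon.RucioConMon import RucioConMon

# illustrative endpoint/RSE; the constructor argument is an assumption here
rucioConMon = RucioConMon("https://cmsweb.cern.ch/rucioconmon/unmerged")

# Scenario 1: format=raw, compressed dump consumed line by line (generator)
for lfn in rucioConMon.getRSEUnmerged("T1_US_FNAL_Disk", zipped=True):
    pass  # each LFN is handled as it is yielded

# Scenario 2: format=json, the whole dump is loaded into memory at once
for lfn in rucioConMon.getRSEUnmerged("T1_US_FNAL_Disk", zipped=False):
    pass
```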
I ran these tests on vocms0259, so that I could measure memory usage on the node with the Grafana host monitor. See the screenshot below:
Some observations: as I was also running `memory_profiler`, here is a breakdown of the `format=raw` vs `format=json` runs, where we can see that the application memory barely changes in the raw/compressed implementation, but reaches a multi-GB footprint in the json one (no generator).
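For context, this kind of per-line breakdown is what `memory_profiler` produces; a minimal sketch of how it is typically wired up (the function and invocation below are illustrative, not the actual test script):

```python
# pip install memory_profiler
from memory_profiler import profile

@profile
def consume(generator):
    """Drain a generator while memory_profiler records per-line memory usage."""
    count = 0
    for _ in generator:
        count += 1
    return count

if __name__ == "__main__":
    # "python thisscript.py" (with the import above) or
    # "python -m memory_profiler thisscript.py" prints a table like the ones below
    consume(iter(range(1000000)))
```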
`format=raw` (also faster!):

```
(WMAgent-2.3.4.3) [xxx@xxx:install]$ python testRucioConMonMem.py
2024-09-06 19:49:28,773:INFO:testRucioConMonMem: Fetching unmerged dump for RSE: T1_US_FNAL_Disk with compressed data: True
2024-09-06 19:49:28,788:INFO:testRucioConMonMem: Fetching data from Rucio ConMon for RSE: T1_US_FNAL_Disk.
2024-09-06 19:49:28,802:INFO:RucioConMon: Size of rseUnmerged object: 11888
2024-09-06 20:48:44,553:INFO:testRucioConMonMem: Total files received: 10877227, unique dirs: 12885
Filename: testRucioConMonMem.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
27 36.0 MiB 36.0 MiB 1 @profile
28 def getUnmergedFiles(rucioConMon, logger, compressed=False):
29 36.0 MiB 0.0 MiB 1 dirs = set()
30 36.0 MiB 0.0 MiB 1 counter = 0
31 36.0 MiB 0.0 MiB 1 logger.info("Fetching data from Rucio ConMon for RSE: %s.", RSE_NAME)
32 68.6 MiB 30.6 MiB 10877228 for lfn in rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed):
33 68.6 MiB 2.0 MiB 10877227 dirPath = _cutPath(lfn)
34 68.6 MiB 0.0 MiB 10877227 dirs.add(dirPath)
35 #logger.info(f"Size of dirs object: {asizeof.asizeof(dirs)}")
36 68.6 MiB 0.0 MiB 10877227 counter += 1
37 68.6 MiB 0.0 MiB 1 logger.info(f"Total files received: {counter}, unique dirs: {len(dirs)}")
38 68.6 MiB 0.0 MiB 1 return dirs
2024-09-06 20:48:44,555:INFO:testRucioConMonMem: Done!
```
`format=json`:

```
(WMAgent-2.3.4.3) [xxx@xxx:install]$ python testRucioConMonMem.py
2024-09-06 21:12:16,810:INFO:testRucioConMonMem: Fetching unmerged dump for RSE: T1_US_FNAL_Disk with compressed data: False
2024-09-06 21:12:16,825:INFO:testRucioConMonMem: Fetching data from Rucio ConMon for RSE: T1_US_FNAL_Disk.
2024-09-06 21:20:38,011:INFO:RucioConMon: Size of rseUnmerged object: 2812956952
2024-09-06 22:18:50,841:INFO:testRucioConMonMem: Total files received: 10877227, unique dirs: 12885
Filename: testRucioConMonMem.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
27 35.8 MiB 35.8 MiB 1 @profile
28 def getUnmergedFiles(rucioConMon, logger, compressed=False):
29 35.8 MiB 0.0 MiB 1 dirs = set()
30 35.8 MiB 0.0 MiB 1 counter = 0
31 35.8 MiB 0.0 MiB 1 logger.info("Fetching data from Rucio ConMon for RSE: %s.", RSE_NAME)
32 2946.0 MiB 24.6 MiB 10877228 for lfn in rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed):
33 2946.0 MiB 3.0 MiB 10877227 dirPath = _cutPath(lfn)
34 2946.0 MiB 0.2 MiB 10877227 dirs.add(dirPath)
35 #logger.info(f"Size of dirs object: {asizeof.asizeof(dirs)}")
36 2946.0 MiB 0.0 MiB 10877227 counter += 1
37 63.6 MiB -2882.4 MiB 1 logger.info(f"Total files received: {counter}, unique dirs: {len(dirs)}")
38 63.6 MiB 0.0 MiB 1 return dirs
2024-09-06 22:18:50,842:INFO:testRucioConMonMem: Done!
```
I will make sure these changes are reflected in https://github.com/dmwm/WMCore/pull/12059 and proceed with this development over there.
Alan, it seems to me that the actual issue is the accumulation of results in this for loop: `for lfn in rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed):`. How about converting the code to a generator and letting the client process it? You may measure the size of the returned `dirs` set; it is likely the constant baseline you observe in Grafana, which is unavoidable. The JSON format adds the overhead of loading the entire JSON data into RAM.
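A generic sketch of the generator pattern being suggested here; the names are hypothetical and not the WMCore API:

```python
def iterUnmergedLFNs(lines):
    """Hypothetical helper: yield LFNs one at a time instead of building
    and returning the full list, so only one line is held in memory."""
    for line in lines:
        lfn = line.strip()
        if lfn:
            yield lfn

# the client drains the generator; memory is dominated by whatever the
# client itself accumulates (e.g. the dirs set), not by the dump itself
dirs = set()
for lfn in iterUnmergedLFNs(open("lfns.txt")):
    dirs.add(lfn.rsplit("/", 1)[0])
```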
The function `rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed)` returns a generator to the client, and in this example the client is `testRucioConMonMem.py` (I am running a test similar to what MSUnmerged does here).
With that said, this line: `for lfn in rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed):` is in the correct place, as this is where each LFN gets parsed and where we decide what to do with it (on the client side). Please let me know if I misunderstood your comment, though.
Alan, the issue is in the RucioConMon and Service modules. Here is my insight into their behavior:

- RucioConMon relies on the `refreshCache` API from the Service module, see https://github.com/dmwm/WMCore/blob/3ea5301e7191107e646c114f71207a4e8265c67e/src/python/WMCore/Services/RucioConMon/RucioConMon.py#L73
- `refreshCache` makes a `getData` call, see https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Service.py#L318
- the data is loaded via `json.loads` and then written back to a file, see https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/Service.py#L339

My proposal is to stream the data from `makeRequest`, which should be converted to a generator; to preserve backward compatibility you would need a new API for that. This way the code does not read the full data set from the remote server, but rather reads it line by line and yields it back to the client. Also, to avoid converting the data from a list to a set (a unique set of LFNs) you have two options: either enforce a uniqueness constraint on the back-end and avoid it on the client, or, if you use a cache file, let an external process handle uniqueness in the file, e.g. `cat lfns.txt | sort -n | uniq > new_file.txt`, and switch your cache to the new file.
From my observation, the current implementation of fetching data has an unavoidable memory footprint because the data is loaded into Python after the HTTP call, and the larger the HTTP response, the larger the memory footprint the code has to deal with.
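To make the streaming idea concrete, here is a minimal sketch of a line-yielding fetch; it uses `requests` purely for illustration (the real WMCore code goes through `pycurl_manager`, and the function name here is hypothetical):

```python
import requests

def streamRSEUnmerged(url, timeout=300):
    """Hypothetical streaming API: fetch the dump with a streaming HTTP
    request and yield one LFN per line, instead of loading the whole
    response body (or a cache file) into memory first."""
    with requests.get(url, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        # iter_lines() keeps only the current line of the response in memory
        for line in resp.iter_lines(decode_unicode=True):
            if line:
                yield line

# uniqueness can then be delegated to an external step on the cached file,
# e.g.: cat lfns.txt | sort | uniq > lfns_unique.txt
```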
Valentin, you seem to have captured the flow of an HTTP call through the `Service` parent class well.
To add to what you described above, I think the actual data is loaded into memory in the lowest-level base class (`pycurl_manager`), at these lines: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/pycurl_manager.py#L342-L343, where it can also be automatically decompressed if the content is in `gzip` format. In that scenario, the cache file on the filesystem will not be binary either; it will follow the content-type of the response object (json, text, etc).
To really minimize the memory footprint, we would have to stream data from server to client, fetching one row at a time. I do not know the exact details, but I guess the server would have to support newline-delimited (or similar) data streaming, and the connection between client and server would have to remain open until the client exhausts all the data in the response object.
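For illustration, streaming does not necessarily rule out compression: a gzip body can be decompressed incrementally as chunks arrive. A small sketch with `zlib`, independent of the WMCore/pycurl code, assuming `chunks` is any iterable of raw gzip-encoded bytes:

```python
import zlib

def iterDecompressedLines(chunks):
    """Decompress gzip-encoded chunks on the fly and yield complete lines,
    so neither the compressed nor the decompressed body is held in full."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    buffer = b""
    for chunk in chunks:
        buffer += decompressor.decompress(chunk)
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line.decode("utf-8")
    buffer += decompressor.flush()
    if buffer:
        yield buffer.decode("utf-8")
```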
This data streaming somewhat conflicts with the custom data caching we have implemented in the `Services` Python package, so it needs to be carefully thought through and implemented.
What you are looking for is the NDJSON data format, which the server must implement; basically, it is a list of JSON records separated by '\n'. This way the client can request such data (it can be in zipped format as well) and read one record at a time, similar to how the CSV data format is processed. The total amount of memory required for the entire set of records is then reduced to the size of a single record.
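A tiny illustration of the NDJSON format and of consuming it one record at a time (the file name and record layout are made up for the example):

```python
import json

# writing NDJSON: one JSON record per line
with open("unmerged.ndjson", "w") as fobj:
    for lfn in ["/store/unmerged/a.root", "/store/unmerged/b.root"]:
        fobj.write(json.dumps({"lfn": lfn}) + "\n")

# reading NDJSON: only one record is decoded and held in memory at a time
with open("unmerged.ndjson") as fobj:
    for line in fobj:
        record = json.loads(line)
        print(record["lfn"])
```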