Closed — veenstrajelmer closed this issue 2 months ago
What is very interesting is that using a `with` statement resolves all this, this time including merging:
```python
import os
import glob
import datetime as dt
from time import sleep

import xarray as xr
import xugrid as xu

def open_part_ds(file_nc_list, withwith):
    print(f'>> xu.open_dataset() with {len(file_nc_list)} partition(s): ', end='')
    dtstart = dt.datetime.now()
    partitions = []
    for iF, file_nc_one in enumerate(file_nc_list):
        print(iF + 1, end=' ')
        if withwith:
            with xr.open_mfdataset(file_nc_one, chunks="auto") as ds_one:
                uds_one = xu.core.wrap.UgridDataset(ds_one)
            partitions.append(uds_one)
        else:
            ds_one = xr.open_mfdataset(file_nc_one, chunks="auto")
            uds_one = xu.core.wrap.UgridDataset(ds_one)
            # ds_one.close()
            # uds_one.close()
            partitions.append(uds_one)
    print(': ', end='')
    print(f'{(dt.datetime.now() - dtstart).total_seconds():.2f} sec')

    print('>> xu.merge_partitions(): ', end='')
    dtstart = dt.datetime.now()
    uds = xu.merge_partitions(partitions)
    print(f'{(dt.datetime.now() - dtstart).total_seconds():.2f} sec')
    return uds

dir_model = r"p:\11210284-011-nose-c-cycling\runs_fine_grid\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\B05_waq_2012_PCO2_ChlC_NPCratios_DenWat_stats_2023.01\DFM_OUTPUT_DCSM-FM_0_5nm_waq"
file_nc_pat = os.path.join(dir_model, "DCSM-FM_0_5nm_waq_0*_map.nc")
file_nc_list_all = glob.glob(file_nc_pat)
file_nc_list = file_nc_list_all[:5]

uds = open_part_ds(file_nc_list, withwith=False)
sleep(2)
```
With `withwith=False`:

*(memory usage plot)*

With `withwith=True`:

*(memory usage plot)*

Or `withwith=False` and `ds.close()` (`uds.close()` does not do the trick):

*(memory usage plot)*
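As an aside, the guarantee the `with` variant gives can be mimicked with a stdlib-only sketch (`FakeDataset` is a hypothetical stand-in for a file-backed xarray dataset, not part of any library): even though a wrapper object created inside the block keeps a reference, the underlying handle is closed as soon as the block exits.

```python
class FakeDataset:
    """Hypothetical stand-in for a file-backed dataset."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # mirrors xarray's behavior: leaving the block closes the file handle
        self.close()
        return False

partitions = []
with FakeDataset() as ds_one:
    uds_one = {"wrapped": ds_one}  # wrapper still references the dataset
partitions.append(uds_one)

print(ds_one.closed)  # True: handle released despite the lingering reference
```

The wrapper stays usable as a plain Python object; only the resource held by the dataset is released at the end of the block.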
From this it can be concluded that it is wise to close the original xarray dataset when it is no longer used. The time and memory consumption of merging is unaffected by this. I will at least pick this up in https://github.com/Deltares/dfm_tools/issues/968, but it might also be good to add it to the xugrid documentation. Adding it to `xu.open_dataset()` itself has no added benefit, since users might use the returned uds partition directly (e.g. for removing ghost cells), in which case the memory consumption returns.
If the user performs another action (like plotting a single timestep) on the merged dataset, the memory usage increases again to the level we saw without `ds.close()`. This is documented in https://github.com/Deltares/dfm_tools/issues/968, with a clean version in https://github.com/Deltares/dfm_tools/issues/484. Therefore, closing the datasets seems not useful after all. Furthermore, it is clear that `engine="h5netcdf"` consumes far less memory (40 MB instead of 110 MB per partition), but `xr.open_dataset()` turned out to be much slower with it for datasets with many variables, as in this example. This might be fixed by https://github.com/h5netcdf/h5netcdf/issues/195.

Since this is not an issue with xugrid, this issue can be closed.
Running the following script, called `memory_usage.py`, with memory_profiler via `mprof run python memory_usage.py` and `mprof plot` results in this memory usage:

*(memory usage plot)*
However, when commenting out `partitions.append(uds_one)`, we get way less memory usage and we see garbage collection in action:

*(memory usage plot)*

The accumulating memory consumption upon appending is inconvenient, since we want to build a list of partitions for `xu.merge_partitions()`. Calling `gc.collect()` after `xr.open_dataset()` (or elsewhere) does not make a difference.

Might be related to: