Is it possible to get the test data created in RAM instead of on disk?
It seems like it should be possible. One possibility (?): change `boututils.datafile.DataFile.__init__()` so that it gets the implementation from a dict that we can then monkey-patch a new implementation into. Or write an in-memory implementation of `boututils.datafile.DataFile` and monkey-patch it onto `DataFile`.

Edit: Although the part I haven't figured out with this is how to pass the 'file' to `collect`... That could actually be a show-stopper.
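For what it's worth, a minimal sketch of what the monkey-patch route could look like in a test, assuming a dict-backed fake that only implements the handful of methods the tests need. The fake's `read`/`write`/`list` interface and the `boutdata.collect.DataFile` patch target are assumptions here, not the real API:

```python
from unittest import mock


class FakeDataFile:
    """Dict-backed stand-in for boututils.datafile.DataFile (assumed interface)."""

    _store = {}  # shared so "writing a file" and re-"opening" it see the same data

    def __init__(self, filename, *args, **kwargs):
        self.filename = filename
        self._vars = self._store.setdefault(filename, {})

    def write(self, name, value):
        self._vars[name] = value

    def read(self, name):
        return self._vars[name]

    def list(self):
        return list(self._vars)

    def close(self):
        pass


def test_collect_in_memory():
    # Patch the name that the code under test looks up; the exact target
    # string would need checking against boutdata's imports.
    with mock.patch("boutdata.collect.DataFile", FakeDataFile):
        ...  # build fake dump "files" via FakeDataFile and call collect() here
```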
Edit2: The way we got around this for xbout was to handle a list of `xarray.Dataset` passed to the `datapath` argument of `open_boutdataset()`. We could possibly also do this for `collect` by allowing a list of `DataFile` to be passed - not sure how much of `collect()`'s logic this would bypass...
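A rough sketch of that xbout-style approach, showing only the shape of the call; real BOUT++ dumps need the full set of metadata variables, which are omitted here, so this exact snippet would not open cleanly as written:

```python
import numpy as np
import xarray as xr
from xbout import open_boutdataset

# One Dataset per (fake) processor, built entirely in memory; the variable
# name and shape here are placeholders, not a valid BOUT++ dump layout.
datasets = [
    xr.Dataset({"n": (("t", "x", "y", "z"), np.zeros((2, 4, 4, 4)))})
    for _ in range(4)
]

# Per the comment above, xbout's open_boutdataset() can take in-memory
# Datasets via the datapath argument, so no files ever hit the disk.
ds = open_boutdataset(datapath=datasets)
```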
There's this package https://pypi.org/project/memory-tempfile/, which might allow doing file I/O on a RAM disk (on Linux; it seems to be possible to provide a fall-back to disk-based I/O for other OSes).
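As an illustration of the idea using only the standard library (rather than memory-tempfile's own API): put the temporary test directory on a tmpfs mount such as /dev/shm when one is available and writable, and fall back to the normal on-disk temp directory otherwise:

```python
import os
import tempfile


def ram_backed_tempdir(prefix="boutdata-tests-"):
    """TemporaryDirectory on tmpfs (/dev/shm) if available, otherwise on disk."""
    ram_root = "/dev/shm"
    if os.path.isdir(ram_root) and os.access(ram_root, os.W_OK):
        return tempfile.TemporaryDirectory(prefix=prefix, dir=ram_root)
    return tempfile.TemporaryDirectory(prefix=prefix)


with ram_backed_tempdir() as tmpdir:
    print("test data would be written under", tmpdir)
```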
For me they are even slower:
====================== 591 passed in 62823.58s (17:27:03) ======================
I got it to run faster locally on my laptop with everything on a RAM disk; there it finished in 10 hours. But that might be due to some failures and it not finishing:
============================= test session starts ==============================
platform linux -- Python 3.12.0, pytest-7.4.2, pluggy-1.3.0
rootdir: /tmp/boutdata
plugins: anyio-3.7.1
collected 591 items
boutdata/tests/test_boutoptions.py ................ [ 2%]
boutdata/tests/test_collect.py ......................................... [ 9%]
........................................................................ [ 21%]
........................................................................ [ 34%]
........................................................................ [ 46%]
........................................................................ [ 58%]
........................................................................ [ 70%]
...............................................F..F..F..F..F..F..F..F..F [ 82%]
..F..F..
real 648m38.772s
user 499m23.204s
sys 101m39.977s
This is on Fedora Rawhide; on Fedora 38 they are still "fast" and finish within a reasonable time frame ...
Removing `gc.collect()` from squashing makes it much faster ...
Is that there as a workaround for xarray's aggressive caching?
boutdata should not be using xarray?
I think it was put there as a (premature?) optimisation to minimise memory usage ...
Coming back to the original thread: I doubt moving I/O to a ramdisk or similar would help, since we spend 30 minutes in user space and only 1 minute in sys:
real 31m17.721s
user 30m35.586s
sys 1m3.874s
The unit tests take about 30 minutes to complete; mostly it's `collect`. Maybe we can mock out the netCDF calls with something much faster?
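One possible direction for that, assuming the test data ultimately goes through netCDF4: the library can keep a file entirely in memory with `diskless=True`, so the existing code paths could in principle be pointed at RAM rather than fully mocked. Whether boutdata/boututils expose a hook to pass that flag through is a separate question; this just shows the underlying netCDF4 feature:

```python
import numpy as np
from netCDF4 import Dataset

# diskless=True keeps the "file" in memory; nothing is written to the filesystem.
nc = Dataset("in_memory.nc", "w", diskless=True)
nc.createDimension("x", 4)
var = nc.createVariable("n", "f8", ("x",))
var[:] = np.arange(4.0)
print(nc["n"][:])
nc.close()
```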