@hahahasan - if you get the chance, could you try squashing some of your data split into several files (e.g. `time_split_size=10`) and see if the parallelisation speeds up the squashing at all, compared to squashing to a single file?
I realised the check for an existing file was broken when `time_split_size` was passed. Fixed now, and a test added.
Tests are failing because of regex matching when checking the exception that's raised because a file already exists. Part of the message is the string `'squashoutput()'` (with brackets). When I tested on Python 3.9, I had to make it a raw string with escaped brackets `r"...squashoutput\(\)..."` to avoid `DeprecationWarning`s, but on the GHA server it's failing, saying

```
AssertionError: Regex pattern 'already exists, squashoutput\\(\\) will not overwrite. Also, for some filenames collect may try to read from this file, which is presumably not desired behaviour.' does not match '/tmp/pytest-of-runner/pytest-0/test_core_min_files_existing_s8/boutdata.nc already exists, squashoutput() will not overwrite. Also, for some filenames collect may try to read from this file, which is presumably not desired behaviour.'.
```

Does anyone know a nice way to do this match that will work on Python 3.7, 3.8 and 3.9? Otherwise I'm tempted to just remove the first part of the error message so the brackets aren't there.
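One possibility (an untested suggestion, not something from this PR) would be to build the pattern with `re.escape()`, which escapes the brackets the same way on Python 3.7, 3.8 and 3.9, so no hand-written raw string is needed. A minimal sketch, assuming the existing-file check raises `ValueError` and that the test uses `pytest.raises(match=...)` (both assumptions for illustration):

```python
import re

import pytest

from boutdata.squashoutput import squashoutput

# re.escape() escapes "(" and ")" consistently across Python versions,
# avoiding both the DeprecationWarning and the hand-written backslashes
expected = re.escape(
    "already exists, squashoutput() will not overwrite. Also, for some "
    "filenames collect may try to read from this file, which is presumably "
    "not desired behaviour."
)

# Exception type and call signature here are illustrative assumptions
with pytest.raises(ValueError, match=expected):
    squashoutput("data", outputname="boutdata.nc", time_split_size=10)
```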
I just tested the parallelised writing, and it was actually slower than writing in serial :disappointed: So I think I'll just revert the parallelisation. It's possible that the hold-up is the copying and serialisation/deserialisation when passing the arrays to the worker processes (I haven't attempted to do any timing to test this), in which case using a shared-memory array might help. I'm not going to try that now though (or any time soon!). Random thought: if using a shared array, it would be nice to be able to allocate and reshape the shared-memory array and then read directly into it to avoid copying, but that's not currently supported by `boututils.DataFile`, and as far as I can see also isn't supported by the `netcdf4` library (although I only looked quickly), so it seems impossible to implement (unless something like `np_var[:] = nc_var[:]`, where `np_var` is a numpy array and `nc_var` is a `netcdf4.Variable`, would read directly into the existing array without an intermediate copy, but I don't know whether it would or not).
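For reference, a minimal sketch of that shared-memory idea (not implemented in this PR). The file and variable names are purely illustrative, and whether the final assignment avoids an intermediate copy inside `netcdf4` is exactly the open question above:

```python
from multiprocessing import shared_memory  # Python >= 3.8

import numpy as np
from netCDF4 import Dataset

nc = Dataset("BOUT.dmp.0.nc", "r")  # hypothetical input file
nc_var = nc.variables["t_array"]    # hypothetical variable name

# Allocate shared memory sized for the variable, then view it as an ndarray
shm = shared_memory.SharedMemory(
    create=True, size=nc_var.dtype.itemsize * nc_var.size
)
np_var = np.ndarray(nc_var.shape, dtype=nc_var.dtype, buffer=shm.buf)

# The open question: does this read straight into the shared buffer, or does
# netcdf4 still materialise an intermediate array before the copy?
np_var[:] = nc_var[:]

# Worker processes would attach via shared_memory.SharedMemory(name=shm.name)

shm.close()
shm.unlink()
```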
Adds an option `time_split_size` for `squashoutput()` that can be used to split the output of `squashoutput` into several files, each with a size in the time-dimension of `time_split_size` (except the last file, which may be smaller when the length of the time dimension is not a multiple of `time_split_size`). The several output files are labelled consecutively by a counter - a second option `time_split_first_label` can be used to set the first label to a non-zero value for convenience.

When `squashoutput` is writing output to multiple files, the writing is parallelised if the `parallel` argument is set to a non-`False` value. Parallel writing may increase the memory usage, so it can be disabled (while keeping parallel reading) by passing `disable_parallel_write=True`.

Includes tests of both output splitting and parallel writing.
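A hedged usage sketch of these options; the exact signature of `squashoutput` and the naming scheme of the split files are assumptions for illustration:

```python
from boutdata.squashoutput import squashoutput

squashoutput(
    datadir="data",               # directory containing the BOUT.dmp.* files
    outputname="boutdata.nc",     # split outputs are labelled by a counter
                                  # (exact naming scheme assumed)
    time_split_size=10,           # at most 10 time points per output file
    time_split_first_label=1,     # start the counter at 1 instead of 0
    parallel=8,                   # non-False value enables parallelism
    disable_parallel_write=True,  # keep parallel reading, write in serial
)
```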