boutproject / boutdata

GNU Lesser General Public License v3.0

Option to split output in time into several files #50

Closed johnomotani closed 3 years ago

johnomotani commented 3 years ago

Adds an option time_split_size for squashoutput() that can be used to split the output of squashoutput into several files, each with a size in the time dimension of time_split_size (except the last file, which may be smaller when the length of the time dimension is not a multiple of time_split_size). The output files are labelled consecutively by a counter; a second option, time_split_first_label, can be used to start the counter at a non-zero value for convenience.
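The splitting arithmetic described above can be sketched in plain Python. This is a hypothetical helper illustrating the behaviour, not the actual implementation in squashoutput:

```python
def split_time_indices(nt, time_split_size, time_split_first_label=0):
    """Return (label, start, stop) tuples describing how a time dimension
    of length nt is divided into files of at most time_split_size points.

    The last chunk may be shorter when nt is not a multiple of
    time_split_size, matching the behaviour described in the PR.
    """
    chunks = []
    for i, start in enumerate(range(0, nt, time_split_size)):
        stop = min(start + time_split_size, nt)
        chunks.append((time_split_first_label + i, start, stop))
    return chunks

# 25 time points split into files of 10, with labels starting at 1:
print(split_time_indices(25, 10, time_split_first_label=1))
# → [(1, 0, 10), (2, 10, 20), (3, 20, 25)]
```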

When squashoutput is writing output to multiple files, the writing is parallelised if the parallel argument is set to a non-False value. Parallel writing may increase the memory usage, so it can be disabled (while keeping parallel reading) by passing disable_parallel_write=True.

Includes tests of both output splitting and parallel writing.

johnomotani commented 3 years ago

@hahahasan - if you get the chance, could you try squashing some of your data split into several files (e.g. time_split_size=10) and see if the parallelisation speeds up the squashing at all, compared to squashing to a single file?

johnomotani commented 3 years ago

I realised the check for an existing file was broken when time_split_size was passed. Fixed now, and a test added.

johnomotani commented 3 years ago

Tests are failing because of the regex match I use to check the exception raised when a file already exists. Part of the message is the string 'squashoutput()' (with brackets). When I tested on Python 3.9, I had to make it a raw string with escaped brackets, r"...squashoutput\(\)...", to avoid DeprecationWarnings, but on the GHA server it's failing, saying

AssertionError: Regex pattern 'already exists, squashoutput\\(\\) will not overwrite.  Also, for some filenames collect may try to read from this file, which is presumably not desired behaviour.' does not match '/tmp/pytest-of-runner/pytest-0/test_core_min_files_existing_s8/boutdata.nc already exists, squashoutput() will not overwrite. Also, for some filenames collect may try to read from this file, which is presumably not desired behaviour.'.

Does anyone know a nice way to do this match that will work on Python 3.7, 3.8 and 3.9? Otherwise I'm tempted to just remove the first part of the error message so the brackets aren't there.
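One standard way to make a literal message safe across Python versions is re.escape, which escapes the brackets (and any other regex metacharacters) automatically; the result can be passed to pytest.raises(match=...). A small sketch, using a shortened version of the message above:

```python
import re

# The actual error message, containing regex metacharacters '(' and ')':
message = ("boutdata.nc already exists, squashoutput() will not overwrite. "
           "Also, for some filenames collect may try to read from this file, "
           "which is presumably not desired behaviour.")

# re.escape builds the pattern from the literal string, so no raw strings
# or hand-escaped brackets are needed:
pattern = re.escape("squashoutput() will not overwrite")
assert re.search(pattern, message) is not None
```

Incidentally, the quoted failure above shows the pattern containing a doubled space after "overwrite." while the actual message has a single space; applying re.escape to the exact message string would sidestep that kind of mismatch as well.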

johnomotani commented 3 years ago

I just tested the parallelised writing, and it was actually slower than writing in serial :disappointed: So I think I'll just revert the parallelisation. It's possible that the hold-up is the copying and serialisation/deserialisation involved in passing the arrays to the worker processes (I haven't done any timing to test this), in which case using a shared-memory array might help. I'm not going to try that now though (or any time soon!). Random thought: if using a shared array, it would be nice to be able to allocate and reshape the shared-memory array and then read directly into it, to avoid copying. That isn't currently supported by boututils.DataFile, and as far as I can see (from a quick look) it also isn't supported by the netcdf4 library, so it seems impossible to implement, unless something like np_var[:] = nc_var[:] (where np_var is a numpy array and nc_var is a netcdf4.Variable) already avoids the copy, but I don't know whether it does.
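For reference, a minimal sketch of the shared-memory idea mentioned above, using multiprocessing.shared_memory (available since Python 3.8) with a NumPy view over the buffer. This only demonstrates allocating the block and wrapping it without copying; it does not show reading from a DataFile directly into it, which (as noted) does not appear to be supported:

```python
import numpy as np
from multiprocessing import shared_memory

def roundtrip_via_shared_memory(shape=(10, 4, 4), dtype=np.float64):
    # Allocate a shared block sized for the array...
    nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    try:
        # ...and wrap it as a numpy array without copying.
        arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        arr[:] = 1.0  # stands in for the data read from the dump files

        # A worker process would attach by name instead of receiving a
        # pickled copy of the array:
        attached = shared_memory.SharedMemory(name=shm.name)
        view = np.ndarray(shape, dtype=dtype, buffer=attached.buf)
        total = float(view.sum())

        # Release the numpy views before closing, or close() raises
        # BufferError because the buffer is still exported.
        del view
        attached.close()
        del arr
    finally:
        shm.close()
        shm.unlink()
    return total

print(roundtrip_via_shared_memory())  # 160.0 for a (10, 4, 4) array of ones
```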