Streamline call to `kerchunk.SingleHdf5ToZarr` and add some performance metrics

valeriupredoi commented 1 year ago

Description

@markgoddard found out in #140 that there is some hefty data transfer at the point where kerchunk.SingleHdf5ToZarr is called and run; that's not clear to me if that is actual data (as in, netCDF file's data member) transferred, or it's metadata, or it's something else; but I noticed that the SingleHdf5ToZarr API docs mention the need for a storage_options dict to be passed to fsspec when the file is a string, so I thought we'd create that, pass it, and use it, see if it makes any difference at all.

Closes #140 (well, at least attempts to, if testing proves otherwise, we'll take it from there)

Before you get started

[x] ☝ Create an issue to discuss what you are going to do

Checklist

[x] This pull request has a descriptive title and labels
[x] This pull request has a minimal description (most was discussed in the issue, but a two-liner description is still desirable)
[x] All tests pass

codecov[bot] commented 1 year ago

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Files	Coverage Δ
activestorage/active.py	`92.53% <100.00%> (ø)`
activestorage/netcdf_to_zarr.py	`80.43% <70.00%> (-6.53%)`	:arrow_down:

:loudspeaker: Thoughts on this report? Let us know!.

valeriupredoi commented 1 year ago

@markgoddard no more need to test with your test in #140 - I have implemented here actual pytest monitoring that measures memory used by individual tests - look at "Analyze S3 performance" in the GA test - it is clear that the S3 with AS is using almost exactly as much memory as the no AS case - this means only one thing - data loaded in the client's memory :confounded: :confounded: There is definite data transfer, most probably in kerchunk - I need to look much closer at this, in this PR

valeriupredoi commented 1 year ago

OK we got out culprit for sure - smoking gun held by SingleHdf5ToZarr:

test_Active 881.75390625
test_no_Active 877.3359375
test_s3_SingleHdf5ToZarr 928.3359375
test_local_SingleHdf5ToZarr 890.296875
List of 100 slowest tests (duration, path, name
2.89s test_s3_performance.py::test_Active
2.40s test_s3_performance.py::test_s3_SingleHdf5ToZarr
0.18s test_s3_performance.py::test_local_SingleHdf5ToZarr
0.15s test_s3_performance.py::test_no_Active

valeriupredoi commented 1 year ago

OK gents @markgoddard @bnlawrence @davidhassell I now have a full picture of where and how the resources/data transfer etc are happening in our beloved PyActiveStorage (well, in Active) - here's the memory/time chain for the case of a single netCDF4 file of 50M size, all serial, and all on a GH Actions machine (which is OK since I am the sole user of it when running the test, but numbers ma y vary from run to run by a certain small percentage); numbers are for Active._version = (the one that uses Kerchunk, and should be the closest to active storage proper:

The stack is: SingleHdf5ToZarr -> JSON dump (h5chunks.translate()).encode()) -> load Zarr Group -> have a Zarr reference file (image) -> select chunk -> send back chunk coords to AS -> get result from AS

S3 (data is in an S3 bucket on Minio)

entire Active (end-to-end): 617M / 167s
SingleHDF5ToZarr: 66M / 0.23s
JSON dump: 330M / 163s
load Zarr group (includes JSON dump): 337M / 166s

Local (data is available on disk)

entire Active (end-to-end): 881M / 7.9s
SingleHDF5ToZarr: 67M / 0.07s
JSON dump: 348M / 6.6s
load Zarr group (includes JSON dump): 495M / 7.8s

So it's pretty obvious there is data transfer at least once: at the Kerchunk chunks to JSON dump step; somewhere else along the stack after the Zarr-reference file has been loaded data is mishandled in memory leading to unnecessary memory clogging, but at least for the S3 case, it seems that the only actual data transfer to Client is happening at the JSON dump step, given that that takes the bulk of the time Active runs for. I need to figure out if/how we can get rid of the JSON dumping (ie make h5chunks.translate()).encode() not download the data), otherwise we're fairly hosed :+1:

bnlawrence commented 1 year ago

Just so I understand this. The netCDF file is compressed right? What's the grid of the data in it? What is the chunking? (I am just trying to work out why we have >300M of json from 50M of data? The json should be the kerchunk chunk index right? Is the netCDF file itself sensible chunked?)

bnlawrence commented 1 year ago

(There is also that configuration option which if set wrong, means they load all the data into the json file as an "optimisation")

valeriupredoi commented 1 year ago

@bnlawrence the file created by the vanilla creator:

        make_vanilla_ncdata(filename=s3_testfile,
                            chunksize=(3, 3, 1), n=150)

so a 150x150x150 with no compression, no filters, chunking is horrendous indeed. The JSON file itself is only 45M on disk, just a wee smaller than the actual netCDF data file (which is ~50M on disk)

bnlawrence commented 1 year ago

Is there any chance we are reading that json file every time we process a chunk, or do a test? Are those the timings per chunk, or per test, or per set of tests?

valeriupredoi commented 1 year ago

Is there any chance we are reading that json file every time we process a chunk, or do a test? Are those the timings per chunk, or per test, or per set of tests?

per test, see the test module and see the GA action under the "Analyze S3 and local test performance" step - I think I am now understanding very well what's the hap, reading kerchunk's hdf5.py module, one sec will post findings

valeriupredoi commented 1 year ago

OK running around in circles yelling "my hair is on fire" (Dexter's Laboratory reference, for the older people) phase over - understanding what the heck's going on now:

So the test I am using is this one for a local storage since I can't make use of the S3 container, but things are pretty clear already:

the file in question is https://github.com/valeriupredoi/PyActiveStorage/tree/add_storage_options/tests/s3_exploratory/test_data/test_bizarre.nc which is uncompressed, vanilla netCDF4, of about 50M in size - importantly, it is chunked in (3, 3, 1), and apparently that makes it be of 7 (seven) HDF5 groups.

kerchunk's hdf.py is the one that has SingleHdf5ToZarr class and this is the one that's being used to open the file and convert it to a Zarr reference file. It "translates" the file from HDF5 to Zarr via a translate(self) method that does two main things:

moves the attributes from an HDF5 to a Zarr set of attributes, that's done via self._transfer_attrs(self._h5f, self._zroot)
populates the new Zarr reference file with "data" - that's NOT the actual file's data, but chunk information of the type (Zarr-typed) (x y z): base64(offset, size) - this is done via self._h5f.visititems(self._translator) where _translator is a callable, and visititems is a h5py method that takes the callable and returns what the callable asks for, see https://docs.h5py.org/en/stable/high/group.html#h5py.Group.visititems - this results in the creation of cinfo - a massive dict that contains all manners of UTF8 metadata and a large collection of chunk key-val pairs of (x y z): base64(offset, size)
these things then get dumped to the JSON file; the reason why a JSON file is about 45M for an actual netCDF4 file of 50M is that the JSON contains a shitton of these (x y z): base64(offset, size) (in binary)

Running the test on the file produces this timing table:

Translation begins
Transfer attrs took: 0.0016644001007080078

Running translator
cinfo (part of visititems) took: 1.9593455791473389
Transfer attrs took: 0.0005290508270263672
chunk info storaging took: 1.9480032920837402

Running translator
cinfo (part of visititems) took: 5.841255187988281e-05
Transfer attrs took: 0.0005819797515869141
chunk info storaging took: 1.71661376953125e-05

Running translator
cinfo (part of visititems) took: 4.1484832763671875e-05

Running translator
cinfo (part of visititems) took: 6.198883056640625e-05
Transfer attrs took: 0.00061798095703125
chunk info storaging took: 1.7642974853515625e-05

Running translator
cinfo (part of visititems) took: 8.726119995117188e-05

Running translator
cinfo (part of visititems) took: 5.2928924560546875e-05
Transfer attrs took: 0.0005898475646972656
chunk info storaging took: 1.621246337890625e-05

Running translator
cinfo (part of visititems) took: 4.00543212890625e-05

Visit items took: 4.660032510757446
Trsnafer attrs and visit items took: 4.661792993545532
json encode took: 0.13840699195861816
Translate took: 4.800270318984985

Translator ran 7 times for 7 HDF5 groups - it seems that just the first group is the big one - at any rate, visiting items is absolutely dominant in time for this particular file - a nasty recursive operation that's done on the open file; I don't know how this scales to S3 hosted files, but looking at the duration of tests above this operation is VERY slow for S3 files (20x slower than for local POSIX)

The way _translator is calling _storage_info is probably not optimized for object stores, note that even the HDF5 open file call should use the ros3 kwarg for S3 files.

In conclusion: phew, no actual netCDF4 file's actual data is transferred, but the use of kerchunk's HDF5 to Zarr mapper either needs to be fixed upstream (at least for S3 data) or needs to vanish altogether, bc the performance is abysmal

valeriupredoi commented 1 year ago

holy mother of Jeebus! Rechunking from (3, 3, 1) to (10, 10, 10) for the same 150x150x150 file gets us HUGE speedups and memory saves: see https://github.com/valeriupredoi/PyActiveStorage/actions/runs/6312570050/job/17138886865?pr=146 -

S3 (data is in an S3 bucket on Minio)

entire Active (end-to-end): 87M / 0.55s
SingleHDF5ToZarr: 60M / 0.17s
JSON dump: 64M / 0.49s
load Zarr group (includes JSON dump): 65M / 0.51s

Local (data is available on disk)

entire Active (end-to-end): 98M / 0.14s
SingleHDF5ToZarr: 60M / 0.05s
JSON dump: 64M / 0.09s
load Zarr group (includes JSON dump): 67M / 0.11s

valeriupredoi commented 1 year ago

same results when running locally, now with detailed Kerchunk timing:

3-3-1 chunking

Result

a = {'n': 8, 'sum': array([[[1355404.]]])}, axis = None, dtype = None, out = None, keepdims = False

Kerchunk timing

Kerchunking file /home/valeriu/PyActiveStorage/tests/s3_exploratory/test_data/test_bizarre.nc with variable data for storage type None
Storage type None
Translation begins
Transfer attrs took: 0.001699686050415039
Running translator
cinfo (part of visititems) took: 2.6562392711639404
H5 object Kind f
Transfer attrs took: 0.0005955696105957031
chunk info storaging took: 3.091123580932617
Running translator
cinfo (part of visititems) took: 6.866455078125e-05
H5 object Kind i
Transfer attrs took: 0.0007214546203613281
chunk info storaging took: 2.0742416381835938e-05
Running translator
cinfo (part of visititems) took: 8.440017700195312e-05
H5 object Kind f
Running translator
cinfo (part of visititems) took: 7.963180541992188e-05
H5 object Kind i
Transfer attrs took: 0.0007498264312744141
chunk info storaging took: 2.3365020751953125e-05
Running translator
cinfo (part of visititems) took: 6.747245788574219e-05
H5 object Kind f
Running translator
cinfo (part of visititems) took: 6.818771362304688e-05
H5 object Kind i
Transfer attrs took: 0.0006330013275146484
chunk info storaging took: 2.2649765014648438e-05
Running translator
cinfo (part of visititems) took: 5.054473876953125e-05
H5 object Kind f
Visit items took: 6.826631307601929
Trsnafer attrs and visit items took: 6.828451633453369
json encode took: 0.20699334144592285
Translate took: 7.035499334335327

10-10-10 chunking

Result

a = {'n': 8, 'sum': array([[[1355404.]]])}, axis = None, dtype = None, out = None, keepdims = False

Kerchunk timing:

Kerchunking file /home/valeriu/PyActiveStorage/tests/s3_exploratory/test_data/test_bizarre.nc with variable data for storage type None
Storage type None
Translation begins
Transfer attrs took: 0.005974292755126953
Running translator
cinfo (part of visititems) took: 0.028586149215698242
H5 object Kind f
Transfer attrs took: 0.0006480216979980469
chunk info storaging took: 0.015633106231689453
Running translator
cinfo (part of visititems) took: 7.009506225585938e-05
H5 object Kind i
Transfer attrs took: 0.0007061958312988281
chunk info storaging took: 1.7404556274414062e-05
Running translator
cinfo (part of visititems) took: 5.125999450683594e-05
H5 object Kind f
Running translator
cinfo (part of visititems) took: 6.389617919921875e-05
H5 object Kind i
Transfer attrs took: 0.0004954338073730469
chunk info storaging took: 3.24249267578125e-05
Running translator
cinfo (part of visititems) took: 3.910064697265625e-05
H5 object Kind f
Running translator
cinfo (part of visititems) took: 4.3392181396484375e-05
H5 object Kind i
Transfer attrs took: 0.0004894733428955078
chunk info storaging took: 1.52587890625e-05
Running translator
cinfo (part of visititems) took: 3.790855407714844e-05
H5 object Kind f
Visit items took: 0.07573843002319336
Trsnafer attrs and visit items took: 0.08198189735412598
json encode took: 0.0013606548309326172
Translate took: 0.08354067802429199

valeriupredoi commented 1 year ago

Chunks (75, 75, 75) doubles speedup :clown_face:

valeriupredoi commented 1 year ago

So more test results: in latest test I am creating a 500x500x500 data file with chunks=(75, 75, 75) - the file (which I also created locally) is 1.1G in size. Local AS takes 0.1s and needs 100M of RES memory; S3 AS takes 0.28s and needs 83M of RES memory - numbers check out since I ran local AS on my laptop and it took 0.3s (slightly longer since I am printing stuff to screen - printing takes a lot in these very short cases) and needed 100M of RES mem :partying_face:

@markgoddard I am afraid of abusing the GA machine with large files (I think about 2G is the most I can use), maybe it'd be good if you created a fairly hefty file (10G maybe?) and put it on Minio and tested with it, when you get a bit of time - chunking is absolutely paramount so maybe a (200, 200, 200) chunks for that? :beer:

valeriupredoi commented 1 year ago

chaps @bnlawrence @davidhassell @markgoddard - I'd like to merge this now so we have the tests available in main (for deliverable and other formal reasons, but also, so we can continue looking at performance); incidentally, this PR also cleans up the way we call SingleHdf5ToZarr from Kerchunk (it actually increases performance this way!) - could I ask one of yous to have a looksee and give me a review please - don't have to be super careful review, this doesn't break anything :grin:

valeriupredoi commented 1 year ago

I'll merge this for now - we can always PR improvements :grin:

NCAS-CMS / PyActiveStorage