Unidata / netcdf4-python

netcdf4-python: python/numpy interface to the netCDF C library
http://unidata.github.io/netcdf4-python
MIT License

Large dataset write speed #987

Open mfreer opened 4 years ago

mfreer commented 4 years ago

Hey all,

I've been working on some code to convert binary image datasets into NetCDF format. The motivation is to help users of these images by providing a common format, since the source data comes in multiple esoteric formats that are difficult for new users to process. Each file contains a large number of images (typically several million), each roughly 128 x 128.

I've found that attempts to write these large datasets to NetCDF are a bit slow. I've tried various chunking schemes, but I haven't been able to achieve any significant performance increases. Are there any methods or tricks I might be missing to increase my write speeds?

Here's a very simple code snippet that shows what I'm looking to improve. On my machine (a 2016 MacBook Pro), writing the 100,000 'images' below takes around 60 s to complete... Any suggestions for improvement would be greatly appreciated!

#!/usr/bin/env python
# coding: utf-8

import netCDF4 as nc
import numpy
import time

rootgrp = nc.Dataset('testrun.nc', 'w')

# both 'Time' and 'Slices' are unlimited dimensions here
rootgrp.createDimension('Time', None)
rootgrp.createDimension('Slices', None)
rootgrp.createDimension('Array', 128)

# one chunk = 10000 * 128 * 128 * 2 bytes, roughly 312 MiB
test1 = rootgrp.createVariable('test1', 'u2', ('Time', 'Slices', 'Array'), chunksizes=(10000, 128, 128))

# note: numpy.zeros defaults to float64, so this buffer is cast to uint16 on write
a = numpy.zeros((100000, 128, 128))

print('starting write')
t0 = time.time()
test1[:] = a
print('write finished: ', time.time() - t0)

rootgrp.close()

jswhit commented 4 years ago

I'm guessing it has something to do with there being more than one unlimited dimension. Do you really need both 'Time' and 'Slices' to be unlimited?
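
As a side note, you can confirm which dimensions ended up unlimited in the file with Dimension.isunlimited(); a quick check on the file produced by the snippet above might look like this:

import netCDF4 as nc

# report the current size of each dimension and whether it is unlimited
with nc.Dataset('testrun.nc') as ds:
    for name, dim in ds.dimensions.items():
        print(name, len(dim), 'unlimited' if dim.isunlimited() else 'fixed')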

mfreer commented 4 years ago

The 'Time' dimension will likely need to stay unlimited, since the number of images isn't known until the dataset is fully decompressed and processed. As for the 'Slices' dimension, not all images have the same number of slices; however, there is a defined maximum, which depends on the source of the images. From a speed point of view, would it be better to set the 'Slices' dimension to this maximum from the beginning?

jswhit commented 4 years ago

I think having the 'Slices' dimension be fixed would speed things up considerably.
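
For concreteness, here is a minimal sketch of that variant of the script above, with 'Slices' fixed at an assumed maximum (MAX_SLICES is a placeholder for whatever maximum the image source defines):

import netCDF4 as nc
import numpy

MAX_SLICES = 128  # placeholder for the source-defined maximum discussed above

rootgrp = nc.Dataset('testrun_fixed.nc', 'w')

rootgrp.createDimension('Time', None)          # still unlimited
rootgrp.createDimension('Slices', MAX_SLICES)  # now fixed
rootgrp.createDimension('Array', 128)

test1 = rootgrp.createVariable('test1', 'u2', ('Time', 'Slices', 'Array'),
                               chunksizes=(10000, MAX_SLICES, 128))

# images with fewer slices than MAX_SLICES would be padded or left as fill values
a = numpy.zeros((100000, MAX_SLICES, 128))
test1[:] = a

rootgrp.close()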

mfreer commented 4 years ago

Thanks all for the input. I've recently had a chance to do some testing, comparing a fixed vs. an unlimited 'Slices' dimension. Using the code above, the difference in write speed was minor (22 s fixed vs. 24 s unlimited), so not as significant as I was hoping.

Is there anything else that could be affecting the write speed? I thought the chunksizes might have some impact, but I haven't found a combination that gives any significant improvement...

jswhit commented 4 years ago

Chunksizes can have a large impact on read and write speed. See https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_perf_chunking.html and https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters.
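
As an illustration only (the chunk length along 'Time' is an assumption, not a recommendation): one experiment along the lines of those documents is to use smaller chunks along 'Time', create the numpy buffer with the variable's dtype, and write in chunk-aligned slabs so that each assignment touches whole chunks:

import netCDF4 as nc
import numpy
import time

CHUNK_T = 256  # assumed chunk length along 'Time'; tune per the docs above

rootgrp = nc.Dataset('testrun_chunked.nc', 'w')
rootgrp.createDimension('Time', None)
rootgrp.createDimension('Slices', 128)
rootgrp.createDimension('Array', 128)

# 256 * 128 * 128 * 2 bytes = 8 MiB per chunk, aligned with the slab writes below
test1 = rootgrp.createVariable('test1', 'u2', ('Time', 'Slices', 'Array'),
                               chunksizes=(CHUNK_T, 128, 128))

a = numpy.zeros((100000, 128, 128), dtype='u2')  # match the variable dtype to avoid a cast

print('starting write')
t0 = time.time()
for start in range(0, a.shape[0], CHUNK_T):
    stop = min(start + CHUNK_T, a.shape[0])
    test1[start:stop] = a[start:stop]  # each write covers whole chunks
print('write finished: ', time.time() - t0)

rootgrp.close()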