georust / netcdf

High-level netCDF bindings for Rust
Apache License 2.0

Slow disk IO #134

Closed kiranshila closed 4 months ago

kiranshila commented 4 months ago

Hey there! Not sure if this is an issue per se, but I'd like some advice on serializing large files to disk. Right now, I'm constructing a file on the order of 8GB, which takes ~20s to write to disk. Given the NVMe drive in this machine, I'd expect that to take more like 8s. I know the standard tricks in Rust IO land, like using a buffered writer, don't apply since the netcdf library is handling IO, so I'm wondering what I could do to improve performance.

Here is the code where I do this:

https://github.com/GReX-Telescope/GReX-T0/blob/9c757f41db9f8b0ed006a7bd3cecfdc31a85ddbb/src/dumps.rs#L42-L128

Thanks!

lnicola commented 4 months ago

I don't have much experience with NetCDF, but you could try a gdal_translate or gdalmdimtranslate to get a rough idea of the time it takes to make a copy of that file. Of course, that will be double the I/O. You could also look at the compression and chunking settings, but I don't expect compression to make it faster on NVMe.

kiranshila commented 4 months ago

Ah, great! I'll play with set_chunking

lnicola commented 4 months ago

Or try saving to a tmpfs :-).

EDIT: or https://crates.io/crates/zarrs.

kiranshila commented 4 months ago

Ooo tmpfs isn't a bad idea. Right now I'm writing into /tmp and then spawning a thread to copy to the final destination

mulimoen commented 4 months ago

Try to make the writes as large as possible and avoid copying (does pl.into_ndarray() allocate?). Depending on the size of CHANNELS you might have to do a bit more work chunking things up before writing to netCDF to avoid overhead. Aim for chunks ~20MB or ideally larger.

kiranshila commented 4 months ago

Great suggestions, thank you! Yeah for a bit more context, I'm writing out this structured complex time/polarization/frequency data from a radio telescope. Each time slice is 8K of data (at 8us time resolution), so writing out 1s is about 1G. I'm saving the high time resolution data in a ring buffer (that struct in the linked file) and then an external trigger starts the serialization to disk. Right now I'm going row by row in time because the ring buffer isn't in order - I have to just follow the read ptr, but those are only 8K chunks, nowhere near your recommended 20M. Thinking a bit more about this, I suppose I could always chunk the buffer into two halves: read_ptr..end and then 0..read_ptr. So, I might give that a shot, but I need to think how to do that while avoiding allocations.
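The two-halves idea can be sketched with plain slices, independent of netCDF: a ring buffer whose oldest element sits at the read pointer decomposes into two contiguous, in-order runs via `slice::split_at`, with no copying. This is only an illustration (the names and layout are hypothetical, not from the linked code):

```rust
/// Return a ring buffer's contents as two contiguous, in-order slices:
/// the oldest data (read_ptr..end) followed by the wrapped part (0..read_ptr).
/// Both are borrows into the original storage, so no allocation happens.
fn ordered_halves<T>(buf: &[T], read_ptr: usize) -> (&[T], &[T]) {
    let (newer, older) = buf.split_at(read_ptr);
    (older, newer)
}

fn main() {
    // Toy ring buffer: values written in order 0..6, wrapped so that
    // read_ptr = 3 points at the oldest element.
    let storage = [3, 4, 5, 0, 1, 2];
    let (first, second) = ordered_halves(&storage, 3);
    assert_eq!(first, &[0, 1, 2]);
    assert_eq!(second, &[3, 4, 5]);
    // Each half can now go to the file as one large put,
    // instead of one 8K row at a time.
    println!("{:?} then {:?}", first, second);
}
```

Each half is then at most one large contiguous write, which is exactly the "as large as possible" write pattern suggested above.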

kiranshila commented 4 months ago

This then begs the question, if I set voltages.set_chunking(&[1, 2, CHANNELS, 2]) does that require that the call to put is that shape? Or does that chunking happen elsewhere?

mulimoen commented 4 months ago

Chunking puts no requirements on put, but performance will be affected. If the chunks are oversized relative to your writes, netCDF must internally read a partial chunk, merge in the new data, and write the whole chunk back, leading to write amplification. If the chunks are undersized, each put touches many chunks, leading to many extra calls to write the data.
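The tradeoff can be made concrete with a little arithmetic (this is an illustration of the general chunked-storage model, not the netcdf API): a contiguous write of `write_rows` rows starting at `offset` touches every chunk its row range overlaps, and each touched chunk is a full chunk's worth of I/O.

```rust
/// Number of chunks a contiguous write of `write_rows` rows starting at
/// `offset` touches, along a single dimension chunked every `chunk_rows` rows.
fn chunks_touched(offset: usize, write_rows: usize, chunk_rows: usize) -> usize {
    let first = offset / chunk_rows;
    let last = (offset + write_rows - 1) / chunk_rows;
    last - first + 1
}

fn main() {
    // Rows of 8192 bytes each, as in the thread; 2560 rows is ~20 MB.
    // Undersized chunks: a 2560-row write with 1-row chunks means
    // 2560 separate chunk writes.
    assert_eq!(chunks_touched(0, 2560, 1), 2560);
    // Right-sized: 2560-row (~20 MB) chunks -> one chunk write.
    assert_eq!(chunks_touched(0, 2560, 2560), 1);
    // Oversized: the same ~20 MB write landing inside a 100_000-row
    // (~780 MB) chunk still costs a read-modify-write of that one
    // huge chunk -> massive write amplification.
    assert_eq!(chunks_touched(1000, 2560, 100_000), 1);
    println!("2560 rows = {} MB", 2560 * 8192 / (1024 * 1024));
}
```

Aligning write regions with chunk boundaries also matters: a write that straddles a boundary pays for both chunks it touches.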

kiranshila commented 4 months ago

Makes sense, thank you. I'm going to try to reorganize this a bit to remove some allocations and then try to write in two large chunks as described above.

kiranshila commented 4 months ago

So I wrote a little minimal test program, and creating the two consecutive ring buffer chunks and writing with chunking made a huge difference! I also removed a bunch of unnecessary allocations. Code here for the interested. I also benchmarked over a few different chunk sizes (the y axis here is the chunk size along my time axis, so multiply those numbers by 8192 bytes to get the chunk size in bytes).

[benchmark plot: write time vs. time-axis chunk size]

All good stuff and now matching the performance I expected. Thank you for your help everyone!