earth-mover / icechunk

Open-source, cloud-native transactional tensor storage engine
https://icechunk.io
Apache License 2.0

PetaByte-scale virtual ref performance benchmark #401

Open TomNicholas opened 1 day ago

TomNicholas commented 1 day ago

I tried committing a PetaByte's worth of virtual references into Icechunk from VirtualiZarr.

It took about 16 minutes for a local store, but crashed after consuming >100GB of RAM when trying to write the same references to an S3 store. Writing 0.1PB of virtual references to an S3 store took 2 minutes and 16GB of RAM.

Notebook here - I successfully ran the test with the local store locally on my Mac, but for some reason trying to write the same data to the S3 store blew my local memory, and then blew the memory of a 128GB Coiled notebook too 🤯

This represents about the largest dataset someone might conceivably want to create virtual references to and commit in one go, e.g. if someone wanted to generate virtual references to all of ERA5 and commit them to an icechunk store. My test here has 100 million chunks; for comparison, the [C]Worthy OAE Atlas dataset is ~5 million chunks.

I have not yet attempted to tackle the problem of actually generating references from thousands of netCDF files and combining them onto one worker here (xref https://github.com/zarr-developers/VirtualiZarr/issues/7 and https://github.com/zarr-developers/VirtualiZarr/issues/123) - I just manually created fake manifests that point at imaginary data and committed those to icechunk. So this test represents only the final step of a real virtual ingestion workflow.
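To make "fake manifests that point at imaginary data" concrete, here is a minimal sketch of the kind of manifest I mean. It assumes VirtualiZarr's `ChunkManifest` can be constructed from an `entries` dict of paths/offsets/lengths (as in its docs); the paths, offsets, and lengths below are entirely made up.

```python
# Sketch: build a fake manifest whose references point at data that doesn't exist.
# Assumes ChunkManifest(entries=...) accepts a mapping of chunk keys to
# {"path": ..., "offset": ..., "length": ...} entries.
from virtualizarr.manifests import ChunkManifest

fake_entries = {
    f"0.{i}": {
        "path": f"s3://imaginary-bucket/file_{i}.nc",  # imaginary source file
        "offset": 0,
        "length": 100_000_000,  # pretend each chunk is ~100MB
    }
    for i in range(1000)
}
manifest = ChunkManifest(entries=fake_entries)
```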

So far no effort has gone into optimizing this on the VirtualiZarr side - we currently have a serial loop over every reference in every manifest, calling store.set_virtual_ref millions of times, so this should be viewed as the worst case to improve upon.
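For illustration, the serial write path looks roughly like the sketch below. This is not VirtualiZarr's actual code: it assumes `ChunkManifest.dict()` returns a mapping of chunk keys to path/offset/length entries, that `store.set_virtual_ref` has the (key, location, offset, length) signature implied above, and a guessed chunk-key layout inside the store.

```python
# Sketch of the worst-case serial write loop described above (assumed APIs, not real code).
from icechunk import IcechunkStore
from virtualizarr.manifests import ChunkManifest


def write_manifest_serially(store: IcechunkStore, array_name: str, manifest: ChunkManifest) -> None:
    """Write every virtual chunk reference one at a time."""
    for chunk_key, entry in manifest.dict().items():
        store.set_virtual_ref(
            key=f"{array_name}/c/{chunk_key}",  # assumed chunk key layout
            location=entry["path"],
            offset=entry["offset"],
            length=entry["length"],
        )
```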

As my example has 100 10TB virtual variables, we may be able to get a big speedup in the vds.virtualizarr.to_icechunk() step immediately just by writing each variable to icechunk asynchronously within virtualizarr. However, that step only takes half the overall time, with store.commit() taking the other half, and that part is entirely on icechunk. When experimenting, the time taken to write and commit seemed to be proportional to the manifest size, which would make sense.
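As a rough illustration of the asynchronous idea (not an existing VirtualiZarr API; `write_variable_refs` is a hypothetical per-variable helper):

```python
# Sketch: write each virtual variable's references concurrently instead of serially.
import asyncio


async def write_variable_refs(store, name: str, manifest) -> None:
    # Hypothetical helper: issue all of one variable's virtual ref writes.
    ...


async def write_all_variables(store, manifests: dict) -> None:
    """Kick off one write task per virtual variable and wait for them all."""
    await asyncio.gather(
        *(write_variable_refs(store, name, manifest) for name, manifest in manifests.items())
    )


# e.g. asyncio.run(write_all_variables(store, manifests))
```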

The RAM usage when writing to the S3 store is odd - watching the indicator in the Coiled notebook, it used 81.5GB of RAM to write the references into the store, then died trying to commit them 😕 I don't know why (a) the RAM usage would be any larger than when writing to the local store, or (b) it would ever need to get much larger than 3.2GB, which is what it takes virtualizarr to store the references in memory.

Note also that the manifests in the local store take up 16GB on disk, which also seems large. (I could presumably have written out the references by converting the numpy arrays inside the ChunkManifest objects to .npz format and used <=3.2GB on disk.)
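A back-of-the-envelope way to check that on-disk comparison (a sketch assuming the manifest's paths/offsets/lengths can be pulled out as numpy arrays; how you get those arrays out of a ChunkManifest is left out here):

```python
# Sketch: measure how much space the reference arrays take as a compressed .npz file.
import os

import numpy as np


def npz_size_on_disk(paths: np.ndarray, offsets: np.ndarray, lengths: np.ndarray,
                     filename: str = "manifest.npz") -> int:
    """Save the three reference arrays as a compressed .npz and return the file size in bytes."""
    np.savez_compressed(filename, paths=paths, offsets=offsets, lengths=lengths)
    return os.path.getsize(filename)
```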

xref https://github.com/zarr-developers/VirtualiZarr/issues/104

rabernat commented 1 day ago

Tom, thanks so much for sharing this example. It's just the sort of real-world use case we need to advance scalability.

rabernat commented 1 day ago

For anyone who may come across this thread, I want to clarify that these results are totally expected. We have spent zero time so far trying to optimize for this use case. There is plenty of low-hanging fruit we can pick to address the problems Tom found and make Icechunk scale to handle this use case very well.

TomNicholas commented 1 day ago

^ Completely agree - I'm actually quite happy that it was even able to handle 0.1PB without any effort to improve performance on the VirtualiZarr side. The main performance bottleneck there is probably the dumb serial for loop in VirtualiZarr's .to_icechunk(), which we (@mpiannucci) knew was just a placeholder to get it working correctly before optimizing.

paraseba commented 1 day ago

This is awesome @TomNicholas ! Thank you so much. 0.1PB is beyond what I expected us to be able to handle with our current single-manifest approach. Things will improve very fast once we have some time to optimize that. Regarding commit time: yes, the process is currently far from optimized for these huge commits, and I'm quite surprised we can even handle it.

Thank you for trying all this!