TomNicholas opened this issue 1 day ago
Tom, thanks so much for sharing this example. It's just the sort of real-world use case we need to advance scalability.
For anyone who comes across this thread, I want to clarify that these results are totally expected. We have spent zero time so far trying to optimize for this use case. There is a lot of low-hanging fruit we can pick to address the problems that Tom found and make Icechunk scale to handle this problem very well.
^ completely agree - I'm actually quite happy that it was even able to handle 0.1PB without any effort to improve performance on the virtualizarr side. The main performance bottleneck there is probably the dumb serial for loop in virtualizarr's `.to_icechunk()`, which we (@mpiannucci) knew was just a placeholder to get it working correctly before optimizing.
This is awesome @TomNicholas! Thank you so much. 0.1PB is beyond what I expected us to be able to handle with our current single-manifest approach. Things will improve very fast once we have some time to optimize that. Regarding commit time: yes, the process is currently far from optimized for these huge commits; I'm quite surprised we can even handle it.
Thank you for trying all this!
I tried committing a petabyte's worth of virtual references into Icechunk from VirtualiZarr.
It took about 16 minutes for a local store, but crashed after consuming >100GB of RAM when trying to write the same references to an S3 store. Writing 0.1PB of virtual references to an S3 store took 2 minutes and 16GB of RAM.
Notebook here - I successfully ran the local-store test on my Mac, but for some reason trying to write the same data to the S3 store blew my local memory, and then blew the memory of a 128GB Coiled notebook too 🤯
This represents about the largest dataset someone might conceivably want to create virtual references for and commit in one go, e.g. if someone wanted to generate virtual references to all of ERA5 and commit them to an icechunk store. My test here has 100 million chunks; for comparison, the [C]Worthy OAE Atlas dataset is ~5 million chunks.
I have not yet attempted to tackle the problem of actually generating references from thousands of netCDF files and combining them on one worker (xref https://github.com/zarr-developers/VirtualiZarr/issues/7 and https://github.com/zarr-developers/VirtualiZarr/issues/123) - here I just manually created fake manifests that point at imaginary data and committed those to icechunk. So this test represents only the final step of a real virtual ingestion workflow.
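For concreteness, the fake-manifest step amounts to something like the sketch below (not the actual notebook). `ChunkManifest` is the VirtualiZarr class mentioned later in this issue, but the constructor arguments, chunk-key format, and sizes here are illustrative assumptions rather than a faithful copy of the released API:

```python
# Sketch only: build one fake single-variable manifest pointing at imaginary data.
from virtualizarr.manifests import ChunkManifest

n_chunks = 10_000           # the real test used ~100 million chunks in total
chunk_nbytes = 10 * 2**20   # pretend each chunk is 10 MB of data on object storage

entries = {
    str(i): {
        "path": f"s3://imaginary-bucket/file_{i:06d}.nc",  # imaginary object
        "offset": 0,
        "length": chunk_nbytes,
    }
    for i in range(n_chunks)
}
manifest = ChunkManifest(entries=entries)

# In the real test each such manifest was wrapped in a ManifestArray, placed in
# an xarray.Dataset as a virtual variable, and written out via the
# vds.virtualizarr.to_icechunk() step discussed below.
```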
So far no effort has gone into optimizing this on the VirtualiZarr side - we currently have a serial loop over every reference in every manifest, calling `store.set_virtual_ref` millions of times, so this should be viewed as the worst case to improve upon.

As my example has 100 virtual variables of 10TB each, we may be able to get a big speedup in the `vds.virtualizarr.to_icechunk()` step immediately just by writing each variable to icechunk asynchronously within virtualizarr. However that step only takes half the overall time, with `store.commit()` taking the other half, and that's entirely on icechunk. When experimenting, the time taken to write and commit seemed to be proportional to the manifest size, which would make sense.

The RAM usage when writing to the S3 store is odd - watching the indicator in the Coiled notebook, it used 81.5GB of RAM to write the references into the store, then died trying to commit them 😕 I don't know (a) why the RAM usage would be any larger than when writing to the local store, or (b) why it would ever need to get much larger than 3.2GB, which is what it takes virtualizarr to store the references in memory.
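To make the write half of that concrete, here is a minimal sketch of the current per-reference serial loop versus a possible per-variable concurrent version. `set_virtual_ref` is the Icechunk store method mentioned above (argument names assumed); the key layout, the manifest-as-dict iteration, and the thread pool are illustrative choices, not how virtualizarr actually implements (or will implement) this:

```python
from concurrent.futures import ThreadPoolExecutor

def write_variable_serially(store, var_name, manifest_entries):
    # Current behaviour: one blocking set_virtual_ref call per chunk reference.
    # manifest_entries is assumed to be a mapping like
    # {"0.0": {"path": ..., "offset": ..., "length": ...}, ...}
    for chunk_key, entry in manifest_entries.items():
        store.set_virtual_ref(
            f"{var_name}/c/{chunk_key}",  # chunk key layout is illustrative only
            entry["path"],
            offset=entry["offset"],
            length=entry["length"],
        )

def write_dataset_concurrently(store, manifests_by_var, max_workers=16):
    # Possible speedup: overlap the 100 per-variable loops instead of running
    # them back to back. A thread pool is just one way to express that.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_variable_serially, store, name, entries)
            for name, entries in manifests_by_var.items()
        ]
        for future in futures:
            future.result()  # re-raise any per-variable failure
```

Note this only addresses the `to_icechunk()` half of the time; the `store.commit()` half is internal to icechunk and unaffected by anything virtualizarr does.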
Note also that the manifests in the local store take up 16GB on disk, which also seems large. (I could presumably have written out the references by converting the numpy arrays inside the `ChunkManifest` objects to `.npz` format and used <=3.2GB on disk.)

xref https://github.com/zarr-developers/VirtualiZarr/issues/104
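(Re the parenthetical above: the `.npz` comparison amounts to roughly the following - a sketch with illustrative array names, since the only point is that the per-chunk reference arrays can be serialized directly with numpy.)

```python
import numpy as np

def save_references_npz(paths: np.ndarray, offsets: np.ndarray,
                        lengths: np.ndarray, filename: str) -> None:
    # Dump the three per-chunk reference arrays (the kind of data a
    # ChunkManifest holds internally) straight into a single .npz file,
    # with no manifest file format in between.
    np.savez(filename, paths=paths, offsets=offsets, lengths=lengths)

# np.savez_compressed would likely shrink this further, since chunk paths
# within one dataset tend to be highly repetitive.
```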