InferenceData.to_netcdf and Arviz.from_netcdf be able to take file objects or buffers

k-sys commented 4 years ago

Most libraries that allow writing or reading files also allow reading from and writing to a Python file object or an IO file-like object (e.g. BytesIO). Currently ArviZ requires writing these files to disk, which creates an awkward dance if one simply wants to upload to S3, for instance. This may not seem like a big deal, but the files can be large, you have to remember to delete them off disk after you're done with them, etc.

Currently, to upload to S3, I have to do the following:

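import tempfile

import boto3

s3 = boto3.resource("s3")

# Write to a temporary file on disk, then upload that file object to S3.
with tempfile.NamedTemporaryFile() as fp:
    inference_data.to_netcdf(fp.name)
    fp.seek(0)
    s3.Bucket("bucket-name").upload_fileobj(fp, "my-s3-file-key")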

It would be much easier to do this (without writing to disk):

from io import BytesIO

with BytesIO() as buffer:
    inference_data.to_netcdf(buffer)
    buffer.seek(0)  # rewind so the upload starts from the first byte
    s3.Bucket("bucket-name").upload_fileobj(buffer, "my-s3-file-key")

Even better, I'd suggest using s3fs, which is what pandas does, and supporting the s3:// "protocol", so:

inference_data.to_netcdf("s3://bucket-name/my-s3-file-key")

However, this last suggestion is separate from the more important ability to write to and read from a buffer or file object.
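In the meantime, the temp-file dance above could be wrapped once in a small helper so the cleanup is not forgotten. This is only a sketch: `to_buffer` is a hypothetical name, not part of ArviZ, and it assumes the writer is any callable that takes a file path (such as `inference_data.to_netcdf`). It still touches disk and reads the whole file into memory, so it hides the workaround rather than removing it.

```python
import os
import tempfile
from io import BytesIO


def to_buffer(write_to_path, suffix=".nc"):
    """Run a path-only writer against a temp file and return the bytes in a BytesIO.

    `write_to_path` is any callable taking a file path, for example
    `inference_data.to_netcdf`. Hypothetical helper, not part of ArviZ.
    """
    # Create the temp file and close it right away, so the writer can
    # reopen it by name on any platform.
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        write_to_path(path)
        with open(path, "rb") as fp:
            buffer = BytesIO(fp.read())
    finally:
        # Always delete the temp file, even if the writer raises.
        os.remove(path)
    return buffer  # position is 0, so it can be handed straight to an uploader
```

With this, the upload would become `s3.Bucket("bucket-name").upload_fileobj(to_buffer(inference_data.to_netcdf), "my-s3-file-key")`.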

canyon289 commented 4 years ago

Some extra notes.

Usefulness of this will increase as our InferenceData objects get bigger. I was already struggling last year with InferenceData objects being too big for git, so storing them somewhere else would be nice.
@aloctavodia mentioned having a netcdf store somewhere. This method would make it easier to store InferenceData objects there.

If this is implemented, an external lib like s3fs should be optional, so users who don't use the cloud don't have to install the dependency.
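One way to keep s3fs optional is to import it lazily, only when an s3:// path is actually passed. The sketch below is an assumption about how that could look, not existing ArviZ code; `open_target` is a hypothetical helper name.

```python
def open_target(path, mode="rb"):
    """Open a local path or an s3:// URL as a file object (hypothetical helper)."""
    if path.startswith("s3://"):
        try:
            import s3fs  # optional dependency, imported only when needed
        except ImportError as err:
            raise ImportError(
                "Reading or writing s3:// URLs requires the optional s3fs package."
            ) from err
        return s3fs.S3FileSystem().open(path, mode)
    # Local paths never touch s3fs, so users without it are unaffected.
    return open(path, mode)
```

Users who only work with local files never trigger the import, and users who pass an s3:// URL without s3fs installed get an error that says what to install.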

OriolAbril commented 4 years ago

I think this could be added upstream to xarray (I'm not sure to what extent it is already possible but undocumented, and to what extent it is not possible); the issue https://github.com/pydata/xarray/issues/4122 seems related. After all, we are basically calling xarray.Dataset.to_netcdf to do all the magic.

This other issue, https://github.com/pydata/xarray/issues/2995, also looks related, and it could make sense to look into zarr too.

canyon289 commented 4 years ago

@percygautam Not implying that you need to implement this as part of your GSoC, or go out of your way, but as you're looking around InferenceData, I would appreciate any thoughts or advice you have around this issue.

percygautam commented 4 years ago

@canyon289 Sorry, I must have missed this tag. I'll definitely give it a try and help in any way I can.

OriolAbril commented 4 years ago

This could be helpful, in addition to the issues linked above: https://twitter.com/dopplershift/status/1286415993347047425

arviz-devs / arviz

InferenceData.to_netcdf and Arviz.from_netcdf be able to take file objects or buffers #1237