SWOT-JPL / swotsimulator

simulator in the cloud #19

Closed LauraGomezNavarro closed 3 years ago

LauraGomezNavarro commented 5 years ago

Given that some model data are in the cloud, we are wondering about running the SWOT simulator from the cloud (@raphaeldussin @rabernat), which would be of interest to the SWOT community. The main issue we have found so far is that the SWOT simulator uses netCDF files as inputs and outputs. Any ideas on this? Thanks in advance!

lgaultier commented 5 years ago

Yes Laura, many ideas indeed, but nothing is certain to be done yet. I will give more details by PM.

rabernat commented 5 years ago

Thanks for starting this discussion. I am at a meeting today and won't be able to give a long reply for a few days. I would love to apply swot simulator to our cloud-based LLC4320 data.

However, there is one clear task upon which everything else would depend: refactor the swot simulator to consume and produce xarray datasets (rather than reading and writing netCDF files). This allows us to plug in any of the other I/O backends from xarray (zarr, hdf, rasterio, grib, etc etc), and allows swot simulator to focus on the science aspects, rather than file management.
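As a rough illustration of the "consume and produce xarray datasets" idea, here is a minimal sketch. The function `add_noise`, the variable name `ssh`, and the noise model are all hypothetical stand-ins, not part of swot simulator. The point is only that the function never touches a file, so the caller is free to build the input from netCDF, zarr, or memory.

```python
import numpy as np
import xarray as xr


def add_noise(ds: xr.Dataset, std: float = 0.01, seed: int = 0) -> xr.Dataset:
    """Hypothetical stand-in for a simulator step: Dataset in, Dataset out."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, std, size=ds["ssh"].shape)
    # assign() returns a new Dataset; the input is left untouched.
    return ds.assign(ssh_obs=ds["ssh"] + noise)


# In-memory input for the sketch; the caller could just as well have
# opened it with xr.open_dataset or xr.open_zarr.
ssh = xr.DataArray(
    np.zeros((4, 5)),
    dims=("lat", "lon"),
    coords={"lat": np.linspace(-2.0, 2.0, 4), "lon": np.linspace(0.0, 4.0, 5)},
    name="ssh",
)
model = ssh.to_dataset()
simulated = add_noise(model)
```

Writing the result out, in whatever format, would then also be the caller's decision rather than the simulator's.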

ezaron commented 5 years ago

Oh wow. I feel ~so~ dated. Was it really so long ago we were having this same discussion about endian-ness and fortran binary vs. netcdf, too? Thanks, Ryan, you really made my day.

Please note, from the xarray documentation: "NetCDF is the recommended binary serialization format for xarray objects."

All the best,

Ed

--

Edward D. Zaron
Research Associate Professor
Department of Civil and Environmental Engineering
Portland State University
Portland, OR 97207-0751
Phone: (503)-725-2435
FAX: (503)-725-5950
ezaron@pdx.edu

rabernat commented 5 years ago

@ezaron I'm not sure I know how to interpret your comment or understand the context around it. Are you being sarcastic? I basically don't understand what message you wish to convey.

The point is that xarray is not a storage format. It is a symbolic data structure within Python which conforms to the netCDF data model and can be used with a very wide variety of backend storage formats. In the cloud, we don't use files. We access data directly from the object store. For the reasons why, I refer you to this blog post: http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud

By refactoring python tools to use xarray, rather than doing their own I/O, we can factor out the details of the storage format and build a pathway to experiment with new storage formats.

ezaron commented 5 years ago

Hi Ryan,

No, I wasn't being sarcastic. Good question, though. The wording of your comments about using xarray, so that we can "focus on the science aspects, rather than file management", is reminiscent of virtually identical discussions from the mid- to late 90's about whether to move beyond fortran binary file formats (which are neither machine nor compiler independent; netcdf, hdf, and their toolchains have been huge improvements over the previous status quo). It made me laugh to consider how dated I had become without realizing it. Sorry if I was not clear in my post.

I appreciate that you are pointing to a level of abstraction about data structures in the cloud which differs from simple considerations of the file format. I still need to peruse the xarray material in detail. Thanks for pointing out this possibility to the list.

All the best,

Ed

rabernat commented 5 years ago

Thanks for the clarification Ed! 😄

One big irony here is that the MITgcm LLC4320 did in fact output a custom fortran binary file format! So the problems of the 90s are still with us! 😱 Our solution to this is to use the xmitgcm package to read the data, which wraps the binary data in an xarray data structure. At that point, it is indistinguishable from a netCDF file to downstream code.
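To make the "wrap binary data in an xarray data structure" idea concrete, here is a minimal sketch using only numpy and xarray. It does not use the xmitgcm API. The byte layout (headerless big-endian float32), the grid size, and the names `Eta`, `j`, `i` are assumptions chosen for illustration.

```python
import numpy as np
import xarray as xr

# Stand-in for a headerless big-endian float32 record on disk (assumed layout).
ny, nx = 3, 4
raw = np.arange(ny * nx, dtype=">f4").tobytes()

# Reinterpret the bytes as a 2-D array; no file-format library is involved.
values = np.frombuffer(raw, dtype=">f4").reshape(ny, nx)

# Attach names, coordinates and metadata so downstream code sees labeled data.
eta = xr.DataArray(
    values,
    dims=("j", "i"),
    coords={"j": np.arange(ny), "i": np.arange(nx)},
    name="Eta",
    attrs={"units": "m"},
)
ds = eta.to_dataset()
```

Once the data are in `ds`, code that selects by dimension name or reads `attrs` has no way of telling that the values did not come from a netCDF file.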

I agree that the netCDF file format was a major breakthrough. With xarray, we are trying to keep the netCDF data model that we all love while experimenting with the underlying storage containers.

If anyone wants to see what it means to work with zarr data via xarray in the cloud, just click this button!

Binder

rabernat commented 5 years ago

Any update on this issue? I would love to be able to run swot simulator on our cloud-based Zarr datasets (e.g. https://pangeo-data.github.io/pangeo-datastore/master/ocean/llc4320.html). How can we help refactor swot simulator to relax the requirement of netCDF files as input?

lgaultier commented 5 years ago

This is currently being implemented; it should be done by the end of the year, with some parallelization capabilities better suited to supercomputers. I will keep you posted in this thread when it is finalized.

rabernat commented 5 years ago

it should be implemented by the end of the year, with some parallelization capabilities more adequate for supercomputer.

Great news! We are eager to collaborate on this. We in the Pangeo project would be very glad to help, as we have a lot of experience developing xarray-friendly packages that achieve parallelization via dask. Please let me know if there is anything specific we can do.

lgaultier commented 4 years ago

A new version, more flexible regarding the input format, is coming. It is a different code base, with different parameter files. The reading of files is handled by plugin so that each user can adapt their own plugin and read their data. The data are then formatted with xarray and provided as input to the simulator. I will communicate on this new version in a month. It will be available on the CNES Hal platform, and my understanding is that any user can ask for an account on this platform.
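One way to picture a plugin-based reader that hands xarray data to a simulator is sketched below. This is not the interface of the new code, which is not shown in this thread. The registry `READERS`, the `register` decorator, the `load` function, and the `synthetic` plugin are all hypothetical names invented for the example.

```python
from typing import Callable, Dict

import numpy as np
import xarray as xr

# Hypothetical registry mapping a plugin name to a reader function.
READERS: Dict[str, Callable[[str], xr.Dataset]] = {}


def register(name: str):
    """Decorator that stores a reader function under the given name."""
    def wrap(func):
        READERS[name] = func
        return func
    return wrap


@register("synthetic")
def read_synthetic(source: str) -> xr.Dataset:
    # A real plugin would open `source`; this one fabricates a small grid.
    ssh = xr.DataArray(np.ones((2, 3)), dims=("lat", "lon"), name="ssh")
    return ssh.to_dataset()


def load(plugin: str, source: str) -> xr.Dataset:
    """Look up the user's plugin and return its data as an xarray Dataset."""
    return READERS[plugin](source)


data = load("synthetic", "unused-path")
```

In a design like this, each user writes one reader for their own model output, and everything downstream of `load` only ever sees an xarray Dataset.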

rabernat commented 4 years ago

The reading of files is handled by plugin so that each user can adapt their own plugin and read their data.

May I politely suggest that you simply eliminate I/O from swot simulator. Let the I/O be handled by the user, and have swot simulator accept xarray datasets as its input. It's a lot of work to implement I/O for all the different possible formats out there. Xarray already supports a dozen common formats: http://xarray.pydata.org/en/stable/io.html.

The way I would like to use swotsimulator is the following:

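import xarray as xr
import swotsimulator

# list_of_my_files: paths to the user's model output files
input_data = xr.open_mfdataset(list_of_my_files)
# proposed interface: xarray dataset in, xarray dataset out
simulated_data = swotsimulator.simulate(input_data)  # returns xarray dataset

lgaultier commented 3 years ago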

The refactoring of the simulator for the cloud is done and has been moved to https://github.com/CNES/swot_simulator