Cloud-Drift / clouddrift

CloudDrift accelerates the use of Lagrangian data for atmospheric, oceanic, and climate sciences.
https://clouddrift.org/
MIT License

GLAD dataset adapter #61

Closed milancurcic closed 11 months ago

milancurcic commented 1 year ago

Part of #53.

Can be adapted from clouddrift-examples/data/glad.py into clouddrift/adapters.py.

selipot commented 1 year ago

So will we also have adapters for gdp, gdp6h, parcels, etc.?

milancurcic commented 1 year ago

Yes, and anything else that we want and that our users ask for, assuming it's in scope.

miniufo commented 1 year ago

Hi guys, this is an interesting project that I recently came across. I found that the ragged array data structure could also be applied to other Lagrangian data, such as tropical cyclone best-track datasets (see my repo). I once tried to design a data structure (basically a wrapper around pandas.DataFrame) and adapt it to the GDP drifter dataset (the 6-hourly version, not the hourly one; see here). Since your ragged data structure follows the CF conventions, I feel it would be much better to use it to refactor my tropical cyclone repo.

A further thought: would it be possible to isolate the Lagrangian data structure as a standalone package, like xarray, so that the GDP datasets, the GLAD dataset, and tropical cyclone datasets (or any Lagrangian dataset in geoscience, including synthetic particles generated by numerical models) could easily be built on it, with some additional effort to parse each dataset into the ragged array (using different adapters)?

How efficient is the ragged array when handling very large datasets? Pandas and xarray have many capabilities for dealing with huge datasets (like out-of-core computation). Since the docs are still in development, I don't know many details of your design.

Just some thoughts on this great package.

philippemiron commented 1 year ago

The main class of the package is designed to be used with any dataset. Look at the example notebooks here: https://github.com/Cloud-Drift/clouddrift-examples/tree/main/notebooks. In particular, I think the numerical data could be adapted to your needs!

Happy to help if you have any questions.

PS: we are changing the name of the class from dataformat to raggedarray as part of #171.

milancurcic commented 1 year ago

Thanks @miniufo for your interest and ideas. To clarify, the RaggedArray class is an intermediate data structure used internally to go from custom data formats -> xr.Dataset. It's not intended for use in analysis; instead, we define our Lagrangian analysis functions on the ragged-array xr.Dataset. You're correct that TC tracks (and intensity and other vortex properties) are essentially Lagrangian and fit here very well.

You're welcome to use clouddrift's RaggedArray as a dependency in your library to make adapters for HURDAT2 and/or IBTracs or others.

Alternatively, we can also implement these adapters directly in clouddrift; we could work on that together if you'd like.

miniufo commented 1 year ago

@philippemiron Thanks for pointing me to the notebooks. I've spent some time trying out the RaggedArray data structure. Now I see that it is an internal thing, as mentioned by @milancurcic, and the output xr.Dataset is the key data structure users work with.

I'm a little confused about why we need an internal RaggedArray. All the Lagrangian datasets that are stored as txt files could easily be handled by pandas. I think pandas can play a similar role to RaggedArray and help rearrange the data into an xr.Dataset. If so, I may skip the RaggedArray dependency and rely on pandas to rearrange the raw data into an xr.Dataset as you designed here.

I'm just trying to understand your design. I'd like to help if I can.

philippemiron commented 1 year ago

> @philippemiron Thanks for pointing me to the notebooks. I've spent some time trying out the RaggedArray data structure. Now I see that it is an internal thing, as mentioned by @milancurcic, and the output xr.Dataset is the key data structure users work with.

This is correct. Most of the analysis functions are based on xr.Dataset (some also support pd.Series or np.array).

> I'm a little confused about why we need an internal RaggedArray. All the Lagrangian datasets that are stored as txt files could easily be handled by pandas. I think pandas can play a similar role to RaggedArray and help rearrange the data into an xr.Dataset. If so, I may skip the RaggedArray dependency and rely on pandas to rearrange the raw data into an xr.Dataset as you designed here.

The idea of the RaggedArray class is to simplify this conversion. You can of course generate the ragged array yourself and use the clouddrift analysis functions afterwards.

In your case, if I understand correctly, you can probably just reshape the data and create a RaggedArray object in a few lines. As Milan said, we could help you generate this. It should be easy considering it's a single .txt file.

Once you have this object, there are functions to easily convert it to an xr.Dataset or an Awkward Array, or to write it to a NetCDF or Parquet file.
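
To illustrate the kind of reshaping discussed here, the following is a minimal pure-Python sketch of a contiguous ragged-array layout: observations from all trajectories are concatenated into flat lists, and a per-trajectory row size records how many observations belong to each one. It does not use clouddrift's RaggedArray API; the record fields and the helper names `to_ragged` and `trajectory` are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def to_ragged(records):
    """Group (id, time, lon, lat) records into a contiguous ragged layout.

    Hypothetical helper for illustration; not part of clouddrift.
    """
    # Sort by trajectory id, then time, so each trajectory is contiguous.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    ids, rowsize = [], []
    for rec in ordered:
        if ids and ids[-1] == rec[0]:
            rowsize[-1] += 1  # same trajectory: one more observation
        else:
            ids.append(rec[0])  # new trajectory starts here
            rowsize.append(1)
    return {
        "id": ids,            # one entry per trajectory
        "rowsize": rowsize,   # number of observations per trajectory
        "time": [r[1] for r in ordered],  # one entry per observation
        "lon": [r[2] for r in ordered],
        "lat": [r[3] for r in ordered],
    }


def trajectory(ragged, index, name):
    """Return the observations of variable `name` for one trajectory."""
    ends = list(accumulate(ragged["rowsize"]))
    start = ends[index] - ragged["rowsize"][index]
    return ragged[name][start:ends[index]]


# Rows as they might appear in a flat text file, in arbitrary order.
records = [
    ("B", 0, -88.0, 28.5),
    ("A", 0, -87.0, 29.0),
    ("A", 1, -87.1, 29.1),
    ("B", 1, -88.2, 28.6),
    ("A", 2, -87.2, 29.2),
]
ragged = to_ragged(records)
print(ragged["rowsize"])             # [3, 2]
print(trajectory(ragged, 1, "lon"))  # [-88.0, -88.2]
```

The same grouping could be done with pandas; the point of the sketch is only that a flat table plus a row-size vector is enough to recover each trajectory.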

> I'm just trying to understand your design. I'd like to help if I can.

milancurcic commented 11 months ago

I haven't found a way to download the dataset (https://data.gulfresearchinitiative.org/data/R1.x134.073:0004) from the code. This is because there is no static dataset URL, but instead it's resolved dynamically via JavaScript (and quite likely server calls). We have a few options:

  1. Instruct the user to download the dataset to the local file system before running the adapter;
  2. Upload a copy of the dataset to S3 or some static source. I've used GitHub issues as file storage (you attach a file to a blank issue, close the issue, and you get a static URL to the file; however, this only works for files under 25 MB, and GLAD is 150 MB).

Option 2 would allow for a better user experience. Since the dataset is DOI'd and finalized, we could serve a copy from a place we control without worrying that the upstream dataset may change. @selipot do we have an S3 bucket for the project that we could use?
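
Whichever static source is used, the adapter side could look roughly like the sketch below: download the file once and reuse the local copy afterwards. This is only an illustration using the standard library; `fetch_dataset` is a hypothetical helper, not a clouddrift function, and the URL in the usage comment is a placeholder rather than the real GLAD location.

```python
import os
import urllib.request


def fetch_dataset(url, local_path):
    """Download `url` to `local_path` unless a local copy already exists.

    Hypothetical helper for illustration; not part of clouddrift.
    """
    if not os.path.exists(local_path):
        directory = os.path.dirname(local_path)
        if directory:
            os.makedirs(directory, exist_ok=True)
        # Download to a temporary name first so an interrupted transfer
        # does not leave a truncated file that looks like a valid cache.
        partial = local_path + ".part"
        urllib.request.urlretrieve(url, partial)
        os.replace(partial, local_path)
    return local_path


# Usage (placeholder URL, assumed for illustration only):
# path = fetch_dataset("https://example.org/glad.dat", "data/glad.dat")
```

This also covers option 1: a user who has already downloaded the file by hand only needs to place it at `local_path`, and no network request is made.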

selipot commented 11 months ago

We do not have a bucket, but we could create one. We'd need to figure out the cost.

milancurcic commented 11 months ago

S3 Standard is $0.023 per GB, so for GLAD that would be $0.00345 per download, or 290 downloads per $1.
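
The arithmetic behind this estimate, written out as a short sketch. It takes the per-GB figure quoted above at face value and applies it per download; the 0.150 GB size assumes decimal units for the 150 MB dataset.

```python
PRICE_PER_GB = 0.023     # USD, the S3 Standard figure quoted above
DATASET_SIZE_GB = 0.150  # GLAD is about 150 MB (decimal units assumed)

cost_per_download = PRICE_PER_GB * DATASET_SIZE_GB
downloads_per_dollar = 1 / cost_per_download

print(f"${cost_per_download:.5f} per download")        # $0.00345 per download
print(f"{downloads_per_dollar:.0f} downloads per $1")  # 290 downloads per $1
```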

milancurcic commented 11 months ago

I now see that @philippemiron had already extracted a static URL from the backend in the GLAD example notebook. I'll check that it still works, and we'll just use that if so.

milancurcic commented 11 months ago

It works; all good.

philippemiron commented 11 months ago

I think I looked at the Developer tools -> Network tab at the time to find this direct link...! Glad to see it still works!

milancurcic commented 11 months ago

@philippemiron that's smart, I hadn't thought of that; I only looked at the page source. :)