GLAD dataset adapter

milancurcic commented 1 year ago

Part of #53.

Can be adapted from clouddrift-examples/data/glad.py into clouddrift/adapters.py.

selipot commented 1 year ago

So will we also have gdp, gdp6h, parcels, etc. adapters?

milancurcic commented 1 year ago

Yes, and anything else that we want and that our users ask for, assuming it's in scope.

miniufo commented 1 year ago

Hi guys, this is an interesting project that I recently came across. I found that the ragged array data structure could also be applied to other Lagrangian data such as tropical cyclone best-track datasets (see my repo). I once tried to design a data structure (basically a wrapper around pandas.DataFrame) and adapt it to the GDP drifter dataset (6-hourly version, not hourly, see here). Since your ragged data structure follows the CF conventions, I feel it would be much better to use it to refactor my tropical cyclone repo.

A further thought: is it possible to isolate the Lagrangian data structure as a standalone package, like xarray, so that the GDP datasets, the GLAD dataset, and tropical cyclone datasets (or any Lagrangian dataset in geoscience, including synthetic particles generated by numerical models) can easily be built on it, with some additional effort to parse each dataset into the ragged array (using different adapters)?

How efficient is the ragged array when handling very large datasets? Pandas and xarray have many capabilities for dealing with huge datasets (such as out-of-core computation). Since the docs are still in development, I can't tell many details of your design.

Just some thoughts on this great package.

philippemiron commented 1 year ago

The main class of the package is designed to be used with any dataset. Take a look at the example notebooks at https://github.com/Cloud-Drift/clouddrift-examples/tree/main/notebooks; in particular, I think the numerical data notebook could be adapted to your needs!

Happy to help if you have any questions.

PS: we are changing the name of the class from dataformat to raggedarray as part of #171.

milancurcic commented 1 year ago

Thanks @miniufo for your interest and ideas. To clarify, the RaggedArray class is an intermediate data structure used internally to go from custom data formats -> xr.Dataset. It's not intended for use in analysis; instead, we define our Lagrangian analysis functions on the ragged array xr.Dataset. You're correct that TC tracks (and intensity and other vortex properties) are essentially Lagrangian and fit here very well.
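To make the "analysis functions on a ragged array" idea concrete, here is a minimal NumPy-only sketch. It assumes the contiguous ragged layout: every variable is concatenated along one observation axis, and a separate array holds the number of observations per trajectory. The names `rowsize`, `lon`, and `apply_per_trajectory` and the values are hypothetical illustrations, not clouddrift's API.

```python
import numpy as np

# Three hypothetical trajectories with 3, 2 and 4 observations each.
rowsize = np.array([3, 2, 4])

# All longitudes concatenated along a single observation axis.
lon = np.array([-90.0, -89.9, -89.8, -88.0, -87.9, -85.0, -84.9, -84.8, -84.7])


def apply_per_trajectory(func, values, rowsize):
    """Apply func to each trajectory's slice of a flat ragged array."""
    # Positions where one trajectory ends and the next begins.
    split_points = np.cumsum(rowsize)[:-1]
    return [func(chunk) for chunk in np.split(values, split_points)]


# Number of observations and longitude range of each trajectory.
n_obs = apply_per_trajectory(len, lon, rowsize)
lon_span = apply_per_trajectory(lambda x: float(x.max() - x.min()), lon, rowsize)
```

Because trajectories of different lengths sit back to back, no padding is needed, and a per-trajectory operation only has to know where the boundaries are.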

You're welcome to use clouddrift's RaggedArray as a dependency in your library to make adapters for HURDAT2 and/or IBTrACS or others.

Alternatively, we can also implement these adapters directly in clouddrift; we could work on that together if you'd like.

miniufo commented 1 year ago

@philippemiron Thanks for pointing me to the notebooks. I've spent some time trying out the RaggedArray data structure. Now I see that it is an internal thing, as mentioned by @milancurcic, and the output xr.Dataset is the key data structure users work with.

I'm a little confused about why we need an internal RaggedArray. Any Lagrangian dataset stored as a txt file could easily be handled by pandas. I think pandas can play a similar role to RaggedArray and help rearrange the data into an xr.Dataset. If this is the case, I may skip the dependency on RaggedArray and rely on pandas to rearrange the raw data into an xr.Dataset laid out as you designed it here.

Just trying to understand your design. I would like to help if I can.

philippemiron commented 1 year ago

> @philippemiron Thanks for pointing me to the notebooks. I've spent some time trying out the RaggedArray data structure. Now I see that it is an internal thing, as mentioned by @milancurcic, and the output xr.Dataset is the key data structure users work with.

This is correct. Most of the analysis functions are based on xr.Dataset (some also support pd.Series or np.array).

> I'm a little confused about why we need an internal RaggedArray. Any Lagrangian dataset stored as a txt file could easily be handled by pandas. I think pandas can play a similar role to RaggedArray and help rearrange the data into an xr.Dataset. If this is the case, I may skip the dependency on RaggedArray and rely on pandas to rearrange the raw data into an xr.Dataset laid out as you designed it here.

The idea of the RaggedArray class is to simplify this conversion. You can of course generate the ragged array yourself and use the clouddrift analysis functions afterwards.
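For the do-it-yourself route, a rough pandas sketch might look like the following. It assumes the raw text file has already been read into a flat table with one row per observation; the column names (`id`, `time`, `lon`, `lat`) and values are made up for illustration, and nothing here calls clouddrift.

```python
import numpy as np
import pandas as pd

# Hypothetical flat table, one row per observation (e.g. from pd.read_csv).
df = pd.DataFrame({
    "id":   [2, 1, 1, 2, 2, 3],
    "time": [0, 0, 6, 6, 12, 0],
    "lon":  [-60.0, -80.0, -80.5, -60.4, -60.9, -45.0],
    "lat":  [15.0, 25.0, 25.3, 15.2, 15.5, 30.0],
})

# Sort so each trajectory's observations are contiguous and time-ordered.
df = df.sort_values(["id", "time"], kind="stable").reset_index(drop=True)

# One entry per trajectory: its id and its number of observations.
counts = df.groupby("id", sort=True).size()
ids = counts.index.to_numpy()
rowsize = counts.to_numpy()

# One entry per observation, concatenated along the observation axis.
ragged = {name: df[name].to_numpy() for name in ["time", "lon", "lat"]}
```

From here, `ids` and `rowsize` would go on a per-trajectory dimension and the arrays in `ragged` on a per-observation dimension when building the xr.Dataset; the exact dimension and variable names the analysis functions expect should be checked against the clouddrift docs.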

In your case, if I understand correctly, you can probably just reshape the data and create a RaggedArray object in a few lines. As Milan said, we could help you generate this; it should be easy considering it's a single .txt file.

Once you have this object, there are functions to easily convert it to either an xr.Dataset or an Awkward Array, or to write it to a NetCDF or Parquet file.

> Just trying to understand your design. I would like to help if I can.