JuliaClimate / meta

For discussions about JuliaClimate implementations
MIT License

State of the Julia climate stack & desired additions #2

Open gaelforget opened 4 years ago

gaelforget commented 4 years ago

In relation to #1 but somewhat distinct, let's discuss the current stack of packages and functionalities

Topics could include:

gaelforget commented 4 years ago

In a slightly different context I put a list together that is not unrelated @ https://github.com/PraCTES/MIT-PraCTES/issues/17

Datseris commented 4 years ago

I believe I speak for everyone when I say that I desire a stable and well-accepted interface for dimensional data, with a quality matching e.g. Python's xarray. I am currently contributing to and using DimensionalData.jl. It is not mature yet, and also not part of an org yet.

I personally only talk about low-level dimensional data format; a "dedicated format for geo-related data" that e.g. GeoData.jl tries to do is definitely not for me. I always prefer working with basic structs where I decide what is what and can be shared with scientists that use other programming languages (the more basic your data structure, the easier it is to share and communicate). Although I would imagine that something like GeoData.jl, but stable, well accepted and part of an organization would be helpful to many.

Having GeoMakie.jl in an improved state would be awesome. The lead dev @asinghvi17 is an awesome person I've worked with in the past, and I am sure we will work together more on GeoMakie.jl, to which I will soon start contributing. I also plan to make interactive applications based on GeoMakie.jl. Only a few issues are holding me back at the moment before I start contributing there, and I am sure they will be resolved fast.

ClimateTools has a lot of convenient things, but I think we could add flexibility on what is the core underlying data representation (and how detached it is from other packages). I have some methods I am currently writing for my project that I would be very happy to contribute, e.g. inter-annual variabilities.

natgeo-wong commented 4 years ago

Also, I feel a lot of the org's attention is now going into efficient data interfaces and handling, but I think there should also be discussion of the more applied aspects of what's being created here, such as data and analysis from different missions, and analysis of the models that climate scientists use. (Tagging @briochemc because I've noticed he's created packages for analysis of oceanographic mission data, which is relevant.)

I'm less on the data-interface side, and more on coding up things like

And I usually code with those goals in mind, so a lot of my functions are relatively general (e.g. I don't use the ClimGrid structure, partially because I manipulate the data in the backend as raw data arrays), and store meta-information in textfiles.

I also do random things like GillMatsuno.jl, which allows for the exploration of the solutions to tropical heating on a beta plane (similar to @milankl's ShallowWater.jl), but maybe these packages should eventually be organised separately in another place dedicated to idealised/simple models, or something.

One more thing I need to do is get documentation up for ease of use, but I still haven't figured out how Documenter.jl works - it's on my todo list tho.

hdrake commented 4 years ago

In response to @meggart's comment in the master thread: https://github.com/JuliaClimate/meta/issues/1#issuecomment-578127265

I would not know how to efficiently do this workflow in xarray (fitting a PCA on multivariate time series for every pixel in a global dataset)

This is fairly straightforward with the eofs package, which leverages xarray. I've used it and found it incredibly easy to do EOF analysis on my dataset, despite not knowing what an EOF / PCA was at the beginning of the day.

Balinus commented 4 years ago

In response to @meggart's comment in the master thread: #1 (comment)

I would not know how to efficiently do this workflow in xarray (fitting a PCA on multivariate time series for every pixel in a global dataset)

This is fairly straightforward with the eofs package, which leverages xarray. I've used it and found it incredibly easy to do EOF analysis on my dataset, despite not knowing what an EOF / PCA was at the beginning of the day.

It is interesting to see the pattern: people are developing things on a common data structure relevant for climate studies (which is based on N-dimensional arrays). Define a common API for that in Julia and then we can develop things related on those data structures much more easily.

One thing that will show Julia's strength is the addition of modelling to the pipeline. In most high-level languages we have something like (Python example):

  1. Results of a simulation (Fortran) -> 2. xarray (Python) -> 3. xarray compatible library for specific analysis (EOF, Python) -> 4. matplotlib (Python)

where the simulation is usually a model in Fortran, etc. Now, what I see emerging in Julia is a lot of model development (see for example climate-machine). How can we leverage this for science? Quick access to a modular Julia-coded model is certainly something that we should promote. Hence, one strength of Julia is the availability (in the sense that it is Julia source code) and performance of models. I think this accelerates the work done by students and academics in general (I had one PhD colleague who spent months trying to launch a Fortran-based climate model with a custom grid; debugging was a nightmare). Anyway, those are mostly random thoughts I had today. Perhaps it's not relevant and not something we should aim for at the beginning. Or perhaps that's a good starting point, I don't know.

hdrake commented 4 years ago

It seems to me like our best bet right now is to build tools around something like ESDL.jl (see their Pangeo data example), which seems to have the core functionality of xarray: labelled n-dimensional out-of-memory datasets w/ indexing and operations broadcasting, etc. (not sure how much it can all be distributed).

The underlying infrastructure behind all of this seems a bit unstable, however, since the julia community doesn't seem to have settled on a named-array / named-indexing package and even the ESDL.jl devs seem like they may shift their own array types to depend on a package like DimensionalData.jl. Alternatives are AxisArrays.jl (note: even NetCDF.jl may be deprecated in favor of NCDatasets.jl).

Maybe I'm missing something, I need to dig a bit deeper into these repos (unfortunately ESDL.jl documentation is broken for me).

hdrake commented 4 years ago

Define a common API for that in Julia and then we can develop things related on those data structures much more easily.

Agreed @Balinus, this is key. Right now I don't even know where to start because there are so many different labelled array types that you could build climate tools around, but it is not clear to me which features each have and if any of them have all the features we would want in a package to rally around.

My recent paper https://github.com/hdrake/AbyssalFlow was so much easier because both the GCM and my post-processing are in julia. (Unfortunately, the model itself is only coded to run in serial and very, very far from optimized).

Balinus commented 4 years ago

It seems to me like our best bet right now is to build tools around something like ESDL.jl (see their Pangeo data example), which seems to have the core functionality of xarray: labelled n-dimensional out-of-memory datasets w/ indexing and operations broadcasting, etc. (not sure how much it can all be distributed).

I also think we should look closely at ESDL and try to extend the package further to meet our common needs as most boxes are checked imho.

Agreed @Balinus, this is key. Right now I don't even know where to start because there are so many different labelled array types that you could build climate tools around, but it is not clear to me which features each have and if any of them have all the features we would want in a package to rally around.

Indeed! I'm using AxisArrays in ClimateTools and was quite happy with it. However, not sure it's possible to build out-of-core arrays with labels. How are labels used in ESDL? @meggart

edit - Also found this package ChunkedArrayBase.

meggart commented 4 years ago

This is fairly straightforward with the eofs package, which leverages xarray. I've used it and found it incredibly easy to do EOF analysis on my dataset, despite not knowing what an EOF / PCA was at the beginning of the day.

I don't think so. Please look at the example: this is not a standard EOF analysis. Here we fit a new PCA for every single pixel, and the reduced dimension is not time but the different variables. I think the key point in Julia is also that you could simply swap the PCA for a nonlinear dimensionality-reduction method or anything else. To summarize, what I think makes ESDL.jl attractive is that you have nice mapslices syntax for really arbitrary code; you don't have to rely on someone having already wrapped and vectorized your use case.
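
To make the per-pixel operation described above concrete, here is a minimal sketch in Python with NumPy, chosen only because the thread compares against the Python stack. It is not ESDL.jl code and does not reflect its API. It assumes a synthetic cube with dimensions (time, variable, lat, lon), and the names `pixel_pca` and `map_pixels` are hypothetical. For each pixel the PCA reduces the variable dimension, not time.

```python
import numpy as np


def pixel_pca(series, n_components=1):
    """PCA of one pixel's (time, variable) matrix.

    Returns the time series of component scores and the fraction of
    variance explained by each retained component.
    """
    centered = series - series.mean(axis=0)
    # SVD of the centered matrix: rows of vt are principal axes in variable space.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    variance = s**2
    total = variance.sum()
    ratio = variance[:n_components] / total if total > 0 else np.zeros(n_components)
    return scores, ratio


def map_pixels(cube, n_components=1):
    """Apply pixel_pca to every (time, variable) slice of a (time, variable, lat, lon) cube."""
    n_time, _, n_lat, n_lon = cube.shape
    scores = np.empty((n_time, n_components, n_lat, n_lon))
    ratio = np.empty((n_components, n_lat, n_lon))
    for i in range(n_lat):
        for j in range(n_lon):
            scores[:, :, i, j], ratio[:, i, j] = pixel_pca(cube[:, :, i, j], n_components)
    return scores, ratio


# Synthetic data (an assumption for illustration): 24 time steps, 3 variables, 4x5 grid.
rng = np.random.default_rng(0)
cube = rng.normal(size=(24, 3, 4, 5))
scores, ratio = map_pixels(cube)
```

In this sketch `pixel_pca` can be replaced by any other function of a (time, variable) matrix without touching the loop, which is the property the mapslices-style approach is meant to give.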

meggart commented 4 years ago

BTW, sorry for being so slow in replying these days. I am currently putting a lot of effort into DiskArrays.jl, which I hope will eventually give a big improvement in the way disk-mapped arrays can be treated inside the Julia ecosystem.

After working with ESDL.jl for a while I think that treating climate data should feel as natural as possible. When currently using ESDL.jl one still has the feeling of being inside a framework, so you have separate data types and functions for everything. Simple things like broadcasting syntax, sums/means over dimensions etc. are all possible but suffer from the fact that they need a different syntax than what one would expect from Base Julia. So I really hope that as soon as we have stable DiskArrays, we can wrap them in another package implementing a labelled array and base our computing on these.

Currently ESDL.jl comes with its own labelled array type, but I would be happy to support other implementations of labelled arrays as well. My idea to get there was that we define a set of traits for Dimensional Arrays that just defines empty functions for querying the dimension names and dimension values for every axis.

So processing and plotting packages can query the coordinates of every point through the common interface but don't have to be specific about the actual data type they are operating on. This way, packages like ESDL.jl could operate on a variety of data types, as long as they implement the labelled array interface. I once made a gist to propose such an interface and after some discussion it resulted in this package https://github.com/JuliaGeo/DimensionalArrayTraits.jl which contains a lot of ideas but lacks a clear philosophy. I think any work towards a common interface for labelled array data types might have a huge impact on the interoperability of different packages and approaches inside the community.
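
As a rough illustration of the idea, here is a minimal sketch of such an interface, written in Python only so that it stays short and self-contained. It is not the DimensionalArrayTraits.jl API, and `LabelledArray`, `dim_names`, `dim_values`, `GridField` and `describe` are all hypothetical names. The interface consists of just two query functions, and a generic consumer relies on nothing else.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class LabelledArray(Protocol):
    """Hypothetical minimal interface: query dimension names and per-axis values."""

    def dim_names(self) -> Sequence[str]: ...

    def dim_values(self, name: str) -> Sequence[float]: ...


class GridField:
    """One possible implementation: nested lists plus explicit lat/lon coordinates."""

    def __init__(self, data, lat, lon):
        self.data = data
        self._coords = {"lat": list(lat), "lon": list(lon)}

    def dim_names(self):
        return list(self._coords)

    def dim_values(self, name):
        return self._coords[name]


def describe(arr: LabelledArray) -> dict:
    """Generic consumer: uses only the two query functions, not the concrete type."""
    return {
        name: (min(arr.dim_values(name)), max(arr.dim_values(name)))
        for name in arr.dim_names()
    }


field = GridField(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    lat=[-45.0, 45.0],
    lon=[0.0, 120.0, 240.0],
)
extents = describe(field)
```

Any other array type that provides the same two functions could be passed to `describe` unchanged, which is the kind of interoperability such a common interface would aim for.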

Balinus commented 4 years ago

It seems to me like our best bet right now is to build tools around something like ESDL.jl (see their Pangeo data example), which seems to have the core functionality of xarray: labelled n-dimensional out-of-memory datasets w/ indexing and operations broadcasting, etc. (not sure how much it can all be distributed).

Distributed calculations are supported in ESDL! See this thread: https://github.com/esa-esdl/ESDL.jl/issues/170

This is very nice imho. The API is not totally clear in my head, but we have a working example of how we could do a massive big-data analysis through ESDL.

gaelforget commented 4 years ago

Distributed calculations are supported in ESDL! See this thread: esa-esdl/ESDL.jl#170

Cool. Will give it a try.

On a related note, I should mention https://github.com/gaelforget/ClimateTasks.jl (being registered now) which is meant to support distributed tasks (as opposed to the array, nc, etc parts) with a slightly more general but yet topical focus (e.g. to run models or analysis functions). Will expand on this thread soon ... once I have a couple more examples (the included example is an interpolation loop)

meggart commented 4 years ago

In case there is interest in a short introduction to the ESDL.jl API, I would be happy to have a call meeting where we could talk through the concepts in ESDL, comparisons to ClimateTasks.jl etc.

Balinus commented 4 years ago

Yes, I would be interested to know more about the details of ESDL and how I can use it for climate analysis. For instance, coming Friday (21st) might be possible for me.

Cheers!

gaelforget commented 4 years ago

In case there is interest in a short introduction to the ESDL.jl API, I would be happy to have a call meeting where we could talk through the concepts in ESDL, comparisons to ClimateTasks.jl etc.

Sorry for the lag in response -- I am still running behind on a few things after coming back from OSM20 ...

Would be great to learn more about ESDL.jl which I have been meaning to try...

Unless the call meeting already happened, maybe next week would be good for all interested?

Balinus commented 4 years ago

Not too late to the show @gaelforget. :)

meggart commented 4 years ago

Yes, not too late. Let's try to schedule a telecon on potential use of ESDL, maybe next week? Which time zones are you in? I am in CET and would be available in general either during the day (8am-5pm) or in the evening (after 8:30pm). When we know which times work for all of us, we could try to fix a date; otherwise feel free to start a doodle or similar.

Balinus commented 4 years ago

I'm in Eastern time: https://www.timeanddate.com/time/zones/et

Now with COVID-19, here in Québec all schools are closed, probably until mid-May. I have 3 small kids and also need to work! Hence, not sure I'm gonna be free for a couple of weeks.

If the demo goes ahead, I suggest you try to record it. Might be valuable information/tutorial material.