JuliaClimate / meta

For discussions about JuliaClimate implementations

Focus of the organization: Subjects, packages, notebooks, etc #1

Open Balinus opened 4 years ago

Balinus commented 4 years ago

Hello and welcome!

As a first discussion, I propose we list potential repos that could be included in the JuliaClimate organization.

This will probably lead to a discussion about the focus of the organization. I'd say that anything close to climate data, climate models and climate analysis could be a potential candidate.

What do you think?

Cheers!

edit - Pinging people in the org that are not watching the meta repo (not sure they receive notifications, might be a good idea to "Watch" the repos in the future?) @Datseris @Alexander-Barth @visr @ali-ramadhan @aramirezreyes @glwagner @hdrake @milankl @natgeo-wong @navidcy @simonbyrne

gaelforget commented 4 years ago

This will probably lead to a discussion about the focus of the organization. I'd say that anything close to climate data, climate models and climate analysis could be a potential candidate.

Yes -- thanks @balinus for getting it started

What do you think?

This is more or less what I had in mind although I would probably add education & outreach to scope.

Also, I was wondering whether we should restrict ourselves strictly to package repos. I am thinking that Jupyter notebook folders, for example, could be welcome -- often I separate those out from packages + they are nice for science contributions, education purposes, demos via e.g. Binder, etc.

Balinus commented 4 years ago

This is more or less what I had in mind although I would probably add education & outreach to scope.

Also, I was wondering whether we should restrict ourselves strictly to package repos. I am thinking that Jupyter notebook folders, for example, could be welcome -- often I separate those out from packages + they are nice for science contributions, education purposes, demos via e.g. Binder, etc.

Totally agree about a broader scope of subjects and adding education & outreach.

The idea about adding Jupyter notebooks is really good! This will be especially useful to get people started on JuliaClimate packages & subjects, as well as starting in Julia in general. Should we build a repo with notebooks or should we add guidelines for package maintainers to add notebooks to their repos?

Speaking of guidelines, it might be a good idea to write something that describes the org and our guidelines (once it's mature).

Datseris commented 4 years ago

Thanks for the ping, I was indeed not watching this repo! I'd say to just wait a bit for the "so much anticipated discourse post" about the scope of all these organizations.

I agree with your scopes, and I also want to add that for some reason I think JuliaEarth and JuliaClimate should become one: on one hand both orgs have modelling as an aim, on the other they both aim to analyze spatiotemporal fields and other data that are defined as fields over the earth (or other planets, there is no difference in the analysis process) through a series of techniques.

At the moment it seems to me that JuliaEarth is the "dedicated organization for GeoStats.jl, and all its dependencies", which I am not sure makes much sense. I think the scopes of JuliaEarth and JuliaClimate are the same and only the mathematical methods differ, as JuliaEarth has a clear focus on statistical and stochastic approaches.

gaelforget commented 4 years ago

I propose we list potential repos that could be included in the JuliaClimate organization.

In the short term, I plan to transfer MeshArrays.jl + IndividualDisplacements.jl + NCTiles.jl to the org.

The first two basically cover what I presented in my JuliaCon 2018 presentation, in relation to which I registered this org a while back. NCTiles.jl bridges between these two and NetCDF.jl or NCDatasets.jl with a focus on C-grid support. As a side note, I discussed with @lmilechin the notion of adding support for ClimGrid, which might not be too hard.

The current use cases for these packages are based on MITgcm (ocean, atmosphere, seaice, bgc) and ECCO (state estimation) but the idea is to support analysis of popular climate model output on their native grids in parallel with multiple cores. Integration with packages like @milankl 's Juls.jl or @ali-ramadhan 's Oceananigans.jl has also been on my mind. Am hoping that both types of extension could happen fairly soon (tbd).

gaelforget commented 4 years ago

At the moment it seems to me that JuliaEarth is the "dedicated organization for GeoStats.jl, and all its dependencies", which I am not sure makes much sense. I think the scopes of JuliaEarth and JuliaClimate are the same and only the mathematical methods differ, as JuliaEarth has a clear focus on statistical and stochastic approaches.

Sounds like most of what is currently in JuliaEarth might fit nicely in Geo and Stat orgs.

... "so much anticipated discourse post" about the scope of all these organizations

Personally I don't mind having a multiplicity of organizations, am not surprised to see some overlap between various people's projects, and probably would not obsess about defining a rigid coordination of orgs. In the future, I would expect to see more of them getting created, reorganized, or split as the community grows. Folks who are not yet in academia or the Julia community should be able to have a say about these things ...

Datseris commented 4 years ago

Your last sentence raises a fair point I must admit.

Balinus commented 4 years ago

On my side I will migrate ClimateTools once ClimateBase and ClimatePlots are registered. Should happen in the next couple of weeks.

Balinus commented 4 years ago

As a side note, I discussed with @lmilechin the notion of adding support for ClimGrid, which might not be too hard.

Nice! Perhaps this will be a good moment to refine the type. Do not hesitate to ping me for questions or suggestions!

milankl commented 4 years ago

Integration with packages like @milankl 's Juls.jl or [...] has also been on my mind.

Happy for Juls.jl to be included. It's not yet registered, but that should happen fairly soon.

natgeo-wong commented 4 years ago

I am working on packages to download/analyze ECMWF reanalysis data, precipitation satellite data, and will eventually be also creating packages to analyse data from GCM outputs such as Isca, SAM, CESM and E3SM. I'm not sure if people would suggest combining these packages into more general packages, however.

The satellite package might go into JuliaGeo instead, depending on what satellites people want to use, and people should be able to contribute their own satellite retrieval algorithms.

I can also add some other minor things like GillMatsuno.jl that people can use if there is demand for people who just want to do some fun/small stuff with shallow water equations.

I've been pretty slow recently - sorry, was at AMS conference the last 3-4 days. Will get back to coding and other things soon!

visr commented 4 years ago

was at AMS conference the last 3-4 days

Nice! I guess you might have seen @rabernat's talk "What science can learn from open source": https://twitter.com/rabernat/status/1217479076455251970.

Worth mentioning here because perhaps we can draw some inspiration from how the Pangeo project organizes things. They are mostly focused on Python but have reached out to the Julia community on several occasions, see also for instance way 5 here: https://medium.com/pangeo/cmip6-in-the-cloud-five-ways-96b177abe396.

The project's goals may even be quite close to JuliaClimate's if you replace Python with Julia in the first goal :)

  1. Foster collaboration around the open source scientific python ecosystem for ocean / atmosphere / land / climate science.
  2. Support the development with domain-specific geoscience packages.
  3. Improve scalability of these tools to handle petabyte-scale datasets on HPC and cloud platforms.

They have a GitHub organization with lots of material, but none of their core packages (xarray/iris/dask/jupyter) actually live in that organization, since they are mainly existing, often domain-agnostic, packages that are very useful for their domain-specific purposes.

aramirezreyes commented 4 years ago

Hi! Now that @natgeo-wong mentioned SAM, I have a package that is not registered (ramirezreyes/SAMtools.jl) to analyze output of SAM. Now, one point I had against JuliaClimate as a name is that there are many things that have to do with earth science that are not directly climate (I am not using this cloud-resolving model to explore climate, but to explore tropical cyclones in idealized settings). This would fit under an "Earth" umbrella, but I suppose it can also fit here if we clearly specify the scope.

Now, this package needs a lot of work. Would people be interested in me transferring it to JuliaClimate and working on it from there? Or should I continue working on it and, when it gets to a more mature stage, then look to transfer it?

gaelforget commented 4 years ago

This reminds me that I wanted to ask @meggart whether Zarr.jl already has a prospective home organization? I could see JuliaClimate being a good fit, although one might argue that an org dedicated to IO may be best (not unlike for basic NetCDF support).

natgeo-wong commented 4 years ago

Hey @aramirezreyes just checking, I haven't used SAM much, but I was thinking eventually of organising my packages such that the final result of my packages is a standardized output of data (with standardised parameter names, etc.) regardless of the model used (i.e. CESM, SAM, etc.), which would allow for easier analysis.

Of course, this would be applicable only to parameters that are more common, such as precipitation, wind, etc. It should also be able to do things like output daily averages (for hourly output), or monthly and yearly averages, standard deviation and variance, and so on.
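
To make the "daily averages from hourly output" idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the reduction step, not code from any of the packages mentioned in this thread; the function name `daily_stats`, the parameter `steps_per_day` and the sample values are all hypothetical.

```python
from statistics import fmean, pvariance


def daily_stats(hourly, steps_per_day=24):
    """Reduce an hourly series to one (mean, variance) pair per day."""
    if len(hourly) % steps_per_day != 0:
        raise ValueError("series length must be a whole number of days")
    stats = []
    for start in range(0, len(hourly), steps_per_day):
        day = hourly[start:start + steps_per_day]
        # Population variance, since each day is treated as a complete record.
        stats.append((fmean(day), pvariance(day)))
    return stats


# Two synthetic days of hourly precipitation (hypothetical values, mm/h).
hourly_precip = [0.0] * 24 + [1.0] * 12 + [3.0] * 12
result = daily_stats(hourly_precip)
```

The same loop would presumably work for any model's output once variable names and time steps have been standardized, which is the point of the proposal above.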

aramirezreyes commented 4 years ago

That would make sense and would give a lot of meaning to those packages being in an organization.

hdrake commented 4 years ago

I'll just mention Mimi - a framework for integrated assessment (climate-economic) modelling. May be a good group to get ideas from / partner with. Thoughts, @davidanthoff? https://github.com/mimiframework/Mimi.jl

Balinus commented 4 years ago

One repo that should in theory be in JuliaClimate is CFTime.jl, CF standing for "Climate-Forecast". It handles calendars that are mostly designed for climate models.

@Alexander-Barth
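
For readers who have not met such calendars: one common example in climate model output is the CF "360_day" calendar, in which every year has 12 months of 30 days. The sketch below shows the date arithmetic for that one calendar in plain Python. It is not the CFTime.jl API; the function name `days_to_date_360` and the default epoch are hypothetical choices made for illustration.

```python
def days_to_date_360(days_since_epoch, epoch=(2000, 1, 1)):
    """Convert a day offset to (year, month, day) in a 360-day calendar."""
    year0, month0, day0 = epoch
    # Express the epoch as a day count from year 0, then add the offset.
    total = year0 * 360 + (month0 - 1) * 30 + (day0 - 1) + days_since_epoch
    year, rem = divmod(total, 360)
    month, day = divmod(rem, 30)
    return year, month + 1, day + 1


first = days_to_date_360(0)
feb_30 = days_to_date_360(59)
next_year = days_to_date_360(360)
```

Note that an offset of 59 days lands on 30 February, a date that does not exist in the standard calendar, which is why generic date libraries are not enough for this kind of data.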

davidanthoff commented 4 years ago

This here sounds exciting! I think our focus for Mimi.jl is less on the natural science side of climate change (which is probably the focus here?), but maybe a good idea would be to provide links from both communities to each other to point folks there?

The general repo/org structure we have for the Mimi story is that github.com/mimiframework hosts the generic Mimi.jl framework (and https://www.mimiframework.org/ is the public-facing web page). Individual models are then either hosted under the lab where they are developed (e.g. github.com/anthofflab), a model-specific org (e.g. https://github.com/fund-model) or some individual research user accounts. We just track all of them on the www.mimiframework.org page, and in our custom registry.

I think if we ended up having packages that are not Mimi specific, but generic tools that relate to climate science, it would probably make sense to move them here, right? But right now I don't think we have anything like that :)

rabernat commented 4 years ago

Hi Folks. Thanks for tagging me in this thread. I'm new to Julia but excited to learn more. I have been heavily involved in Pangeo from the beginning, and I'm happy to share a few thoughts about how to organize and sustain a community effort like this.

My first point would be that it is very easy to start ambitious, new projects. It is much harder to sustain them in the long term. Our strategy with Pangeo has been to identify the packages which are useful to the climate community (xarray, dask, zarr, etc.) and provide a semi-unified public face for them by tying them to specific scientific use cases. This has proved successful in terms of fundraising and community building. (By the way, do you know about https://github.com/esa-esdl/ESDL.jl? Seems very relevant here.)

Second--I would try to make your tent as big as possible. For example, xarray and dask aren't climate-specific at all. They are used by all sorts of scientists in different domains. By collaborating with these people, we can leverage a larger community of contributors. Taking stock of what the general-purpose stack could and couldn't do was a useful exercise for the Pangeo community and led, among other things, to the development of CFTime. Working on scientific software is already fairly thankless career-wise, so let's try to make our effort go as far as possible!

Finally, I would really encourage you to keep in mind interoperability with other languages. The Jupyter project can be an inspiration to us all here. I believe that the future of climate data science will involve interoperability between languages, specifically Python, Julia and R. We should be thinking about data formats, protocols, standards, etc. that allow users to easily move between the three. (I had this in mind explicitly with the CMIP6 project--see https://github.com/pangeo-data/pangeo-julia-examples/.) Our big data analytics stack in Python (xarray, dask, etc.) is pretty capable right now, and re-creating these same capabilities in Julia may not be a high priority. On the other hand, I'm really excited to try out some new modeling and analysis capabilities in Julia (e.g. https://github.com/CoherentStructures/CoherentStructures.jl). How can we make it easy to take advantage of what each language has to offer, given the limited time we have as scientists to spend on software development?

Pangeo does indeed identify itself as a Python-specific project on the website, but the project is evolving. We would love to help facilitate collaboration across languages. With this in mind, I invite anyone interested to stop by one of our weekly calls if you want to discuss any of these issues in more depth: http://pangeo.io/meeting-notes.html

Good luck with JuliaClimate! I look forward to seeing what emerges from this effort!

gaelforget commented 4 years ago

Looping in @thabbott and @briochemc

meggart commented 4 years ago

This reminds me that I wanted to ask @meggart whether Zarr.jl already has a prospective home organization? I could see JuliaClimate being a good fit, although one might argue that an org dedicated to IO may be best (not unlike for basic NetCDF support).

We have not found a home for Zarr.jl yet so why not JuliaClimate. Given that we have the CMIP6 data hosted on GCS in zarr format, it might be natural to have this here. Also, @visr is co-author of the package and already has an overview of the code.

Balinus commented 4 years ago

Hi Folks. Thanks for tagging me in this thread. I'm new to Julia but excited to learn more. I have been heavily involved in Pangeo from the beginning, and I'm happy to share a few thoughts about how to organize and sustain a community effort like this.

My first point would be that it is very easy to start ambitious, new projects. It is much harder to sustain them in the long term. Our strategy with Pangeo has been to identify the packages which are useful to the climate community (xarray, dask, zarr, etc.) and provide a semi-unified public face for them by tying them to specific scientific use cases. This has proved successful in terms of fundraising and community building. (By the way, do you know about https://github.com/esa-esdl/ESDL.jl? Seems very relevant here.)

Welcome! Those are very good points. This is even more relevant for a young community such as Julia's. I think that this point connects naturally with your third comment about interoperability. We shouldn't be shy about leveraging packages in other languages if the implementation is mature. Of course, in the long run the ideal scenario would be to have everything in Julia, just for the benefit of having everything coded in Julia as opposed to having a mix of Python, C/C++ and Julia code. The main advantage here is that it is much easier for (climate) students to develop core functionalities. This last point connects to one of the goals proposed by @gaelforget for the JuliaClimate org: outreach and education of science through scientific computation (I freely rephrased!).

Second--I would try to make your tent as big as possible. For example, xarray and dask aren't climate-specific at all. They are used by all sorts of scientists in different domains. By collaborating with these people, we can leverage a larger community of contributors. Taking stock of what the general-purpose stack could and couldn't do was a useful exercise for the Pangeo community and led, among other things, to the development of CFTime. Working on scientific software is already fairly thankless career-wise, so let's try to make our effort go as far as possible!

Also agree with that. I have always found that climate research connects to a lot of core science fields. One aspect that I feel is not developed is the "out-of-core" + "distributed" calculations offered by xarray. I discovered ESDL.jl a couple of weeks ago and I feel that the base implementation is a good start in that direction. Perhaps one discussion we should have is indeed to look at where we should put some of our effort, with the aim of the whole being greater than the sum of its parts. Perhaps we'll feel that we should just leverage xarray and dask in the mid-term and develop other things, or perhaps Julia can offer something more than xarray and dask.

Finally, I would really encourage you to keep in mind interoperability with other languages. The Jupyter project can be an inspiration to us all here. I believe that the future of climate data science will involve interoperability between languages, specifically Python, Julia and R. We should be thinking about data formats, protocols, standards, etc. that allow users to easily move between the three. (I had this in mind explicitly with the CMIP6 project--see https://github.com/pangeo-data/pangeo-julia-examples/.) Our big data analytics stack in Python (xarray, dask, etc.) is pretty capable right now, and re-creating these same capabilities in Julia may not be a high priority. On the other hand, I'm really excited to try out some new modeling and analysis capabilities in Julia (e.g. https://github.com/CoherentStructures/CoherentStructures.jl). How can we make it easy to take advantage of what each language has to offer, given the limited time we have as scientists to spend on software development?

One advantage (I feel) of Julia is the interoperability. From experience, it is easier to use Python from Julia than the other way around though. When I began in Julia, my functions and scripts were a mix of using PyCall and using MATLAB! For that matter, ClimatePlots is simply calling basemap. A discussion about where we should put our effort and where we can find fundraising opportunities is mandatory.

Pangeo does indeed identify itself as a Python-specific project on the website, but the project is evolving. We would love to help facilitate collaboration across languages. With this in mind, I invite anyone interested to stop by one of our weekly calls if you want to discuss any of these issues in more depth: http://pangeo.io/meeting-notes.html

This is really interesting to read and it's probably a natural evolution of such a project. Language is just a tool for scientific research and, again, interoperability is key here.

hdrake commented 4 years ago

This is fairly last minute, but for those in Boston, it may be worth bringing up JuliaClimate at the Open Julia Users Night at MIT tonight. @natgeo-wong @gaelforget @ali-ramadhan @glwagner

ali-ramadhan commented 4 years ago

Good point @hdrake. I'll definitely be there so I can add a slide/elevator pitch about @JuliaClimate unless someone else wants to do it (@gaelforget or @natgeo-wong?).

alanedelman commented 4 years ago

This group seems energized (bad choice of words?) so I don't need to say much more, but if I can add a little bit of encouragement, it is very clear to me that all of the following are desirable, and julia is best to make these happen:

So encouragement! encouragement! encouragement!

oh and for those at or near mit, i'll probably stop by at 4pm for a little bit

gaelforget commented 4 years ago

Good point @hdrake. I'll definitely be there so I can add a slide/elevator pitch about @JuliaClimate unless someone else wants to do it (@gaelforget or @natgeo-wong?).

Cool. Will be there too

gaelforget commented 4 years ago

  • An ideal universe allows the ability to mix and match the above two models with insurance models or other kinds of data. One gets the feeling that traditional codes are siloed, or perhaps the data from a run is available, but not the ability to run.

I like your other points too, @alanedelman, but can't resist the opportunity to mention this global ocean model setup that you can run in forward mode (free) and adjoint ($ compiler). Am hoping to rerun the included AWS recipe between now and our IAP session Thu. next week...

So encouragement! encouragement! encouragement!

Thank you very much for this!!

natgeo-wong commented 4 years ago

@ali-ramadhan @gaelforget I'm swamped in meetings today so I'm not going to be around! I do look forward to seeing what you guys present tho!

meggart commented 4 years ago

I discovered ESDL.jl a couple of weeks ago and I feel that the base implementation is a good start in that direction.

Thanks for checking out ESDL.jl. Since it was mentioned here I will say a few words about the status of the package. The package was started for a project in 2015 (Julia 0.4), has grown organically in our lab and is now used by a second generation of PhD students. It basically grew out of the need to do out-of-core time series analysis on a particular geospatial dataset, so we implemented an ad-hoc dimensional data type, some plotting tools etc., but the core was always the mapCube function, which provides a quite powerful interface for mapping functions on arbitrary slices of multiple out-of-core datasets (doing broadcasting on the fly). So this processing was always the main focus of development and is IMO very well tested and has already seen a lot of corner cases. You can combine multi-process with multi-threaded computations while all parallelization happens gracefully in the background. However, the tools for data selection, subsetting, slicing, exporting results etc. have more or less been added gradually as needed, so they don't follow a very well designed common interface. Also, keeping stuff like plotting tools up to date in particular is an uphill battle against a moving target (Julia versions from 0.4 to 1.3, countless versions and re-designs of WebIO), which sometimes makes the user experience a bit unpleasant. In addition, the package is heavily under-documented, so I think it takes a lot of stamina to learn the package when not being part of our lab. So I am looking forward to basing the processing tools from ESDL on new developments like DimensionalData.jl in the future, and hopefully to using some great Geo-related pure Julia plotting package soon, which should relieve a lot of the maintenance burden.

@Balinus from your post it looks like you found the out-of-core processing from ESDL too limiting. Some feedback would be very welcome, maybe in an issue at ESDL.jl? Can you describe exactly what kind of workflow you had in mind that would only be possible with xarray/dask and not in ESDL.jl? From the processing side I find ESDL's mapslices and mapCube more intuitive than xarray's apply_ufunc, e.g. I would not know how to efficiently do this workflow in xarray (fitting a PCA on multivariate time series for every pixel in a global dataset).

perhaps Julia can offer something more than xarray and dask.

I have been learning/ using xarray+dask quite extensively in the last months, so would be happy to join this discussion.
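
For readers unfamiliar with the out-of-core pattern discussed above (apply a function to every pixel's time series without loading the whole dataset), here is a minimal sketch of the idea in plain Python. It does not reproduce ESDL's mapCube or xarray's API; `map_over_pixels`, `read_chunk` and the toy dataset are hypothetical names invented for this illustration, and the "disk read" is simulated with an in-memory list.

```python
def map_over_pixels(read_chunk, n_pixels, chunk_size, func):
    """Apply func to each pixel's time series, loading one chunk at a time."""
    results = []
    for start in range(0, n_pixels, chunk_size):
        stop = min(start + chunk_size, n_pixels)
        # Only this chunk is held in memory; a real reader would hit disk here.
        chunk = read_chunk(start, stop)
        results.extend(func(series) for series in chunk)
    return results


# Hypothetical stand-in for an on-disk dataset: 5 pixels, 4 time steps each.
dataset = [[p + t for t in range(4)] for p in range(5)]


def read_chunk(start, stop):
    """Return the time series for pixels start..stop-1 (simulated read)."""
    return dataset[start:stop]


means = map_over_pixels(read_chunk, n_pixels=5, chunk_size=2,
                        func=lambda series: sum(series) / len(series))
```

Because each chunk is independent, the loop body is also the natural unit to hand to separate processes or machines, which is where the parallelization questions in this thread come in.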

Balinus commented 4 years ago

@Balinus from your post it looks like you found the out-of-core processing from ESDL too limiting. Some feedback would be very welcome, maybe in an issue at ESDL.jl?

Hi meggart, thanks for the feedback!

I guess that what I found "lacking" about the out-of-core processing is simply related to the fact that I haven't understood how to set up the calculations to use multiple machines (and not to any conceptual problem). For instance, I have access to a cluster of ~200 machines with 32 cores each and we use Slurm as a scheduler. This (clusters) is very new to me, so sorry if the solution is evident: say I want mapCube to launch calculations on a given Cube on 20 machines, should I set up something particular in ESDL or should I design my process outside of ESDL (perhaps using ClusterManagers.jl)? I know how to use ClusterManagers to launch independent tasks/simulations on the cluster, but how can I share the same Cube and distribute it over all available machines? That part is not evident to me yet.

In other words, I think what ESDL needs to grow outside of your lab is some "Getting started" tutorial where people can understand a little bit more about the internals of ESDL. Perhaps it could simply be done by the PhD students: when they begin learning the tool, they write their thoughts in the "Getting started" tutorial? Or what is explained to PhD students should be written in this tutorial. Perhaps I missed the right tutorial, as I know there are some on esdl-shared, but I haven't found examples of ESDL setup. In that case, something in the README.md would be enough!

On my side, I will try to contribute to it, especially if I can find a way to use the previously mentioned cluster.

N.B. just tried to read the docs to make sure I haven't missed a tutorial but the link seems broken.

gaelforget commented 4 years ago

Thanks a lot @meggart & @balinus for the great discussion on Xarray, distributed, etc. Very useful!

I have had those in mind in designing MeshArrays & NCTiles too. I might suggest we continue this thread in #2 to keep the scope thread as short and focused as possible -- would that be ok?

gaelforget commented 4 years ago

In case you have not gotten a chance to express opinions about priorities in #3 please consider doing so (I still need to do that myself ...) Ideally it would be great if most of the (currently) 20 org members could weigh in by end of Feb.

On a related note it seems that most members have not made their membership public. That's totally fine if this indeed is your preference but I thought maybe it's just an oversight in some cases.

gaelforget commented 4 years ago

This is fairly last minute, but for those in Boston, it may be worth bringing up JuliaClimate at the Open Julia Users Night at MIT tonight. @natgeo-wong @gaelforget @ali-ramadhan @glwagner

Just to follow up on this for those who weren't there -- myself, @ali-ramadhan @glwagner & @hdrake did show up and advertised the org which I feel like was well received.

FYI another opportunity for advertisement will be this workshop on Tue where I intend to at least mention the org along with JuliaOcean that I imagine could focus on the more topical, oceanography stuff

natgeo-wong commented 4 years ago

Hi guys! Just a note that I've pulled ClimateEasy.jl out of the stack because I'm deprecating it in favour of GeoRegions.jl https://github.com/natgeo-wong/GeoRegions.jl

Currently writing up documentation for GeoRegions.jl and if anyone has any suggestions please let me know!