CliMA / ClimaOcean.jl

🌎 Ocean component for CliMa's Earth system model based on Oceananigans
MIT License

Restoring to ECCO4 data #81

Open simone-silvestri opened 2 months ago

simone-silvestri commented 2 months ago

We need a utility to download and use the ECCO4 fields as restoring data. ECCO4 fields come in NetCDF format, where the state for each day is stored in its own file, like this

There are two main options to do this:

simone-silvestri commented 2 months ago

An additional drawback of method 2 is that we need to preprocess the data anyway because we need to inpaint missing regions.

Mostly for this reason I would probably favour method 1

glwagner commented 2 months ago

Can we also look into the European Copernicus reanalysis data? I wonder also if there is a more intelligent way to download it, like downloading slices or something. I'm unsure about the pros and cons of the different data products, but for the purposes of restoring or even initial conditions, I'm not sure a reanalysis is necessarily worse than ECCO's state estimate (which differs in that ECCO is more dynamically consistent, somehow)

glwagner commented 2 months ago

> Cons: We might be stuck with having a gigantic 555 GB datafile to download (the same problem we will eventually have with the atmosphere)

How long does this take to download? In 2024 we have fast internet, so maybe this is just life and we can accept it. We can build some tools that help users download once and store the data in some common location that ClimaOcean knows about (e.g. independent of individual run scripts). That's what DataDeps did, though I think it's simpler to use an Artifacts.toml.
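A minimal sketch of the "download once, store in a common location" idea, written in Python for illustration only (it is not the DataDeps or Artifacts.toml mechanism). The environment variable `HYPOTHETICAL_DATA_CACHE`, the default cache directory, and the `fetch` callable are all assumptions made up for this example.

```python
import os
from pathlib import Path


def cached_download(filename, fetch, cache_dir=None):
    """Return the local path of `filename`, calling `fetch` only on a cache miss.

    `fetch` is a hypothetical callable that writes the file to the path it is given.
    """
    default_root = Path.home() / ".cache" / "ocean_data"  # assumed default location
    root = Path(cache_dir or os.environ.get("HYPOTHETICAL_DATA_CACHE", default_root))
    root.mkdir(parents=True, exist_ok=True)

    target = root / filename
    if not target.exists():
        # Download to a temporary name first so an interrupted download
        # is never mistaken for a complete file on the next run.
        partial = target.with_name(target.name + ".part")
        fetch(partial)
        partial.replace(target)
    return target
```

Because the cache location does not depend on the run script, a second script asking for the same file reuses the existing copy instead of downloading it again.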

simone-silvestri commented 2 months ago

I have figured out that it is practically impossible to write a single NetCDF file that contains the whole ECCO dataset: we are talking about around 660 GB per variable in Float32 format.

I think we can circumvent this by loading each FieldTimeSeries time index from its own file, implementing a different backend like ECCONetCDFBackend. This will probably not hurt performance much, since we can use the daily means from ECCO (or even the monthly means), so loads will happen rather rarely.
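To make the idea concrete, here is a rough sketch in Python (not the Oceananigans backend interface) of a time series that keeps only a small window of snapshots in memory and reads one file per time index. The class name, the `load_snapshot` reader, and the `window` parameter are hypothetical.

```python
class LazySnapshotSeries:
    """Hold a sliding window of snapshots, loading one file per time index."""

    def __init__(self, filenames, load_snapshot, window=2):
        self.filenames = list(filenames)    # one file per time index
        self.load_snapshot = load_snapshot  # hypothetical reader: path -> data
        self.window = window                # number of snapshots kept in memory
        self.loads = 0                      # counts file reads, for inspection
        self._data = {}
        self._fill(0)

    def _fill(self, start):
        """Move the window so that it begins at time index `start`."""
        stop = min(start + self.window, len(self.filenames))
        fresh = {}
        for n in range(start, stop):
            if n in self._data:
                fresh[n] = self._data[n]  # reuse a snapshot already in memory
            else:
                fresh[n] = self.load_snapshot(self.filenames[n])
                self.loads += 1
        self._data = fresh  # snapshots outside the window are dropped

    def __getitem__(self, n):
        if not 0 <= n < len(self.filenames):
            raise IndexError(n)
        if n not in self._data:
            self._fill(n)
        return self._data[n]
```

With daily or monthly snapshots and a model time step of minutes, the window only moves once per many thousands of steps, which is why the file reads should be a small part of the total cost.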

glwagner commented 2 months ago

Why do you have to write a single nc file?

glwagner commented 2 months ago

Can't you load data from the original nc files on the fly, the same way we do for the JRA55 nc files? This also saves a pre-processing step which is nice.

glwagner commented 2 months ago

You have to define set! and new_backend for the Oceananigans.FieldTimeSeries interface:

You can probably also reuse compute_bounding_indices, moving it to DataWrangling so that both ECCO and JRA55 can use it.

glwagner commented 2 months ago

Don't we also need a new module called ECCO4? That can be a first PR that just defines the module and adds some basic functionality.

simone-silvestri commented 2 months ago

> Can't you load data from the original nc files on the fly, the same way we do for the JRA55 nc files? This also saves a pre-processing step which is nice.

This is exactly what I was suggesting. We always have to preprocess, though, because ECCO files have missing values that have to be filled in to a certain extent.
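As an illustration of what "filling in to a certain extent" can mean, here is a small Python sketch that fills missing cells (NaN) with the mean of their valid neighbours, one layer of cells per pass, for a bounded number of passes. This is a generic neighbour-averaging scheme chosen for the example, not necessarily the inpainting used in ClimaOcean, and the function name and `max_iterations` parameter are hypothetical.

```python
import math


def inpaint(field, max_iterations=10):
    """Fill NaN cells of a 2D list with the mean of their valid 4-neighbours.

    Each pass fills one layer of missing cells next to valid data, so
    `max_iterations` bounds how far into a missing region the fill extends.
    """
    rows, cols = len(field), len(field[0])
    filled = [row[:] for row in field]  # leave the input untouched

    for _ in range(max_iterations):
        updates = {}
        for i in range(rows):
            for j in range(cols):
                if not math.isnan(filled[i][j]):
                    continue
                neighbours = [
                    filled[a][b]
                    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= a < rows and 0 <= b < cols and not math.isnan(filled[a][b])
                ]
                if neighbours:
                    updates[(i, j)] = sum(neighbours) / len(neighbours)
        if not updates:
            break  # nothing left that can be filled
        # Apply all updates at once so a pass only uses values from the previous pass.
        for (i, j), value in updates.items():
            filled[i][j] = value
    return filled
```

Since this step is cheap for a single snapshot, it can run right after each file is read rather than as a separate pass over the whole dataset.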

simone-silvestri commented 2 months ago

Creating a new ECCO4 module is probably unnecessary since the only difference between our current ECCO2 and ECCO4 is the filename to download from. I was thinking of just renaming the ECCO2 module to ECCO4 (since ECCO4 is a little more dynamically consistent)

Another option is to rename the module to ECCO and just extend the download-files dictionary to include both ECCO2 and ECCO4, so we get maximum code reuse.

glwagner commented 2 months ago

JRA55 isn't dynamically consistent and we support that. Is the advantage of ECCO2 that it's higher resolution? Or no?

simone-silvestri commented 2 months ago

Ok, I think it is possible to support ECCO2Daily, ECCO2Monthly and ECCO4Monthly with only a 4-line change in the code. Luckily the structure of the .nc file does not change between these.
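A sketch of why the change can be so small when only the filename differs between products: the version selects a filename template and everything downstream is shared. This is Python for illustration only; the templates below are placeholders, since the real ECCO filenames are not given in this thread, and `snapshot_filename` is a hypothetical helper.

```python
from datetime import date

# Placeholder templates: the actual ECCO filenames are NOT taken from this thread.
FILENAME_TEMPLATES = {
    "ECCO2Daily": "ecco2_daily_{variable}_{year:04d}{month:02d}{day:02d}.nc",
    "ECCO2Monthly": "ecco2_monthly_{variable}_{year:04d}{month:02d}.nc",
    "ECCO4Monthly": "ecco4_monthly_{variable}_{year:04d}{month:02d}.nc",
}


def snapshot_filename(version, variable, when):
    """Build the (placeholder) filename for one snapshot of `variable`."""
    try:
        template = FILENAME_TEMPLATES[version]
    except KeyError:
        raise ValueError(f"unsupported version: {version!r}") from None
    # Monthly templates simply ignore the unused `day` field.
    return template.format(
        variable=variable, year=when.year, month=when.month, day=when.day
    )
```

Supporting another product with the same file structure would then amount to adding one dictionary entry.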

glwagner commented 2 months ago

> > Can't you load data from the original nc files on the fly, the same way we do for the JRA55 nc files? This also saves a pre-processing step which is nice.
>
> This is exactly what I was suggesting. We always have to preprocess though because ECCO files have missing values that have to be filled in to a certain extent

Ok, I was confused since I assumed we would have to do this, so I didn't understand the context of what you were saying. I didn't realize you were trying something different. I think it would help to write a bit more, like "I wanted to explore whether we could avoid loading data from separate .nc files by writing a single huge .nc file. But it turns out that it's too big."

glwagner commented 2 weeks ago

> download the data, build a new ECCO4NetCDFBackend <: AbstractInMemoryBackend which will load individual snapshots in the FieldTimeSeries data. Pros: flexibility with how much data we want to download. Cons: more coding and less code reutilization

I think this is the right way to go.

Keep this overarching goal in mind: our goal is to make it as easy as possible for new users to start using the code, and also to port setups between machines and change setups. Because of this priority, the workflow where we "preprocess a huge dataset and then keep using it for the next 3 years" is not the kind of workflow we want to promote.

Instead we want to promote a workflow where we re-download and re-process data often.

I don't think we want to opt to download huge files and make pre-processing really expensive just to save a bit of coding.