Use Data Version Control

dsj976 commented 3 weeks ago

Following the discussion in #9, we realized that we are likely going to end up with many data files with similar names containing similar chunks of data. We are currently saving era5-download slices with the following name format:

start_datetime-end_datetime-delta_time.nc, e.g. 2019-01-01T00_2019-01-05T00_1h.nc.

This file name should not contain the list of variables and pressure levels of the dataset, as that list could be very long. So the file name by itself does not provide all of the information about the dataset, and one needs to open the dataset with xarray to read the attributes and get all of the information. If we are not careful, files like this can easily get overwritten (e.g. by an ERA5 slice defined over the same time period and with the same delta time, but containing different variables and pressure levels).
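To make the clash concrete, here is a minimal Python sketch of the naming scheme described above. The `SliceRequest` class and its field names are hypothetical and are not taken from the DMD-ERA5 code; the only thing borrowed from the issue is the file name pattern.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SliceRequest:
    """Hypothetical description of one ERA5 slice download."""

    start: datetime
    end: datetime
    delta_hours: int
    variables: tuple
    levels: tuple

    @property
    def filename(self):
        # Only the time range and delta time go into the name;
        # variables and levels are deliberately left out.
        fmt = "%Y-%m-%dT%H"
        return (
            f"{self.start.strftime(fmt)}_{self.end.strftime(fmt)}"
            f"_{self.delta_hours}h.nc"
        )


start, end = datetime(2019, 1, 1, 0), datetime(2019, 1, 5, 0)
temperature = SliceRequest(start, end, 1, ("temperature",), (1000,))
wind = SliceRequest(start, end, 1, ("u_component_of_wind",), (10,))

# Two different requests map to the same file name.
print(temperature.filename)
print(temperature.filename == wind.filename)
```

Both requests resolve to the same path, so whichever is saved second would silently replace the first unless something else tracks the versions.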

I suggest that, together with this naming convention, we use Data Version Control (DVC). DVC is used in parallel with Git to track different versions of a dataset. It lets us save metadata about a particular data file version in a .dvc file, to which we can add the attributes we are currently saving in the .nc file. All ERA5 slices downloaded with era5_download would still be saved following the start_datetime-end_datetime-delta_time.nc naming convention, but DVC would be used to avoid name clashes and overwritten files.

dsj976 commented 3 weeks ago

Tagging @LydiaFrance and @louisavz

LydiaFrance commented 3 weeks ago

Yes, sounds good to me. I am just linking this as well https://github.com/ClimeTrend/dmd-turing-proj-mgmt/issues/14

louisavz commented 3 weeks ago

That sounds good to me too. @dsj976 thanks for writing this up and @LydiaFrance for linking to the project management board!

dsj976 commented 2 weeks ago

Working on this branch: https://github.com/ClimeTrend/DMD-ERA5/tree/use-dvc

dsj976 commented 1 week ago

Just to add a quick summary of how DVC is implemented in DMD-ERA5. Still work-in-progress but it's almost done.

DVC works together with Git. When you run dvc init in an existing Git repo, it creates a .dvc folder, similar to the .git folder.
Data files are tracked by DVC with a dvc file. For instance, a file called 2019-01-01T00_2019-01-01T04_1h.nc will be tracked by a file called 2019-01-01T00_2019-01-01T04_1h.nc.dvc.
The dvc file contains an md5 hash that uniquely identifies the data file version. A typical dvc file looks like this:

outs:
- md5: e20d089b9be83af3492796fa46f90f9d
  size: 20784411
  hash: md5
  path: 2019-01-01T00_2019-01-01T04_1h.nc

The dvc file must be committed to the Git repo; the data file itself is not committed.
One can retrieve an old version of a data file by doing the following:

git checkout relevant-commit-sha path-to-dvc-file
dvc checkout

Here relevant-commit-sha is the Git commit in which the dvc file changed, i.e. the commit where DVC recorded a new version of the corresponding data file.

To keep a detailed log of the history of a data file, I have created functionality that produces a YAML file called data-file-name.yaml, e.g. 2019-01-01T00_2019-01-01T04_1h.nc.yaml, which looks like this:

3611cb454dac3cf432dfcb44ced94fba:
  source_path: gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2
  start_datetime: 2019-01-01T00:00:00
  end_datetime: 2019-01-01T04:00:00
  hours_delta_time: 1.0
  variables: ['temperature']
  levels: [1000]
  date_downloaded: 2024-11-06T17:20:44.157576
24e894fa0997e1d129a499c49f022ce0:
  source_path: gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2
  start_datetime: 2019-01-01T00:00:00
  end_datetime: 2019-01-01T04:00:00
  hours_delta_time: 1.0
  variables: ['u_component_of_wind']
  levels: [10]
  date_downloaded: 2024-11-06T17:29:43.395375
e20d089b9be83af3492796fa46f90f9d:
  source_path: gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2
  start_datetime: 2019-01-01T00:00:00
  end_datetime: 2019-01-01T04:00:00
  hours_delta_time: 1.0
  variables: ['v_component_of_wind']
  levels: [10]
  date_downloaded: 2024-11-07T14:17:12.406039

The YAML file is also tracked by Git. Each entry in the file has a header that corresponds to the relevant md5 hash in the dvc file. As you can see, the metadata about each data file version corresponds to the attributes that are added to the xarray.Dataset. Following this strategy, one can identify the correct version of the data file to check out depending on the desired variables or pressure levels. All of this will be automated, so the user doesn't need to figure it out themselves.
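As a rough illustration of that lookup (not the actual implementation, which will be in the PR), the sketch below searches a log for versions that contain the requested variables and levels. It assumes the YAML file has already been loaded into a dict keyed by md5 hash, for example with PyYAML's `yaml.safe_load`; the `find_versions` name and the "contains all requested items" matching rule are my own assumptions.

```python
def find_versions(log, variables, levels):
    """Return the md5 hashes of versions holding all requested variables and levels."""
    wanted_vars, wanted_levels = set(variables), set(levels)
    return [
        md5
        for md5, meta in log.items()
        if wanted_vars <= set(meta["variables"])
        and wanted_levels <= set(meta["levels"])
    ]


# The three entries from the example log, reduced to the fields used here.
log = {
    "3611cb454dac3cf432dfcb44ced94fba": {
        "variables": ["temperature"],
        "levels": [1000],
    },
    "24e894fa0997e1d129a499c49f022ce0": {
        "variables": ["u_component_of_wind"],
        "levels": [10],
    },
    "e20d089b9be83af3492796fa46f90f9d": {
        "variables": ["v_component_of_wind"],
        "levels": [10],
    },
}

print(find_versions(log, ["v_component_of_wind"], [10]))
```

The returned hash can then be matched against the md5 in the dvc file's Git history to find the commit to check out. An empty list means no stored version covers the request and a new download is needed.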

More details in the incoming PR.

ClimeTrend / DMD-ERA5

Use Data Version Control #10