Open dsj976 opened 3 weeks ago
Tagging @LydiaFrance and @louisavz
Yes, sounds good to me. I am just linking this as well https://github.com/ClimeTrend/dmd-turing-proj-mgmt/issues/14
That sounds good to me too. @dsj976 thanks for writing this up and @LydiaFrance for linking to the project management board!
Working on this branch: https://github.com/ClimeTrend/DMD-ERA5/tree/use-dvc
Just to add a quick summary of how DVC is implemented in `DMD-ERA5`. Still work-in-progress but it's almost done.
- Running `dvc init` in an existing Git repo creates a `.dvc` folder, similar to the `.git` folder.
- Each data file is tracked by a corresponding `dvc` file. For instance, a file called `2019-01-01T00_2019-01-01T04_1h.nc` will be tracked by a file called `2019-01-01T00_2019-01-01T04_1h.nc.dvc`.
- The `.dvc` file contains an `md5` hash that uniquely identifies the data file version. A typical `dvc` file looks like this:

  ```yaml
  outs:
  - md5: e20d089b9be83af3492796fa46f90f9d
    size: 20784411
    hash: md5
    path: 2019-01-01T00_2019-01-01T04_1h.nc
  ```
- The `dvc` file must be committed to the Git repo; the data file does not.
- To check out a specific version of a data file:

  ```
  git checkout relevant-commit-sha path-to-dvc-file
  dvc checkout
  ```

  where `relevant-commit-sha` is the Git commit associated with a change to the `dvc` file, i.e. the commit where DVC logged the changes to the corresponding data file.
- Metadata about each data file version is stored in `data-file-name.yaml`, e.g. `2019-01-01T00_2019-01-01T04_1h.nc.yaml`, which looks like this:

  ```yaml
  3611cb454dac3cf432dfcb44ced94fba:
    source_path: gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2
    start_datetime: 2019-01-01T00:00:00
    end_datetime: 2019-01-01T04:00:00
    hours_delta_time: 1.0
    variables: ['temperature']
    levels: [1000]
    date_downloaded: 2024-11-06T17:20:44.157576
  24e894fa0997e1d129a499c49f022ce0:
    source_path: gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2
    start_datetime: 2019-01-01T00:00:00
    end_datetime: 2019-01-01T04:00:00
    hours_delta_time: 1.0
    variables: ['u_component_of_wind']
    levels: [10]
    date_downloaded: 2024-11-06T17:29:43.395375
  e20d089b9be83af3492796fa46f90f9d:
    source_path: gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2
    start_datetime: 2019-01-01T00:00:00
    end_datetime: 2019-01-01T04:00:00
    hours_delta_time: 1.0
    variables: ['v_component_of_wind']
    levels: [10]
    date_downloaded: 2024-11-07T14:17:12.406039
  ```
The YAML file is also tracked by Git. Each entry in the file has a header that corresponds to the relevant `md5` hash in the `dvc` file. As you can see, the metadata about each data file version corresponds to the attributes that are added to the `xarray.Dataset`. Following this strategy, one can identify the correct version of the data file to check out depending on the desired variables or pressure levels. All of this will be automated, so the user doesn't need to figure it out themselves.
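As a rough illustration of that lookup, the sketch below picks the `md5` hash whose metadata entry covers the requested variables and levels. The metadata is hard-coded as a dict mirroring the YAML example (in practice it would be parsed from the YAML file), and `find_version` is a hypothetical helper name chosen for this example, not the function used in `DMD-ERA5`.

```python
# Metadata mirroring the YAML example, keyed by md5 hash.
# Hard-coded for illustration; in practice it would be read from the YAML file.
METADATA = {
    "3611cb454dac3cf432dfcb44ced94fba": {
        "variables": ["temperature"],
        "levels": [1000],
    },
    "24e894fa0997e1d129a499c49f022ce0": {
        "variables": ["u_component_of_wind"],
        "levels": [10],
    },
    "e20d089b9be83af3492796fa46f90f9d": {
        "variables": ["v_component_of_wind"],
        "levels": [10],
    },
}


def find_version(metadata, variables, levels):
    """Return the md5 hash of the first entry that contains all requested
    variables and levels, or None if no entry matches (hypothetical helper)."""
    for md5, attrs in metadata.items():
        has_variables = set(variables) <= set(attrs["variables"])
        has_levels = set(levels) <= set(attrs["levels"])
        if has_variables and has_levels:
            return md5
    return None


md5 = find_version(METADATA, variables=["v_component_of_wind"], levels=[10])
print(md5)  # e20d089b9be83af3492796fa46f90f9d
```

The returned hash is the one to look for in the Git history of the `dvc` file in order to find the commit to check out.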
More details in the incoming PR.
Following the discussion in #9, we realized that we are likely going to end up with many data files with similar names containing similar chunks of data. We are currently saving `era5_download` slices with the following name format: `start_datetime-end_datetime-delta_time.nc`, e.g. `2019-01-01T00_2019-01-05T00_1h.nc`.

This file name should not contain the list of variables and pressure levels of the dataset, as this could be very large. So the file name by itself does not provide all of the information about the dataset, and one needs to open the dataset with `xarray` to read the attributes and get all of the information. If we are not careful, files like this can easily get overwritten (e.g. by an ERA5 slice defined over the same time period and with the same delta time, but containing different variables and pressure levels).

I suggest that together with this name convention, we use Data Version Control (DVC). DVC is used in parallel with Git to track different versions of a dataset. It allows us to save metadata about a particular data file version in a `.dvc` file, to which we can add the attributes we are currently saving in the `.nc` file. All ERA5 slices downloaded with `era5_download` would be saved following the `start_datetime-end_datetime-delta_time.nc` name convention, but to avoid name clashes and overwriting files, DVC would be used.
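To make the name clash concrete, here is a minimal sketch of the naming convention. `slice_filename` is a hypothetical helper written for this example (not the actual `era5_download` code), and the underscore separator and hour formatting are inferred from the example file name.

```python
from datetime import datetime


def slice_filename(request):
    """Build a file name from the time range and time step only.

    Hypothetical helper: variables and levels are deliberately ignored,
    mirroring the naming convention described in this issue.
    """
    fmt = "%Y-%m-%dT%H"
    start = request["start"].strftime(fmt)
    end = request["end"].strftime(fmt)
    return f"{start}_{end}_{request['delta_hours']}h.nc"


# Two requests over the same period that differ only in variables and levels.
request_a = {
    "start": datetime(2019, 1, 1, 0),
    "end": datetime(2019, 1, 5, 0),
    "delta_hours": 1,
    "variables": ["temperature"],
    "levels": [1000],
}
request_b = dict(request_a, variables=["u_component_of_wind"], levels=[10])

name_a = slice_filename(request_a)
name_b = slice_filename(request_b)
print(name_a)            # 2019-01-01T00_2019-01-05T00_1h.nc
print(name_a == name_b)  # True: the second download would overwrite the first
```

Because both requests map to the same path, something other than the file name (here, DVC) has to distinguish the two versions.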