Closed anton-seaice closed 1 month ago
Thanks @anton-seaice. We don't actually build this datastore (or any of the other CMIP datastores); we just include them as built by NCI. The mix of symlinks and hardlinks is a (somewhat annoying) feature of the NCI CMIP datastores. Fortunately, these datastores have a column that indicates whether each file is a hardlink or a symlink, so the workaround is to include that in your filter:
cat["cmip6_fs38"].search(
    realm="seaIce",
    frequency="mon",
    member_id="r1i1p1f1",
    experiment_id="historical",
    source_id="ACCESS-CM2",
    variable_id="sifb",
    file_type="f",
).df.path
Ah right - I guess you've already talked to them / someone at NCI about changing it?
Yeah - I don't think that's an option
Describe the bug
In some datasets in the FS38 project, the catalog returns multiple entries for the same files because there are symlinks used within the file structure.
e.g.
lrwxrwxrwx 1 fo3_esgfpub 80 Sep 3 2020 /g/data/fs38/publications/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/SImon/sifb/gn/v20200817/sifb_SImon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc -> ../files/d20200817/sifb_SImon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc*
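To illustrate why the two catalogue entries are really one file, here is a minimal sketch that resolves each path and keeps a single entry per underlying file. It assumes only the Python standard library and a plain list of path strings; `drop_symlink_duplicates` is a hypothetical helper for illustration, not part of the catalog or of any NCI tooling.

```python
import os


def drop_symlink_duplicates(paths):
    """Keep one path per underlying file, preferring non-symlink entries.

    Hypothetical helper: `paths` is any iterable of path strings.
    """
    chosen = {}
    for p in paths:
        # realpath follows every symlink in the path to the actual file
        real = os.path.realpath(p)
        if real not in chosen:
            chosen[real] = p
        elif os.path.islink(chosen[real]) and not os.path.islink(p):
            # Replace a symlink entry with the entry that is not a symlink
            chosen[real] = p
    return list(chosen.values())
```

Passing the symlinked path shown above together with its target under `../files/` should leave only the target, since both resolve to the same file.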
To Reproduce
which returns:
And prevents the data being loaded using to_dask(), because one entry in the catalogue is just a symlink to the other file.
Additional context
A workaround is to add an extra filter term, to only catch one of the entries. e.g: