ACCESS-NRI / access-nri-intake-catalog

Tools and configuration info used to manage ACCESS-NRI's intake catalogue
https://access-nri-intake-catalog.rtfd.io
Apache License 2.0

[BUG] Symlink handling #167

Closed anton-seaice closed 1 month ago

anton-seaice commented 2 months ago

Describe the bug

In some datasets in the FS38 project, the catalog returns multiple entries for the same files because there are symlinks used within the file structure.

e.g. lrwxrwxrwx 1 fo3_esgfpub 80 Sep 3 2020 /g/data/fs38/publications/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/SImon/sifb/gn/v20200817/sifb_SImon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc -> ../files/d20200817/sifb_SImon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc*

To Reproduce

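import intake

# Assumes the ACCESS-NRI catalogue is installed and registered with intake
catalog = intake.cat.access_nri
cat = catalog.search(model="ACCESS-CM2", name="cmip6_fs38")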

cat["cmip6_fs38"].search(
    realm="seaIce", 
    frequency="mon", 
    member_id="r1i1p1f1", 
    experiment_id="historical", 
    source_id="ACCESS-CM2",
    variable_id="sifb"
).df.path

which returns:

0    /g/data/fs38/publications/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/SImon/sifb/gn/files/d20200817/sifb_SImon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc
1          /g/data/fs38/publications/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/historical/r1i1p1f1/SImon/sifb/gn/v20200817/sifb_SImon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc
Name: path, dtype: object

This prevents the data from being loaded with to_dask(), because one entry in the catalogue is just a symlink to the other file.
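
To illustrate why both entries refer to the same data, here is a minimal sketch using only the Python standard library. It builds a throwaway directory tree that mimics the files/ and version-directory layout shown in the listing, then collapses paths that resolve to the same underlying file. The helper names (dedupe_by_realpath, build_demo_tree) and the file name sifb_demo.nc are hypothetical, and this is not how the catalogue itself is built.

```python
import os
import tempfile


def dedupe_by_realpath(paths):
    """Keep one path per underlying file, preferring the first one seen."""
    seen = set()
    unique = []
    for p in paths:
        real = os.path.realpath(p)  # follows symlinks to the target file
        if real not in seen:
            seen.add(real)
            unique.append(p)
    return unique


def build_demo_tree(root):
    """Create a hypothetical layout: a real file plus a relative symlink to it."""
    files_dir = os.path.join(root, "gn", "files", "d20200817")
    version_dir = os.path.join(root, "gn", "v20200817")
    os.makedirs(files_dir)
    os.makedirs(version_dir)

    target = os.path.join(files_dir, "sifb_demo.nc")
    with open(target, "w") as f:
        f.write("placeholder")

    # Relative link, in the same style as the listing: ../files/d20200817/<name>
    link = os.path.join(version_dir, "sifb_demo.nc")
    os.symlink(os.path.join("..", "files", "d20200817", "sifb_demo.nc"), link)
    return [target, link]


with tempfile.TemporaryDirectory() as root:
    paths = build_demo_tree(root)
    unique = dedupe_by_realpath(paths)
    print(len(paths), "paths,", len(unique), "unique file(s)")
```

Resolving with os.path.realpath is one possible way to spot such duplicates when you control the file listing; it needs a filesystem that supports symlinks.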

Additional context

A workaround is to add an extra filter term that matches only one of the entries, e.g.:

path=".*gn/files.*"
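
As a rough check of what that pattern selects, the sketch below applies it to the two paths from the output above using Python's re module. It assumes the search term is treated as a regular expression matched against each path; it does not call the catalogue's own search.

```python
import re

# The two paths returned by the search above, split up to keep lines short.
BASE = (
    "/g/data/fs38/publications/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/"
    "historical/r1i1p1f1/SImon/sifb/gn"
)
NAME = "sifb_SImon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc"

paths = [
    f"{BASE}/files/d20200817/{NAME}",  # the real file
    f"{BASE}/v20200817/{NAME}",        # the symlink
]

pattern = re.compile(r".*gn/files.*")

# Keep only the paths that contain "gn/files" somewhere.
kept = [p for p in paths if pattern.search(p)]

for p in kept:
    print(p)
```

Only the entry under gn/files/ is kept, so the symlinked copy in the version directory drops out.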

dougiesquire commented 2 months ago

Thanks @anton-seaice. We don't actually build this datastore (or any of the other CMIP datastores). Instead, we just include them as built by NCI. The fact that they contain both symlinks and hardlinks is a (somewhat annoying) feature of the NCI CMIP datastores. Fortunately, there is a column in these datastores that indicates whether each file is a hard link or a symlink, so the workaround is to include that column in your filter:

cat["cmip6_fs38"].search(
    realm="seaIce", 
    frequency="mon", 
    member_id="r1i1p1f1", 
    experiment_id="historical", 
    source_id="ACCESS-CM2",
    variable_id="sifb",
    file_type="f"
).df.path
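
For readers unfamiliar with how a column filter like this removes the duplicate, here is a small in-memory sketch. The rows, the search helper, and in particular the "l" label for the symlink row are assumptions made for illustration; only file_type="f" comes from the example above.

```python
# Paths taken from the search output earlier in this thread.
BASE = (
    "/g/data/fs38/publications/CMIP6/CMIP/CSIRO-ARCCSS/ACCESS-CM2/"
    "historical/r1i1p1f1/SImon/sifb/gn"
)
NAME = "sifb_SImon_ACCESS-CM2_historical_r1i1p1f1_gn_185001-201412.nc"

# Hypothetical rows standing in for the datastore table.
records = [
    {"path": f"{BASE}/files/d20200817/{NAME}", "file_type": "f"},
    {"path": f"{BASE}/v20200817/{NAME}", "file_type": "l"},  # "l" is assumed
]


def search(rows, **criteria):
    """Return the rows whose columns equal every given criterion."""
    return [
        row
        for row in rows
        if all(row.get(key) == value for key, value in criteria.items())
    ]


matches = search(records, file_type="f")

for row in matches:
    print(row["path"])
```

Filtering on the column leaves a single row per file, without relying on the directory naming the way a path pattern does.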

anton-seaice commented 2 months ago

Ah right - I guess you've already talked to them / someone at NCI about changing it?

dougiesquire commented 2 months ago

Yeah - I don't think that's an option