carbonplan / cmip6-downscaling

Climate downscaling using CMIP6 data
https://docs.carbonplan.org/cmip6-downscaling
MIT License

Data organization #45

Open orianac opened 2 years ago

orianac commented 2 years ago

As a way to offload my thinking about caching/data store organization/clean-up routines (@norlandrhagen @jhamman @tcchiao), here are some thoughts about how the data ins and outs for the bcsd workflow are organized currently and how that system could be improved. Hopefully this can complement whatever scheme you're setting up!

norlandrhagen commented 2 years ago

Great thoughts @orianac! @jhamman and I started thinking about the organization of the azure directory a bit. In cleanup helper functions, rechunker results could be cleaned up at the end of a flow with fsspec. Isolating the 'temp' products in a read/write permissions bucket might save us from accidental deletions of obs/gcm data.
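
A minimal sketch of the kind of cleanup helper described above, assuming the filesystem object exposes fsspec-style `exists` and `rm` methods. The `scratch/rechunker_results/` prefix and the function name are hypothetical, not taken from the repo. The guard refuses to delete anything outside the temp prefix, which is one way to protect obs/gcm data from accidental deletion.

```python
import posixpath
from typing import Protocol


class SupportsRm(Protocol):
    """The small slice of the fsspec filesystem interface this helper needs."""

    def exists(self, path: str) -> bool: ...
    def rm(self, path: str, recursive: bool = False) -> None: ...


# Hypothetical prefix for disposable products; adjust to the real layout.
SCRATCH_PREFIX = "scratch/rechunker_results/"


def cleanup_temp_paths(
    fs: SupportsRm, paths: list[str], prefix: str = SCRATCH_PREFIX
) -> list[str]:
    """Delete temporary stores, refusing anything outside the scratch prefix."""
    # Normalize first so "scratch/../inputs/x" cannot slip past the check.
    normalized = [posixpath.normpath(p) for p in paths]
    for path in normalized:
        if not path.startswith(prefix):
            raise ValueError(f"refusing to delete {path!r}: outside {prefix!r}")

    removed = []
    for path in normalized:
        if fs.exists(path):  # paths that are already gone are skipped
            fs.rm(path, recursive=True)
            removed.append(path)
    return removed
```

At the end of a flow this could be called with a real fsspec filesystem, for example `cleanup_temp_paths(fsspec.filesystem("memory"), temp_paths)` when testing locally. All paths are validated before any deletion happens, so one bad path aborts the whole call.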

A very rough schema:

Would love some input!

tcchiao commented 2 years ago

Curious how we're thinking about differentiating between scratch/intermediate_products, scratch/rechunker_results, and prefect buckets? For example, is prefect/ for storing final results only? Should all disposable intermediate material go into rechunker_results even though they're not from the rechunker?

tcchiao commented 2 years ago

Proposal to have 3 buckets:

  1. intermediary -- intermediary outputs that we might want to inspect (e.g. bias correction results). Files should be in paths that are identifiable (e.g. {gcm}_{obs}_{method}.zarr). No automatic cleaning.
  2. cache -- anything we might want to cache to save time but wouldn't want to inspect (e.g. rechunker output). File paths can be random strings. No automatic cleaning.
  3. scratch -- things that can be deleted after each model run or each week (e.g. rechunker intermediate output). Can have automatic cleaning.
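
The three-bucket proposal above could be sketched as a set of path builders. The function names, the `run_id` argument, and the choice of a parameter hash for cache keys are all assumptions for illustration. The proposal only says cache paths "can be random strings"; a deterministic hash is used here so that a cached product can be found again from its inputs.

```python
import hashlib
import json


def intermediary_path(gcm: str, obs: str, method: str) -> str:
    """Human-readable path for outputs we may want to inspect."""
    return f"intermediary/{gcm}_{obs}_{method}.zarr"


def cache_path(**params) -> str:
    """Opaque but reproducible path derived from the task parameters."""
    # sort_keys makes the key independent of keyword argument order.
    key = json.dumps(params, sort_keys=True)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return f"cache/{digest}.zarr"


def scratch_path(run_id: str, name: str) -> str:
    """Disposable path grouped by run so a whole run can be cleaned at once."""
    return f"scratch/{run_id}/{name}"
```

Grouping scratch output under a per-run prefix is one design option that would let an automatic cleaner remove everything from a finished run with a single recursive delete.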
orianac commented 2 years ago

We also need to put results somewhere. Maybe:

results/
---data/
------daily/
------monthly/
------annual/
---analyses/
------qaqc/
------metrics/

The results/data would probably end up being housed in some other bucket, but for now we could work under that structure. Thoughts?

norlandrhagen commented 2 years ago

Thanks for the feedback. Does this structure capture everything? Everything has read / write / delete permissions, except for the input data.


.
├── flow_outputs (read / write / delete)
│   ├── cache
│   ├── intermediary
│   │   ├── {gcm}_{obs}.zarr
│   │   └── {gcm}_{obs}_{method}.zarr
│   └── scratch
├── inputs (read / write)
│   ├── CMIP6
│   ├── scenariomip
│   ├── CR2MET
│   ├── ERA5
│   └── ERA5_daily
├── prefect (read / write / delete)
└── results (read / write / delete)
    ├── analyses
    │   ├── metrics
    │   └── qaqc
    └── data
        ├── annual
        ├── daily
        └── monthly
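
As a rough sketch, the permissions in the tree above could be written down as a lookup table that helper code consults before acting. The table mirrors the top-level directories and permissions listed in the tree; the function name and the idea of enforcing this in code (rather than through storage credentials) are assumptions.

```python
# Permissions per top-level directory, copied from the tree above.
PERMISSIONS = {
    "flow_outputs": {"read", "write", "delete"},
    "inputs": {"read", "write"},
    "prefect": {"read", "write", "delete"},
    "results": {"read", "write", "delete"},
}


def is_allowed(path: str, action: str) -> bool:
    """Return True if `action` is permitted on the top-level directory of `path`."""
    top_level = path.strip("/").split("/", 1)[0]
    # Unknown top-level directories get no permissions at all.
    return action in PERMISSIONS.get(top_level, set())
```

For example, `is_allowed("inputs/ERA5", "delete")` is false while `is_allowed("flow_outputs/scratch/tmp.zarr", "delete")` is true.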
jhamman commented 2 years ago

Small comment: should the CMIP6 and input_data buckets be read-only?

norlandrhagen commented 2 years ago

They definitely can be! If we wanted to add more input data would we temporarily change the permissions?

jhamman commented 2 years ago

The containers themselves will always have read/write permissions. What we're really talking about is, as a matter of practice, using read-only access methods (credentials or otherwise) to access certain storage spaces.

I think in most cases, we can access the cmip6 and training containers without credentials, which will limit us to read-only. For other containers, we should probably be using custom SAS tokens for specific applications.

jhamman commented 2 years ago

Coming back to this issue, I'd like to get a few things finalized. In particular, there's some additional cleanup needed in the cmip6 container. @norlandrhagen, can you confirm we aren't using the following paths and then remove them:

Then can we work on moving the following paths to the training container:

These last three will require some path updates to the data catalogs in this repo as well as the two catalogs listed above.

cc @andersy005 for visibility on the catalog side.