Architecture and organization of cryo-data folders and datasets

mankoff commented 2 years ago

At our launch meeting, the suggestion was to organize by DOI, and then link from other organizing folders to the main DOI folder. This issue is to discuss and refine this suggestion.

Base folder organization

Organizing by DOI has two issues, both solvable:
- Not all datasets have DOIs. They should; we can create DOIs (Zenodo?) for those that don't, or use the HTTP URL if there is no DOI.
- How do we convert DOIs to folders?
- What about Journal of Climate papers (and maybe datasets) with DOIs like 10.1175/1520-0442(2004)017<1123:RVIUDT>2.0.CO;2 ? On my Linux box from trial-and-error, all those are valid in a directory name except /. I'm not sure about POSIX-compliance yet though. Nor other OSes.
- DOIs do contain characters that cannot be included in folder names (i.e. `/`). We should define a regex to fix this. Is there a character that can be used to replace the OS-invalid characters in such a way that we can revert back to the true DOI? For example, if space (` `) is not valid in a DOI, then replacing `/` with ` ` allows the operation to be undone (note that spaces in folder names are a pain). Perhaps we store things using a DOI-ish string that cannot be reverted back to the original DOI?
- Or do we build nested folders? For example, my latest product has DOI 10.22008/FK2/OHI23Z. Should we use that as the folder structure, so that the data is in OHI23Z?
- In any case, if we aren't using the actual DOI, maybe this main folder should be called something different, like db, or data, or main rather than doi?
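
One reversible option for the DOI-to-folder question is percent-encoding, as used in URLs: every character outside a small safe set becomes %XX, so / becomes %2F and the original DOI can always be recovered. A minimal sketch in Python, assuming we are fine with % in folder names. The function names doi_to_folder and folder_to_doi are made up for illustration, and this has not been checked against every filesystem we care about:

```python
from urllib.parse import quote, unquote


def doi_to_folder(doi: str) -> str:
    """Encode a DOI as a single folder name (hypothetical helper).

    With safe="" everything except letters, digits and a few
    punctuation characters such as "_", "." and "-" is percent-encoded.
    That includes "/", so the result is one path component.
    """
    return quote(doi, safe="")


def folder_to_doi(folder: str) -> str:
    """Invert doi_to_folder and recover the original DOI."""
    return unquote(folder)


if __name__ == "__main__":
    examples = (
        "10.22008/FK2/OHI23Z",
        "10.1175/1520-0442(2004)017<1123:RVIUDT>2.0.CO;2",
    )
    for doi in examples:
        folder = doi_to_folder(doi)
        assert "/" not in folder            # safe as a single folder name
        assert folder_to_doi(folder) == doi  # fully reversible
        print(doi, "->", folder)
```

With this scheme 10.22008/FK2/OHI23Z becomes 10.22008%2FFK2%2FOHI23Z. The folder names are not pretty, but no custom regex or lookup table is needed to get back to the true DOI.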

Secondary (organizational) folders

Linking (ln on Linux, Aliases on OS X, something else on Windows) from other organizing folders is an OS-specific operation. Perhaps we should be cloning the repositories from the db folder (name TBD) to the other folders? The workflow would then be generic: custom ingest scripts build a large db folder of datalad datasets, then we tag (with metadata?) which other folders each dataset should appear in and under what name, and then a simple script clones from db/foo to author/bar_YYYY and project/BazName after anything new gets added to the db folder (?).
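
To make the "tag, then clone" idea concrete, here is a rough Python sketch. Everything in it is an assumption for discussion: the TAGS dictionary stands in for whatever metadata format we settle on, and plan_clones and run_plan are hypothetical names. The only real command used is datalad clone SOURCE PATH, and by default the script only prints what it would run:

```python
import subprocess
from pathlib import PurePosixPath

# Hypothetical tags: db folder name -> other places the dataset should appear.
# In practice these would come from whatever metadata we decide on.
TAGS = {
    "foo": ["author/bar_YYYY", "project/BazName"],
}


def plan_clones(tags, db_root="db"):
    """Return (source, destination) pairs, one per secondary location."""
    plan = []
    for db_name, targets in sorted(tags.items()):
        source = PurePosixPath(db_root) / db_name
        for target in targets:
            plan.append((str(source), target))
    return plan


def run_plan(plan, dry_run=True):
    """Print each clone command; execute it only when dry_run is False."""
    for source, destination in plan:
        command = ["datalad", "clone", source, destination]
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_plan(plan_clones(TAGS))
```

Separating the plan from its execution means the naming decisions can be reviewed as plain text before anything is cloned.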

Which other folders should we use? I suggest at least:

- author (e.g. author/mankoff_2020, author/mankoff_2021)
- project (e.g. project/PROMICE, project/Thwaites)
- organization (e.g. org/GEUS, org/NSIDC, org/UniversityOfColoradoBoulder, maybe org/PROMICE appears here too for example)

However, within these folders we need meaningful names. How do we decide these? For example, my total mass balance is db/10.22008_slash_FK2_slash_OHI23Z/. But what do I name it elsewhere and how do we decide on those names? Presumably author/mankoff_2021, but what about project/PROMICE/TotalMassBalance or is it also project/PROMICE/mankoff_2021?

Perhaps the point of the metadata search capabilities is to solve this.

MartinLuethi commented 2 years ago

Secondary (organizational) folders

Datalad (i.e. git-annex) uses hashes to store the files. Symlinks to these hashed files then replace the files in the hierarchy. This means that you can simply copy a symlink to another directory and register it with git-annex add. I do that all the time.
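
For anyone who has not looked inside an annex, the idea can be imitated in a few lines of Python. This is a toy model only, not git-annex's actual directory layout or key format: the objects folder and the store and link helpers are invented for illustration, and it uses absolute symlink targets for simplicity. It needs a filesystem that supports symlinks.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def store(root: Path, data: bytes) -> Path:
    """Save data once under a name derived from its SHA-256 hash."""
    objects = root / "objects"
    objects.mkdir(parents=True, exist_ok=True)
    target = objects / hashlib.sha256(data).hexdigest()
    if not target.exists():
        target.write_bytes(data)
    return target


def link(target: Path, name: Path) -> None:
    """Make `name` a symlink that points at the stored object."""
    name.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(target, name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        obj = store(root, b"discharge,2021\n")
        # The same content appears in two hierarchies but is stored once.
        link(obj, root / "db" / "doi_abc" / "data.csv")
        link(obj, root / "author" / "mankoff_2021" / "data.csv")
        for path in sorted(root.rglob("data.csv")):
            print(path.relative_to(root), "->", path.resolve().name[:12])
```

Both data.csv entries resolve to the same stored object, which is why a file can live in several hierarchies without being duplicated.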

Since git-annex and datalad also run on Windows, I think that this is fully supported. The only problem is with FAT file systems that still are on USB disks and sticks.

So to answer the question: do it all. The files can live in different hierarchies at once, which makes things much simpler. Datasets can be ordered by name, project, organization, etc.

mankoff commented 2 years ago

In addition to Base Folder Org and Secondary Folder Org (see above), we also need to organize the development environment. I believe each datalad dataset is its own git repository. They can then be organized or nested into groups (for example, db, author, project, and organization). These groups will then be organized/nested into the final cryo-data repository.

This means our cryo-data organization (https://github.com/cryo-data/) will soon have 10s or 100s of repositories. That's fine, but it is worth pointing out. GitLab allows nested projects within an organization, so we could have a db project to keep things tidy at the top-level namespace. I think the social/community benefits of using GitHub outweigh the messiness of the flat, 1-level-deep organization it limits us to. It is only a few of us, and only tech-savvy people, who will be viewing the organization anyway. Most users will be directed to the cryo-data/ repository (top-level parent repository) within the cryo-data Organization (that repository does not yet exist). But if we are going to move to GitLab, we should do it now, not later.

Vote: 👍🏿 stay here 🚀 switch

MartinLuethi commented 2 years ago

I'm not sure I understand how this will work. How will a user choose or download/install different repositories? It might be worth doing a concrete example such that we really understand what's going on, and how this can be made as simple as possible.

mankoff commented 2 years ago

I'm picturing one entry point for the user: cryo-data. A datalad install cryo-data would produce this folder structure:

cryo-data
├── author
│   ├── lüthi_2002
│   ├── mankoff_2020
│   └── mankoff_2021
├── db
│   ├── doi_abc
│   ├── doi_bar
│   ├── doi_egf
│   ├── doi_etc
│   └── doi_foo
├── org
│   └── PROMICE
│       ├── solid_ice_discharge
│       └── watson_river
└── project
    └── PROMICE
        ├── solid_ice_discharge
        └── van_as_2099

Where everything must be in db, but may not exist elsewhere, and possibly db/doi_abc is cloned to author/mankoff_2020 and also project/PROMICE/solid_ice_discharge.

This is what the user sees. As we develop it though, each folder is a datalad dataset, and therefore each folder is a git repository in the cryo-data organization, unless a project builds their own datalad dataset in which case we just clone that into our cryo-data. For example, if UHZ has a dataset (made of nested datasets), then we can datalad clone UHZ into our org folder. We have to be careful, because if UHZ includes cryo-data, then we've set up a circular reference.
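
The circular-reference risk could be checked automatically before adding a new nested dataset. A small sketch, assuming we can list which datasets each dataset contains (how to extract that list from datalad is left open here). The find_cycle function and the example mapping are hypothetical:

```python
def find_cycle(includes):
    """Return one cycle as a list of names, or None if there is none.

    `includes` maps each dataset name to the datasets nested inside it.
    Uses a depth-first search that tracks the datasets on the current path.
    """
    state = {}   # name -> "visiting" (on current path) or "done"
    stack = []   # the current path from the starting dataset

    def visit(node):
        state[node] = "visiting"
        stack.append(node)
        for child in includes.get(node, ()):
            if state.get(child) == "visiting":
                # child is already on the current path: we looped back.
                return stack[stack.index(child):] + [child]
            if child not in state:
                cycle = visit(child)
                if cycle:
                    return cycle
        stack.pop()
        state[node] = "done"
        return None

    for node in list(includes):
        if node not in state:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    nesting = {
        "cryo-data": ["org"],
        "org": ["UHZ"],
        "UHZ": ["cryo-data"],  # UHZ includes cryo-data: circular
    }
    print(find_cycle(nesting))
```

An ingest script could run a check like this and refuse to add a dataset when the result is not None.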

This is 1) just an idea of how we might organize things and 2) based on my still-basic understanding of datalad and git-annex. I may be imagining something wrong, incorrect, inefficient, etc.

mankoff commented 2 years ago

If the above set of nested datasets is installed from datalad install cryo-data, then to answer your question, the user chooses different repositories by datalad get org/PROMICE.

Also, because everything is nestable, users could just datalad install PROMICE and skip the cryo-data/org/ part of the tree.

mankoff commented 2 years ago

> It might be worth doing a concrete example such that we really understand what's going on, and how this can be made as simple as possible.

I agree. I will create ~5 datasets as I imagine them above. This will also let us start concretely addressing the #6 metadata issue. Anyone else who wants to can also do the same, and we can then discuss/compare/contrast different approaches.


Architecture and organization of cryo-data folders and datasets #5
