Discussions for cryo-data
Architecture and organization of cryo-data folders and datasets #5

Open mankoff opened 2 years ago

mankoff commented 2 years ago

At our launch meeting, the suggestion was to organize by DOI, and then link from other organizing folders to the main DOI folder. This issue is to discuss and refine this suggestion.

Base folder organization

Secondary (organizational) folders

Linking (ln on Linux, Aliases on OS X, something else on Windows) from other organizing folders is an OS-specific operation. Perhaps we should instead clone the repositories from the db folder (name TBD) into the other folders? The workflow would then be generic: custom ingest scripts build a large db folder of datalad datasets, we tag (with metadata?) which other folders each dataset should appear in and under what name, and a simple script then clones from db/foo to author/bar_YYYY and project/BazName after anything new gets added to the db folder (?).
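The tag-then-clone idea could be sketched roughly like this. It is only a sketch under assumptions: the `placements` mapping (which secondary folders each db dataset should appear in, and under what name) is a hypothetical stand-in for whatever metadata we settle on, and the function only plans the clones, it does not run them.

```python
from pathlib import PurePosixPath


def plan_clones(placements, existing=()):
    """Return (source, destination) pairs for clones that still need to be made.

    placements: dict mapping a dataset folder name under db/ to a list of
                secondary paths (hypothetical metadata, e.g. "author/mankoff_2021").
    existing:   secondary paths that are already present and can be skipped.
    """
    done = set(existing)
    plan = []
    for db_name in sorted(placements):
        source = str(PurePosixPath("db") / db_name)
        for dest in placements[db_name]:
            if dest not in done:
                plan.append((source, dest))
    return plan


# Hypothetical tags for one dataset; the secondary names are still undecided.
placements = {
    "10.22008_slash_FK2_slash_OHI23Z": [
        "author/mankoff_2021",
        "project/PROMICE/TotalMassBalance",
    ],
}

for source, dest in plan_clones(placements, existing=["author/mankoff_2021"]):
    # Each pair would be handed to something like `datalad clone <source> <dest>`.
    print(f"clone {source} -> {dest}")
```

Keeping the planning step separate from the actual clone call means the same plan works on any OS, which is the point of avoiding OS-specific links.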

Which other folders should we use? I suggest at least:

However, within these folders we need meaningful names. How do we decide these? For example, my total mass balance is db/10.22008_slash_FK2_slash_OHI23Z/. But what do I name it elsewhere and how do we decide on those names? Presumably author/mankoff_2021, but what about project/PROMICE/TotalMassBalance or is it also project/PROMICE/mankoff_2021?
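For the db folder itself, the name can be derived mechanically from the DOI. A minimal sketch, assuming the folder name above comes from the DOI 10.22008/FK2/OHI23Z with each "/" replaced by "_slash_" (the function names are made up for illustration):

```python
def doi_to_folder(doi):
    """Make a DOI usable as a folder name by replacing each '/'."""
    return doi.replace("/", "_slash_")


def folder_to_doi(folder):
    """Reverse doi_to_folder; assumes the DOI itself never contains '_slash_'."""
    return folder.replace("_slash_", "/")


folder = doi_to_folder("10.22008/FK2/OHI23Z")
print(folder)
print(folder_to_doi(folder))
```

The names in author/, project/ and so on cannot be derived this way, which is why they need a convention or metadata.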

Perhaps the point of the metadata search capabilities is to solve this.

MartinLuethi commented 2 years ago

Secondary (organizational) folders

Datalad (i.e. git-annex) stores file contents under their hashes. Symlinks to these hashed objects then replace the files in the hierarchy. This means that you can simply copy such a symlink to another directory and register it with git annex add. I do that all the time.
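To illustrate the idea (not git-annex's actual on-disk layout), here is a toy content-addressed store: the content is written once under its hash, and every place the file appears in the hierarchy is a symlink to that one object. The `objects` directory and the `toy_add` function are invented for this sketch, and it assumes a file system that supports symlinks.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def toy_add(root, rel_path, content):
    """Store content under objects/<sha256> and place a symlink at rel_path."""
    key = hashlib.sha256(content).hexdigest()
    obj = Path(root) / "objects" / key
    obj.parent.mkdir(parents=True, exist_ok=True)
    if not obj.exists():
        # Content is written only once, no matter how many paths refer to it.
        obj.write_bytes(content)
    link = Path(root) / rel_path
    link.parent.mkdir(parents=True, exist_ok=True)
    # Relative target, so the link stays valid if the whole tree is moved.
    link.symlink_to(os.path.relpath(obj, link.parent))
    return link


root = tempfile.mkdtemp()
a = toy_add(root, "db/doi_abc/data.csv", b"x,y\n1,2\n")
b = toy_add(root, "author/mankoff_2020/data.csv", b"x,y\n1,2\n")

# Both paths resolve to the same stored object.
print(a.resolve() == b.resolve())
```

The same file can therefore appear under db/, author/ and project/ without being stored more than once.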

Since git-annex and datalad also run on Windows, I think that this is fully supported. The only problem is with FAT file systems, which are still used on USB disks and sticks.

So to answer the question: do it all. The files can live in different hierarchies at once, which makes things much simpler. Datasets can be ordered by name, project, organization etc.

mankoff commented 2 years ago

In addition to Base Folder Org and Secondary Folder Org (see above), we also need to organize the development environment. I believe each datalad dataset is its own git repository. They can then be organized or nested into groups (for example, db, author, project, and organization). These groups will then be organized/nested into the final cryo-data repository.

This means our cryo-data organization (https://github.com/cryo-data/) will soon have tens or hundreds of repositories. That's fine, but it is worth pointing out. GitLab allows nested projects within an organization, so there we could have a db project to keep the top-level namespace tidy. I think the social/community benefits of using GitHub outweigh the messiness of the flat, 1-level-deep organization that GitHub limits us to. It is only a few of us, and only tech-savvy people, who will be viewing the organization anyway. Most users will be directed to the cryo-data/ repository (top-level parent repository) within the cryo-data Organization (that repository does not yet exist). But if we are going to move to GitLab, we should do it now, not later.

Vote: 👍🏿 stay here 🚀 switch

MartinLuethi commented 2 years ago

I'm not sure I understand how this will work. How will a user choose or download/install different repositories? It might be worth doing a concrete example such that we really understand what's going on, and how this can be made as simple as possible.

mankoff commented 2 years ago

I'm picturing one entry point for the user: cryo-data. A datalad install cryo-data would produce this folder structure:

cryo-data
├── author
│   ├── lüthi_2002
│   ├── mankoff_2020
│   └── mankoff_2021
├── db
│   ├── doi_abc
│   ├── doi_bar
│   ├── doi_egf
│   ├── doi_etc
│   └── doi_foo
├── org
│   └── PROMICE
│       ├── solid_ice_discharge
│       └── watson_river
└── project
    └── PROMICE
        ├── solid_ice_discharge
        └── van_as_2099

Where everything must be in db, but may not exist elsewhere, and possibly db/doi_abc is cloned to author/mankoff_2020 and also project/PROMICE/solid_ice_discharge.

This is what the user sees. As we develop it, though, each folder is a datalad dataset, and therefore each folder is a git repository in the cryo-data organization, unless a project builds its own datalad dataset, in which case we just clone that into our cryo-data. For example, if UHZ has a dataset (made of nested datasets), then we can datalad clone UHZ into our org folder. We have to be careful, because if UHZ includes cryo-data, then we've set up a circular reference.
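A check for such circular references could look roughly like this. It is a sketch that assumes we can list, for each dataset, the datasets nested directly inside it; the `nested` mapping here is made up for illustration and the function does not talk to datalad at all.

```python
def find_cycle(nested, start):
    """Return a list of dataset names forming a cycle reachable from start, or None.

    nested: dict mapping a dataset name to the datasets it directly contains.
    """
    path = []          # datasets on the current walk, in order
    on_path = set()    # same datasets, for fast membership tests
    finished = set()   # datasets already fully explored without finding a cycle

    def visit(name):
        if name in on_path:
            # We came back to a dataset we are still inside of: a cycle.
            return path[path.index(name):] + [name]
        if name in finished:
            return None
        on_path.add(name)
        path.append(name)
        for child in nested.get(name, ()):
            cycle = visit(child)
            if cycle:
                return cycle
        path.pop()
        on_path.discard(name)
        finished.add(name)
        return None

    return visit(start)


# Hypothetical nesting in which UHZ includes cryo-data again.
nested = {
    "cryo-data": ["org"],
    "org": ["UHZ"],
    "UHZ": ["cryo-data"],
}
print(find_cycle(nested, "cryo-data"))
```

Running something like this before adding an external dataset would flag the UHZ case before it ends up in the tree.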

This is 1) just an idea of how we might organize things and 2) based on my still-basic understanding of datalad and git-annex. I may be imagining something wrong, incorrect, inefficient, etc.

mankoff commented 2 years ago

If the above set of nested datasets is installed from datalad install cryo-data, then to answer your question, the user chooses different repositories by datalad get org/PROMICE.

Also, because everything is nestable, users could just datalad install PROMICE and skip the cryo-data/org/ part of the tree.

mankoff commented 2 years ago

It might be worth doing a concrete example such that we really understand what's going on, and how this can be made as simple as possible.

I agree. I will create ~5 datasets as I imagine them above. This will also let us start concretely addressing the #6 metadata issue. Anyone else who wants to can also do the same, and we can then discuss/compare/contrast different approaches.