Open mankoff opened 2 years ago
Secondary (organizational) folders
Datalad (i.e. git-annex) stores file content under hash-based keys; symlinks to that content then replace the files in the hierarchy. This means that you can simply copy a symlink to another directory and register it with `git-annex add`. I do that all the time.
Since git-annex and datalad also run on Windows, I think that this is fully supported. The only problem is with FAT file systems, which are still used on USB disks and sticks.
So to answer the question: do it all. The files can live in several hierarchies at once, which makes things much simpler. Datasets can be ordered by name, project, organization, etc.
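To make the idea concrete, here is a minimal sketch of a content-addressed store with symlinks, written with the Python standard library only. It is not how git-annex lays out its objects or names its keys; the `objects` folder, the SHA-256 file names, and the `store`, `link`, and `demo` helpers are all assumptions made for illustration. It only shows why one stored file can appear in several hierarchies at once.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def store(objects: Path, source: Path) -> Path:
    """Move a file into a hash-named store and return its stored path."""
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    objects.mkdir(parents=True, exist_ok=True)
    target = objects / digest
    if target.exists():
        source.unlink()  # identical content is already stored
    else:
        source.replace(target)
    return target


def link(target: Path, link_path: Path) -> None:
    """Expose stored content at link_path through a symlink."""
    link_path.parent.mkdir(parents=True, exist_ok=True)
    link_path.symlink_to(target)


def demo() -> dict:
    """Store one file and expose it in two hierarchies (hypothetical names)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        incoming = root / "incoming.csv"
        incoming.write_text("time,discharge\n2020,1.0\n")
        stored = store(root / "objects", incoming)
        in_db = root / "db" / "doi_abc" / "data.csv"
        in_author = root / "author" / "mankoff_2020" / "data.csv"
        link(stored, in_db)
        link(stored, in_author)
        return {
            "same_file": os.path.samefile(in_db, in_author),
            "stored_objects": len(list((root / "objects").iterdir())),
            "content": in_author.read_text(),
        }
```

Calling `demo()` creates two paths that resolve to the same single stored object, which is the property the comment above relies on. Symlink creation may need extra privileges on Windows, so treat this as a Linux/macOS illustration.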
In addition to Base Folder Org and Secondary Folder Org (see above), we also need to organize the development environment. I believe each datalad dataset is its own git repository. They can then be organized or nested into groups (for example, `db`, `author`, `project`, and `organization`). These groups will then be organized/nested into the final `cryo-data` repository.
This means our cryo-data organization (https://github.com/cryo-data/) will soon have 10s or 100s of repositories. That's fine, but it is worth pointing out. GitLab allows nested projects within an organization, so we could have a `db` project to keep things tidy at the top-level namespace. I think the social/community benefits of using GitHub outweigh the messiness of the flat, one-level-deep organization we are limited to here. It is only a few of us, and only tech-savvy people, who will be viewing the organization anyway. Most users will be directed to the `cryo-data/` repository (top-level parent repository) within the `cryo-data` Organization (that repository does not yet exist). But if we are going to move to GitLab, we should do it now, not later.
Vote: 👍 stay here, 👎 switch
I'm not sure I understand how this will work. How will a user choose or download/install different repositories?
It might be worth doing a concrete example such that we really understand what's going on, and how this can be made as simple as possible.
I'm picturing one entry point for the user: `cryo-data`. A `datalad install cryo-data` would produce this folder structure:
cryo-data
├── author
│   ├── lüthi_2002
│   ├── mankoff_2020
│   └── mankoff_2021
├── db
│   ├── doi_abc
│   ├── doi_bar
│   ├── doi_egf
│   ├── doi_etc
│   └── doi_foo
├── org
│   ├── PROMICE
│   ├── solid_ice_discharge
│   └── watson_river
└── project
    ├── PROMICE
    ├── solid_ice_discharge
    └── van_as_2099
Where everything must be in `db`, but may not exist elsewhere, and possibly `db/doi_abc` is cloned to `author/mankoff_2020` and also `project/PROMICE/solid_ice_discharge`.
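The rule that everything lives in `db` and is optionally cloned elsewhere can be sketched as a small planning step. This is only an illustration: the `placements` mapping (dataset name to the extra folders it should appear in) and the `clone_plan` function are hypothetical, and nothing here calls datalad. It just produces the list of (source, destination) pairs that a later cloning script could act on.

```python
from pathlib import PurePosixPath
from typing import Dict, List, Tuple


def clone_plan(
    placements: Dict[str, List[str]], base: str = "db"
) -> List[Tuple[str, str]]:
    """Return (source, destination) pairs for every secondary placement.

    Every dataset is assumed to live under `base`; destinations must be
    outside `base`, because `base` is the only place originals are kept.
    """
    plan = []
    for name, destinations in sorted(placements.items()):
        source = str(PurePosixPath(base) / name)
        for dest in destinations:
            parts = PurePosixPath(dest).parts
            if not parts or parts[0] == base:
                raise ValueError(f"invalid secondary location: {dest!r}")
            plan.append((source, dest))
    return plan


if __name__ == "__main__":
    # Hypothetical placements, using the names from the tree above.
    example = {
        "doi_abc": ["author/mankoff_2020", "project/PROMICE/solid_ice_discharge"],
        "doi_foo": [],  # exists only in db
    }
    for source, dest in clone_plan(example):
        print(f"clone {source} -> {dest}")
```

Keeping the plan separate from the cloning itself means the placement rules can be reviewed and tested before any repository is touched.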
This is what the user sees. As we develop it though, each folder is a datalad dataset, and therefore each folder is a git repository in the `cryo-data` organization, unless a project builds their own datalad dataset, in which case we just clone that into our `cryo-data`. For example, if UHZ has a dataset (made of nested datasets), then we can `datalad clone UHZ` into our `org` folder. We have to be careful, because if UHZ includes `cryo-data`, then we've set up a circular reference.
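The circular-reference risk could be checked before cloning anything. The sketch below assumes we can describe the nesting as a plain mapping from each dataset name to the datasets it contains; that mapping and the `find_cycle` function are hypothetical and not part of datalad. It uses a depth-first search and reports the first loop it finds.

```python
from typing import Dict, List, Optional


def find_cycle(nesting: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one circular chain of dataset names, or None if there is none."""
    UNSEEN, ACTIVE, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = ACTIVE
        path.append(node)
        for child in nesting.get(node, []):
            child_state = state.get(child, UNSEEN)
            if child_state == ACTIVE:
                # The child is already on the current path: a loop.
                return path[path.index(child):] + [child]
            if child_state == UNSEEN:
                cycle = visit(child)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in nesting:
        if state.get(node, UNSEEN) == UNSEEN:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


if __name__ == "__main__":
    # Hypothetical nesting in which UHZ includes cryo-data again.
    looped = {"cryo-data": ["org"], "org": ["UHZ"], "UHZ": ["cryo-data"]}
    print(find_cycle(looped))
```

For the looped example this reports the chain from `cryo-data` through `org` and `UHZ` back to `cryo-data`; for a nesting without loops it returns `None`.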
This is 1) just an idea of how we might organize things and 2) based on my still-basic understanding of datalad and git-annex. I may be imagining something wrong, incorrect, inefficient, etc.
If the above set of nested datasets is installed from `datalad install cryo-data`, then to answer your question, the user chooses different repositories by `datalad get org/PROMICE`.
Also, because everything is nestable, users could just `datalad install PROMICE` and skip the `cryo-data/org/` part of the tree.
It might be worth doing a concrete example such that we really understand what's going on, and how this can be made as simple as possible.
I agree. I will create ~5 datasets as I imagine them above. This will also let us start concretely addressing the #6 metadata issue. Anyone else who wants to can also do the same, and we can then discuss/compare/contrast different approaches.
At our launch meeting, the suggestion was to organize by DOI, and then link from the other organizing folders to the main DOI folder. This issue is to discuss and refine this suggestion.
Base folder organization
How do we handle a DOI such as `10.1175/1520-0442(2004)017<1123:RVIUDT>2.0.CO;2`? On my Linux box, from trial and error, all of those characters are valid in a directory name except `/`. I'm not sure about POSIX compliance yet, though, nor about other OSes.
DOIs can contain characters that are not valid in folder names (e.g. `/`). We should define a regex to fix this. Is there a character that can be used to replace the OS-invalid characters in such a way that we can revert back to the true DOI? For example, if space (` `) is not valid in a DOI, then replacing `/` with ` ` allows the operation to be undone (note that spaces in folder names are a pain). Perhaps we store things using a DOI-ish string that cannot be reverted back to the original DOI?
Should this folder be named `db`, or `data`, or `main` rather than `doi`?
Secondary (organizational) folders
Linking (`ln` on Linux, Aliases on OS X, something else on Windows) from other organizing folders is an OS-specific operation. Perhaps we should be cloning the repositories from the `db` folder (name TBD) to the other folders? The workflow would then be: generic and custom ingest scripts build a large `db` folder of datalad datasets, then we tag (with metadata?) which other folders each dataset should appear in and under what name, and then a simple script clones from `db/foo` to `author/bar_YYYY` and `project/BazName` after anything new gets added to the `db` folder (?).
Which other folders should we use? I suggest at least:
However, within these folders we need meaningful names. How do we decide these? For example, my total mass balance is `db/10.22008_slash_FK2_slash_OHI23Z/`. But what do I name it elsewhere and how do we decide on those names? Presumably `author/mankoff_2021`, but what about `project/PROMICE/TotalMassBalance` or is it also `project/PROMICE/mankoff_2021`?
Perhaps the point of the metadata search capabilities is to solve this.
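On the earlier question of a reversible DOI-to-folder-name mapping, one option is percent-encoding, sketched below with Python's standard `urllib.parse`. This is an alternative to the `_slash_` substitution shown above, not something agreed in this thread; the function names `doi_to_folder` and `folder_to_doi` are hypothetical. As far as I know `%` is accepted in file names on the major OSes, but that is an assumption worth verifying before adopting this.

```python
from urllib.parse import quote, unquote


def doi_to_folder(doi: str) -> str:
    """Percent-encode every character outside letters, digits, and _.-~"""
    return quote(doi, safe="")


def folder_to_doi(name: str) -> str:
    """Undo doi_to_folder and recover the original DOI."""
    return unquote(name)


if __name__ == "__main__":
    for doi in (
        "10.22008/FK2/OHI23Z",
        "10.1175/1520-0442(2004)017<1123:RVIUDT>2.0.CO;2",
    ):
        folder = doi_to_folder(doi)
        # The round trip must give back exactly the DOI we started with.
        assert folder_to_doi(folder) == doi
        print(folder)
```

The mapping is reversible because a literal `%` in the input is itself encoded, so no two DOIs can produce the same folder name. The trade-off is readability: `10.22008%2FFK2%2FOHI23Z` is less friendly than a `_slash_` style name, although a `_slash_` style name cannot be reverted if a DOI ever contains that literal text.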