cryo-data / discuss

Discussions for cryo-data

Which tool to load and share metadata? #13

Open AdrienWehrle opened 2 years ago

AdrienWehrle commented 2 years ago

We initially explored datalad, but other options are very interesting too:

datalad

Very powerful because it is built directly on git-annex, but I still haven't fully understood how to use it properly/efficiently. DataLad is a data management system, and only that (to my knowledge). It is very efficient because it concentrates on this one task, but that somewhat limits our application, or calls for combining it with other tools. Which might just be ok.
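For reference, the basic consumer-side DataLad workflow looks roughly like the sketch below (the dataset URL and file path are placeholders, not a real cryo-data dataset):

```shell
# Clone a dataset: this fetches the repository structure,
# while large file contents stay on the remote (git-annex).
datalad clone https://example.org/cryo-data-dataset my-dataset
cd my-dataset

# Retrieve the actual content of a file on demand
datalad get data/some_file.nc

# Drop the local copy again to free space (content stays retrievable)
datalad drop data/some_file.nc
```

This on-demand `get`/`drop` model is part of what makes DataLad powerful, but also part of what makes it harder for casual users to grasp.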

intake

A simple set of tools, but also powerful. Because it is simple, the community could easily contribute new catalog entries (through YAML files).

Intake is more than just a data management tool: not only is the data download step streamlined, but so is reading the data, through the many drivers available (and it is easy to implement new ones).
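As a sketch of what a community contribution could look like, an Intake catalog entry is just a few lines of YAML (the entry name, URL, and dataset below are hypothetical; the `netcdf` driver comes from the intake-xarray plugin):

```yaml
sources:
  greenland_velocity:
    description: Placeholder entry for an ice velocity product
    driver: netcdf
    args:
      urlpath: https://example.org/cryo-data/greenland_velocity.nc
```

A user would then load it with something like `cat = intake.open_catalog("catalog.yml")` followed by `cat.greenland_velocity.read()`, without needing to know how the download and decoding happen.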

pooch

Simple and similar to Intake, but data sources are not really treated as catalogs. Pooch was developed to download test data for libraries, so we might run into limitations for our metadata portal.
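For comparison, a minimal Pooch setup might look like the following sketch, using its standard `create`/`fetch` pattern (the base URL, file name, and hash are placeholders):

```python
# Minimal Pooch sketch: a registry maps file names to known hashes, and
# fetch() downloads + verifies a file on first use, then caches it locally.
# The URL, file name, and hash below are placeholders, not real data.
REGISTRY = {
    "greenland_velocity.nc": "sha256:" + "0" * 64,
}

try:
    import pooch

    fetcher = pooch.create(
        path=pooch.os_cache("cryo-data"),        # per-user cache directory
        base_url="https://example.org/cryo-data/",
        registry=REGISTRY,
    )
    # path_to_file = fetcher.fetch("greenland_velocity.nc")
except ImportError:
    fetcher = None  # pooch not installed
```

Note that the registry is a flat mapping of file names to hashes; there is no notion of per-dataset metadata or nested catalogs, which is the limitation mentioned above.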

This comparison will be further modified/refined.

AdrienWehrle commented 2 years ago

I see a couple of points that are important to consider for our choice of tool:

Is the simplicity of our backend important? For us, for the users?

a simple backend tool

Datalad is very powerful, but 99% of the users would probably not understand how the metadata portal actually works. The use of a CLI (and not only a GUI) might therefore be limited to only a small fraction of the community.

Intake is powerful but also simple, so it is easier for users to grasp. Because Intake is linked to e.g. Dask, analysis and visualisation are a natural and simple step after the download. Intake is pure Python, with all the implications that has.
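For example, the step from catalog to analysis could look like this sketch (the catalog path and entry name are hypothetical, and a suitable driver plugin such as intake-xarray is assumed):

```python
def load_lazy(catalog_path, entry_name):
    """Open an Intake catalog and return an entry as a lazy Dask-backed object.

    Returns None if intake is unavailable or the catalog cannot be read.
    """
    try:
        import intake

        cat = intake.open_catalog(catalog_path)
        # to_dask() gives a lazy object (e.g. a Dask-backed xarray Dataset),
        # so analysis and visualisation follow naturally after the download.
        return getattr(cat, entry_name).to_dask()
    except Exception:
        return None


# "catalog.yml" and "greenland_velocity" are placeholder names.
ds = load_lazy("catalog.yml", "greenland_velocity")
```

The point is that the same few lines cover download, decoding, and handing the data to the analysis stack, which is hard to match with a pure data-management tool.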

Is that an important point? I think it is.

AdrienWehrle commented 2 years ago

At present, my choice goes for: Intake

mankoff commented 2 years ago

Not sure why this is an issue and not the discussion (#12) :).

Anyway, maybe we should build a prototype using both to better understand the pros and cons. The suggested initial list is in #4, but for our prototype I'm not sure those are good given their large size. How about the following five datasets:

I suggest the prototype include:

  1. A reproducible script (shell, Python, whatever) or Jupyter Notebook or equivalent documenting the steps needed to recreate the prototype. This also acts as one of our goals - a "How To Contribute" document.
  2. Notes on complications/issues
  3. Notes on how the tool (datalad, intake) handles or supports metadata and searching

Ideally, we each build two prototypes, so that we can each understand both tools for a decision/discussion.