cryo-data / discuss

Discussions for cryo-data

Which tool to load and share metadata? #13

Open AdrienWehrle opened 2 years ago

AdrienWehrle commented 2 years ago

We initially explored datalad, but other options are very interesting too:

datalad

Very powerful because it is built directly on git-annex, but I still haven't fully understood how to use it properly/efficiently. DataLad is a data management system, and only that (to my knowledge). It is very efficient because it concentrates on this one task, but that somewhat limits our application, or calls for combining it with other tools. Which might just be ok.
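For reference, the basic consumer-side DataLad workflow looks roughly like the sketch below (the dataset URL and file path are placeholders, not a real cryo-data dataset):

```shell
# Clone a dataset: this fetches the repository structure,
# while large file contents stay on the remote (git-annex).
datalad clone https://example.org/cryo-data-dataset my-dataset
cd my-dataset

# Retrieve the actual content of a file on demand
datalad get data/some_file.nc

# Drop the local copy again to free space (content stays retrievable)
datalad drop data/some_file.nc
```

This on-demand `get`/`drop` model is part of what makes DataLad powerful, but also part of what makes it harder for casual users to grasp.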

intake

A simple set of tools, but also powerful. Because it is simple, the community could easily contribute new catalog entries (through YAML files).

Intake is more than just a data management tool: not only is the data download step streamlined, but so is reading the data, through the many drivers available (and it is easy to implement new ones).
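As a sketch of what a community contribution could look like, an Intake catalog entry is just a few lines of YAML (the entry name, URL, and dataset below are hypothetical; the `netcdf` driver comes from the intake-xarray plugin):

```yaml
sources:
  greenland_velocity:
    description: Placeholder entry for an ice velocity product
    driver: netcdf
    args:
      urlpath: https://example.org/cryo-data/greenland_velocity.nc
```

A user would then load it with something like `cat = intake.open_catalog("catalog.yml")` followed by `cat.greenland_velocity.read()`, without needing to know how the download and decoding happen.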

pooch

Simple and similar to Intake, but data sources are not really treated as catalogs. Pooch was developed to download test data for libraries, so we might run into limitations for our metadata portal.
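For comparison, a minimal Pooch setup might look like the following sketch, using its standard `create`/`fetch` pattern (the base URL, file name, and hash are placeholders):

```python
# Minimal Pooch sketch: a registry maps file names to known hashes, and
# fetch() downloads + verifies a file on first use, then caches it locally.
# The URL, file name, and hash below are placeholders, not real data.
REGISTRY = {
    "greenland_velocity.nc": "sha256:" + "0" * 64,
}

try:
    import pooch

    fetcher = pooch.create(
        path=pooch.os_cache("cryo-data"),        # per-user cache directory
        base_url="https://example.org/cryo-data/",
        registry=REGISTRY,
    )
    # path_to_file = fetcher.fetch("greenland_velocity.nc")
except ImportError:
    fetcher = None  # pooch not installed
```

Note that the registry is a flat mapping of file names to hashes; there is no notion of per-dataset metadata or nested catalogs, which is the limitation mentioned above.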

This comparison will be further modified/refined.

AdrienWehrle commented 2 years ago

I see a couple of points that are important to consider for our choice of tool:

Is the simplicity of our backend important? For us, for the users?

a simple backend tool

Datalad is very powerful, but 99% of the users would probably not understand how the metadata portal actually works. The use of a CLI (and not only a GUI) might therefore be limited to only a small fraction of the community.

Intake is powerful but also simple, so it is easier for users to grasp. Because Intake is linked to e.g. Dask, analysis and visualisation are a natural and simple step after the download. Intake is pure Python, with all the implications that has.
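For example, the step from catalog to analysis could look like this sketch (the catalog path and entry name are hypothetical, and a suitable driver plugin such as intake-xarray is assumed):

```python
def load_lazy(catalog_path, entry_name):
    """Open an Intake catalog and return an entry as a lazy Dask-backed object.

    Returns None if intake is unavailable or the catalog cannot be read.
    """
    try:
        import intake

        cat = intake.open_catalog(catalog_path)
        # to_dask() gives a lazy object (e.g. a Dask-backed xarray Dataset),
        # so analysis and visualisation follow naturally after the download.
        return getattr(cat, entry_name).to_dask()
    except Exception:
        return None


# "catalog.yml" and "greenland_velocity" are placeholder names.
ds = load_lazy("catalog.yml", "greenland_velocity")
```

The point is that the same few lines cover download, decoding, and handing the data to the analysis stack, which is hard to match with a pure data-management tool.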

Is that an important point? I think it is.

AdrienWehrle commented 2 years ago

At present, my choice goes for: Intake

mankoff commented 2 years ago

Not sure why this is an issue and not the discussion (#12) :).

Anyway, maybe we should build a prototype using both to better understand the pros and cons. The suggested initial list is in #4, but for our prototype I'm not sure those are good given their large size. How about the following five datasets:

I suggest the prototype include:

  1. A reproducible script (shell, Python, whatever) or Jupyter Notebook or equivalent documenting the steps needed to recreate the prototype. This also acts as one of our goals - a "How To Contribute" document.
  2. Notes on complications/issues
  3. Notes on how the tool (datalad, intake) handles or supports metadata and searching

Ideally, we each build two prototypes, so that we can each understand both tools for a decision/discussion.