Turn metadata extractors into commands?

mih commented 5 years ago

I started refactoring the metadata code base. I increasingly dislike the special status of the extractors. They are essentially generators that yield JSON-serializable records -- just like any other command. Why not make them regular commands?

We would just need to define a minimal API that any extractor has to be compliant with.
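
To make the idea of a minimal API concrete, here is a rough sketch of what such a contract could look like. All names (`Extractor`, `ExtensionExtractor`, `run_extractor`) and the record keys are assumptions made for illustration; none of them are DataLad's actual classes or result format.

```python
import json
import os.path
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator


class Extractor(ABC):
    """Hypothetical minimal contract: calling an extractor yields records."""

    @abstractmethod
    def __call__(self, path: str) -> Iterator[Dict[str, Any]]:
        """Yield one JSON-serializable dict per extracted item."""


class ExtensionExtractor(Extractor):
    """Toy extractor that only reports the file extension of a path."""

    def __call__(self, path: str) -> Iterator[Dict[str, Any]]:
        yield {
            "path": path,
            "status": "ok",
            "metadata": {"extension": os.path.splitext(path)[1]},
        }


def run_extractor(extractor: Extractor, path: str) -> Iterator[Dict[str, Any]]:
    """Drive an extractor like any other record-yielding command."""
    for record in extractor(path):
        # json.dumps raises TypeError if the record is not serializable,
        # which enforces the only hard requirement of the contract.
        json.dumps(record)
        yield record
```

With this shape, the caller does not need to know whether it is consuming an extractor or any other generator of records.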

yarikoptic commented 5 years ago

Do you mean to make them datalad commands available through cmdline/python API, or what type of commands?

mih commented 5 years ago

Yes. But I don't care if they become available through the main API. Simply using the same classes is what I care about.

yarikoptic commented 5 years ago

That is what I had in mind -- I actually do not want them to become available as part of the main API: that would make it even slower and more bloated. But reusing more of the existing standardization of input API specification would IMHO be great (I always felt that way while expressing my confusion about new constructs such as plugins, procedures, etc. doing a similar thing but not re-using existing machinery). Extending it with an expected output spec (JSON schemas/validator?) and a mix-in for the extractor's API class, which each extractor should implement, sounds like a good idea.
I also wondered if we could right away marry our Interface specification with the config. I guess this desire to turn them into commands could have been triggered by the desire to make them parametric, e.g. "treat or not derivatives for BIDS dataset", "custom fields to exclude/include while processing DICOMs" etc. -- i.e. everything that we now hard-code. In such cases it makes sense to be able to specify all of those not only via the cmdline but also to prescribe them in the config, since typically they are to persist per dataset.
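
The output-spec and mix-in idea above could be sketched roughly as follows. This is only an illustration under assumptions: `RECORD_SPEC`, `validate_record`, `ValidatedOutputMixin`, the `extract` method name, and the `include_derivatives` parameter are all hypothetical, and the hand-rolled type check merely stands in for a real JSON schema validator.

```python
from typing import Any, Dict, Iterator

# Hypothetical output spec: required key -> expected Python type.
RECORD_SPEC = {"path": str, "status": str, "metadata": dict}


def validate_record(record: Dict[str, Any], spec=RECORD_SPEC) -> Dict[str, Any]:
    """Return the record unchanged, or raise if it violates the spec."""
    for key, expected in spec.items():
        if key not in record:
            raise ValueError(f"record is missing required key {key!r}")
        if not isinstance(record[key], expected):
            raise TypeError(f"{key!r} must be of type {expected.__name__}")
    return record


class ValidatedOutputMixin:
    """Mix-in: subclasses implement extract(); __call__ validates output."""

    output_spec = RECORD_SPEC

    def __call__(self, *args, **kwargs) -> Iterator[Dict[str, Any]]:
        for record in self.extract(*args, **kwargs):
            yield validate_record(record, self.output_spec)


class BidsLikeExtractor(ValidatedOutputMixin):
    """Toy extractor with one parameter that would otherwise be hard-coded."""

    def extract(self, path, include_derivatives=False):
        yield {
            "path": path,
            "status": "ok",
            "metadata": {"derivatives": include_derivatives},
        }
```

Keeping validation in the mix-in means an individual extractor only has to produce records, and the same check could be attached to any other command.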

With that in mind, although possibly only tangentially related, I think we should also enhance our "API builder" to automate interfacing sub-commands, e.g. similar to click's groups: https://click.palletsprojects.com/en/7.x/commands/ and somewhat mimicking the current "datalad siblings [-s|--name NAME] [ACTION]" behavior.
Here it could be datalad metadata-extractors, which would list available and/or enabled extractors if no specific one is specified. Actions could be similar to siblings -- "query" (default), "run", and even conveniences like "enable", "disable".

If we had such grouping available, then it could be extended even to the extensions, so there would be "datalad containers [run/..]" (with list being pretty much the default action "query").
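
A rough sketch of the grouping described above, using only the standard library's `argparse` rather than DataLad's actual API builder. The `metadata-extractors` command, its actions, and the `registry` mapping of extractor names to an enabled flag are assumptions taken from or invented for this discussion; no such command is claimed to exist.

```python
import argparse

ACTIONS = ["query", "run", "enable", "disable"]


def build_parser():
    parser = argparse.ArgumentParser(prog="datalad")
    commands = parser.add_subparsers(dest="command", required=True)
    # Hypothetical command group modeled on "datalad siblings".
    group = commands.add_parser("metadata-extractors")
    group.add_argument("-s", "--name", help="extractor to act on")
    group.add_argument("action", nargs="?", default="query", choices=ACTIONS)
    return parser


def dispatch(argv, registry):
    """Parse argv and return the lines that the command would print."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.name is not None and args.name not in registry:
        parser.error(f"unknown extractor '{args.name}'")
    if args.action == "query":
        # Without a name, list every known extractor and its state.
        names = [args.name] if args.name else sorted(registry)
        return [f"{n} ({'enabled' if registry[n] else 'disabled'})" for n in names]
    if args.name is None:
        parser.error(f"action '{args.action}' requires --name")
    if args.action == "run":
        return [f"would run extractor {args.name}"]
    registry[args.name] = args.action == "enable"
    return [f"{args.name} {args.action}d"]
```

For example, `dispatch(["metadata-extractors"], {"bids": True, "dicom": False})` takes the default "query" action and lists both extractors with their state.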

kyleam commented 5 years ago

With that in mind, although possibly somewhat only tangentially related, I think we should also enhance our "API builder" to automate interfacing sub-commands

GitMate.io thinks a possibly related issue is datalad/datalad#2729 (API: composite command).

mih commented 5 years ago

Cool, so we all agree!

Plan:

[x] keep thinking about it
[x] realize that this change will open up the possibility to parametrize extractors (like any other command)
[x] realize that output validation is orthogonal to the purpose/needs of extractors (they are beneficial for any command)
[x] realize that a configuration (storage) strategy can apply to any command and seems useful in general, i.e. have a helper that pulls all configuration regarding a specific command from the config in a uniform fashion and naming scheme.
[ ] realize that compound commands are a further, separate, and orthogonal issue ;-)
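
The configuration helper from the plan could look roughly like the sketch below. The naming scheme `datalad.<command>.<parameter>`, the function names, and the use of a plain dict in place of a real configuration manager are all assumptions for illustration, as is the precedence order (explicit arguments over config over defaults).

```python
from typing import Dict, Optional


def config_for_command(config: Dict[str, str], command: str,
                       prefix: str = "datalad") -> Dict[str, str]:
    """Collect every setting named '<prefix>.<command>.<parameter>'."""
    stem = f"{prefix}.{command}."
    return {
        key[len(stem):]: value
        for key, value in config.items()
        if key.startswith(stem)
    }


def resolve_parameters(defaults: Dict[str, str], config: Dict[str, str],
                       command: str,
                       explicit: Optional[Dict[str, Optional[str]]] = None
                       ) -> Dict[str, str]:
    """Merge parameter sources: explicit arguments > config > defaults."""
    params = dict(defaults)
    params.update(config_for_command(config, command))
    # Only arguments that were actually given (not None) override the config.
    given = {k: v for k, v in (explicit or {}).items() if v is not None}
    params.update(given)
    return params
```

Because the helper only depends on the command name, the same lookup would work for extractors and for any other command, which is the point made in the plan.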

mih commented 5 years ago

With datalad/datalad#3134 there is now a single API command extract_metadata that takes care of any extraction-related functionality (in contrast to aggregation, and access of aggregated metadata). I am now exploring whether it would be sensible and/or useful to make individual extractors commands too.

yarikoptic commented 4 years ago

Hi @mih. Is that something already supported by datalad-metalad, so that we should just retag (or just refile and close) this into datalad-metalad?

Use case: https://github.com/datalad/datalad-metalad/issues/55 which would need an R stack with many dependencies to extract metadata, so containers come to mind!

datalad / datalad-metalad

Turn metadata extractors into commands? #184