mslw commented 1 year ago

Tuning DataLad catalog

These are some questions that I have after working on catalog generation for a specific project. I don't think any of these are issues (in the sense of something not behaving correctly). Problems are about how to achieve a particular thing (for which I didn't find a way); questions are about the interpretation of configuration options.

Problems

Problem 1 (solved): I am listed as the superdataset author, but want to keep myself out of it

The information comes from metalad core (reflects git commit authorship), but I don't want to be listed among authors. Authorship information from metalad core and studyminimeta was merged by default. Solved by changing the source of author information from merge to metalad_studyminimeta in the config. This doesn't seem to affect datasets without studyminimeta files, which is good.

Problem 2: Funding identifier and description

I want this information to come from default extractors & straightforward translation. I tried several ways to specify funding identifier and description in studyminimeta, including like this:

funding:
    -
        name: DFG
        identifier: Project no. 1234
        description: CRC 1451 Motor control in health and disease

but the entire object was extracted as the name, which is not what I wanted.

Yet, this dataset page shows grant identifier, even though its studyminimeta file uses simple comma separation:

  funding:
    - German Research Foundation (DFG), DFG-SFB 1451 (Project-ID 431549029)
    - German Research Foundation (DFG), Research training group Neural Circuit Analysis (DFG-RTG 1960, grant no. 233886668)

which wouldn't work for me.

Problem 3: Project names are cut short

Superdataset's title is shown as SFB1451 - key mechanisms of mo...

Problem 4: Subdatasets are listed by folder names

It would be ideal to list them by project titles, but I guess there's little I can do other than find a clever way to supply metadata -- the superdataset, after all, is rendered from its own metadata only. Wontfix? Or is there a simple trick?

Questions

Question 1: Does configuration affect display, or `catalog add`?

At first glance it seems to me that changing the configuration option for property_source doesn't change the display of the catalog, unless I repeat catalog add for the given metadata. Which would make sense, as the catalog is static, right?

Question 2: What is the meaning of the `property_source` config option?

It seems to me that it is the following, although I'm not 100% sure:

A single entry: preferred source -- if the source is present, it would be used exclusively; if not then other sources will be used (but is merge the default then?)
A list of entries: alternatives that can be toggled if the field allows (works e.g. for description, but not authors). This is not an order of preferences, right?
merge -- self-explanatory

Question 3: How to update metadata entries

If I want to update (overwrite) an entry for a given dataset version and id (not a common situation in mature workflows, but let's say I fixed a bug in a translator) --- can I run datalad catalog add -c -m again, or should I delete metadata files first?

jsheunis commented 1 year ago

Thanks for going through this effort!

Some comments/answers/questions (I've added a TODO wherever I think there's something that requires a new/separate issue, feel free to add more):

Problem 1 (solved): I am listed as the superdataset author, but want to keep myself out of it

...Solved by changing the source of author information from merge to metalad_studyminimeta in the config. This doesn't seem to affect datasets without studyminimeta files, which is good.

Good that the config works in this case, even if implicitly. It could perhaps be useful to make this more explicit, e.g. what should the config be if not merge but also not a specific source? We could consider making None an option, although it's difficult to think about a practical use case for that (why would a user not want to display anything for a specific field?). Another option could be to specify exclusions (e.g. not metalad_core). Whatever changes are made, I think it is necessary to document this in detail (readme + docs + docstrings/comments). TODO

Problem 2: Funding identifier and description

...Yet, this dataset page shows grant identifier, even though its studyminimeta file uses simple comma separation:

I'm sorry about this being a bit misleading. I'm pretty sure that in that instance I actually edited the metadata afterwards, moving parts of the name string to the identifier and description fields so that they display like that in the catalog. From the catalog's perspective, there are three fields that can be populated. I constructed these myself based on what seemed to be pretty standard metadata fields for a grant. These can obviously be improved if more useful alternatives exist. But it is the job of the extractor and subsequent translator to populate these, and they can do whatever transformations they need to do with the original data in order to populate the fields that will pass catalog schema validation.

Therefore, to solve your specific problem, the studyminimeta extractor and/or translator would have to be updated.

Problem 3: Project names are cut short

Yes, this is correct. There is a JavaScript method that only displays the first 30 characters (I think) of a long dataset name. If this is problematic (seems like it could be), this method could be removed, changed, or made part of some configuration (i.e. cut off at a character limit vs. display the full name).
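To illustrate what a configurable cut-off could look like, here is a minimal sketch. It is written in Python for readability and is not the catalog's actual JavaScript method; the function name `shorten_name` and the `limit` parameter are made up for this example, and the limit of 30 is only the value recalled above.

```python
from typing import Optional


def shorten_name(name: str, limit: Optional[int] = 30) -> str:
    """Cut `name` to `limit` characters and append '...'.

    `limit=None` stands for a hypothetical "display full name" setting.
    """
    if limit is None or len(name) <= limit:
        return name
    return name[:limit] + "..."


# Hypothetical title, used only to exercise the function
title = "An example dataset with a rather long title"
print(shorten_name(title))              # cut off after 30 characters
print(shorten_name(title, limit=None))  # full name
```

With something like this, the choice between "cut off at a character limit" and "display full name" would reduce to a single config value.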

Problem 4: Subdatasets are listed by folder names

It would be ideal to list them by project titles, but I guess there's little I can do other than find a clever way to supply metadata -- the superdataset, after all, is rendered from its own metadata only. Wontfix? Or is there a simple trick?

I'm assuming this is in the Files tab, i.e. the file tree view? Yes, datasets would be displayed by their directory names. There's probably some value in displaying either (i.e. by directory name, which would reflect the actual file tree; or by dataset name, which could be more recognisable for humans). Currently, to save resources, the children of a particular node in a file tree are all written to a single file, and that file is fetched (and content rendered) whenever a node is expanded. The question is whether we should then also grab metadata for all children of that expanded node, because that would be needed if we want to display their dataset names (assuming such metadata exists and is already added to the catalog) instead of the directory names. Alternatively, their dataset names could be made part of the metadata in the parent node file, which means this would have to be done during catalog entry generation. We should look into whether/how this is possible. TODO
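A rough sketch of the second alternative (dataset names stored in the parent node file) might look like the following. The node layout and the `dataset_name` field are assumptions made for illustration only and are not taken from the catalog schema; `child_label` is a hypothetical helper.

```python
def child_label(child: dict, prefer_dataset_name: bool = True) -> str:
    """Pick the label shown for one child of a file-tree node.

    Falls back to the directory/file name when no dataset name was
    written into the parent node file.
    """
    dataset_name = child.get("dataset_name")  # hypothetical field
    if prefer_dataset_name and child.get("type") == "dataset" and dataset_name:
        return dataset_name
    return child["name"]


# Hypothetical parent node file with all children in one place
node = {
    "path": ".",
    "children": [
        {"type": "dataset", "name": "project_a",
         "dataset_name": "Project A: an example title"},
        {"type": "dataset", "name": "project_b"},  # no name available
        {"type": "file", "name": "README.md"},
    ],
}

labels = [child_label(child) for child in node["children"]]
print(labels)
```

The fallback matters because a subdataset's own metadata may not have been added to the catalog when the parent node file is written.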

Question 1: Does configuration affect display, or catalog add?

Only the latter. The only parts of the config that, when changed, will cause changes in the rendered catalog are those that affect styling, such as link colours, the logo, etc. What do you think should be the expected behaviour/functionality here?

Question 2: What is the meaning of the property_source config option?

Your understanding is pretty much correct, if I recall correctly.

A single entry: preferred source -- if the source is present, it would be used exclusively; if not then other sources will be used (but is merge the default then?)

Correct, but merge is not the default. The code says:

# If a non priority source is present
if (config_source != data_source) and (existing_value is None):
    return new_value
else:
    # TODO: figure out if this is expected/ideal behaviour or not
    return existing_value

This means: if the field value is not already set when a new value arrives through catalog entry generation, and if this source is not specified in the config as the single source of metadata, then the field will take on the value on a first-come, first-served basis. It will only be replaced if a new metadata entry generation process contains a value from the single priority source. We should discuss whether this is expected/ideal behaviour. TODO
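To make the rule concrete, here is a small standalone sketch that wraps the quoted condition in a function and replays three incoming metadata entries. The function name and arguments are made up for this example; the first branch (the priority source always wins) is not part of the quoted snippet and is only inferred from the description above.

```python
def resolve_single_source(config_source, data_source, existing_value, new_value):
    """Decide which value a field keeps when a new metadata entry arrives."""
    if data_source == config_source:
        # Assumption based on the prose: the priority source replaces
        # whatever is already there.
        return new_value
    if existing_value is None:
        # Quoted rule: a non-priority source fills an empty field.
        return new_value
    # Quoted rule: otherwise the existing value is kept.
    return existing_value


config_source = "metalad_studyminimeta"
incoming = [
    ("metalad_core", ["Curator"]),
    ("metalad_studyminimeta", ["Author A", "Author B"]),
    ("metalad_core", ["Curator"]),
]

authors = None
history = []
for source, value in incoming:
    authors = resolve_single_source(config_source, source, authors, value)
    history.append(authors)

print(history)
```

Under these assumptions the field first takes the metalad_core value, is then replaced by the studyminimeta value, and is left alone when metalad_core arrives again. A dataset that never receives a studyminimeta entry would simply keep the metalad_core value.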

A list of entries: alternatives that can be toggled if the field allows (works e.g. for description, but not authors). This is not an order of preferences, right?

Correct, a list that can be toggled if the catalog field allows. Not explicitly in order of preference, but they are rendered in the order in which they appear in the list. The desired order can be made more explicit during catalog generation, if we think that's a useful feature. TODO

Question 3: How to update metadata entries

The standard mode of operation is an update (i.e. not overwrite) of the content in the metadata files. There is a --force argument, but I would not currently trust that it behaves as expected under all circumstances. So, to be sure, currently I would delete and then regenerate the entries. But I think the --force or overwrite flag should be revisited and should allow overwriting of metadata files if specified by the user. TODO

There's a related issue about updating metadata content that provides wider context and highlights further challenges; it could be interesting for you to read, although it does not directly affect this particular feature.

mslw commented 1 year ago

Thanks for the detailed answer, it's very helpful. My comments are below. I tried to quote liberally but not excessively; I hope it's still readable.

Problem 1 (solved): I am listed as the superdataset author, but want to keep myself out of it

Good that the config works in this case, even if implicitly. It could perhaps be useful to make this more explicit, e.g. what should the config be if not merge but also not a specific source? We could consider making None an option, although it's difficult to think about a practical use case for that (why would a user not want to display anything for a specific field?). Another option could be to specify exclusions (e.g. not metalad_core). Whatever changes are made, I think it is necessary to document this in detail (readme + docs + docstrings/comments). TODO

Since there are already options for preferred (single value) and several togglable (list), another option could be, as you write, exclusions. Alternatively, an ordered list of several preferred sources, to be considered in this order (not sure how to represent it in json config). But I think that with existing options, anything additional would serve only very narrow situations.
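As for representation: JSON arrays keep their order, so an ordered preference could in principle be a plain array, as long as it can be told apart from the existing "togglable list" meaning. The sketch below only shows the lookup logic; `pick_by_priority` and the way values are grouped by source are hypothetical and not an existing config option.

```python
from typing import Any, Optional


def pick_by_priority(values_by_source: dict, ordered_sources: list) -> Optional[Any]:
    """Return the value from the first source in the list that provides one."""
    for source in ordered_sources:
        value = values_by_source.get(source)
        if value:
            return value
    return None


# Hypothetical order of preference for the authors field
order = ["metalad_studyminimeta", "metalad_core"]

with_both = {
    "metalad_core": ["Curator"],
    "metalad_studyminimeta": ["Author A", "Author B"],
}
core_only = {"metalad_core": ["Curator"]}

print(pick_by_priority(with_both, order))  # studyminimeta authors
print(pick_by_priority(core_only, order))  # falls back to metalad_core
```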

FTR, the reason for me excluding myself from author list is that I use metalad core extractor to get subdatasets of a superdataset that I created only for the purpose of having the catalog's homepage. Although I agree that curators deserve authorship credit in general, I want to remain "transparent" in this case of catalog-making only (and use a more "official" author list instead).

Problem 2: Funding identifier and description

I'm sorry about this being a bit misleading. I'm pretty sure that in that instance I actually edited the metadata afterwards, moving parts of the name string to the identifier and description fields so that they display like that in the catalog.

Perfectly understandable, I thought that's the case but wanted to make sure.

For my catalog I am trying to avoid manual corrections if possible, and limit myself to extract -> translate -> add (maybe I shouldn't be so rigid?).

From the catalog's perspective, there are three fields that can be populated. I constructed these myself based on what seemed to be pretty standard metadata fields for a grant. These can obviously be improved if more useful alternatives exist. But it is the job of the extractor and subsequent translator to populate these, and they can do whatever transformations they need to do with the original data in order to populate the fields that will pass catalog schema validation.

Therefore, to solve your specific problem, the studyminimeta extractor and/or translator would have to be updated.

Agreed. Although now my real question would be whether the studyminimeta specification allows a detailed description; it seems to me that it doesn't.

In fact I wrote my own translator for studyminimeta, but I tried to maintain parity with your translator which is a part of catalog workflow (in fact, I reused many of the jq calls). Now, I could make it so that my translator would split the extracted "name" by commas (or any other characters) into the three fields, but I feel that would be too "opinionated" for a translator, and an abuse of the minimeta specification.
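For illustration, the comma-splitting idea could be sketched as below. This is not part of any existing translator; `split_funding` is a hypothetical helper, and the two input strings are the funding entries quoted earlier in this thread.

```python
def split_funding(raw: str) -> dict:
    """Naively map comma-separated parts to name, identifier, description."""
    parts = [part.strip() for part in raw.split(",")]
    keys = ("name", "identifier", "description")
    # zip() stops at the shorter sequence, so missing parts are simply absent
    return dict(zip(keys, parts))


first = split_funding(
    "German Research Foundation (DFG), DFG-SFB 1451 (Project-ID 431549029)"
)
second = split_funding(
    "German Research Foundation (DFG), Research training group "
    "Neural Circuit Analysis (DFG-RTG 1960, grant no. 233886668)"
)

print(first)
print(second)
```

The first entry splits cleanly into a name and an identifier, but the second one contains a comma inside the parentheses, so the identifier and description end up with unbalanced brackets. That is the kind of guesswork that makes this feel too opinionated for a translator.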

I was also thinking of coming up with my own extractor/translator, for my own custom metadata files (which would have fields matching the catalog schema), so that I could inject the information I like (funding details) but still have it coming 100% from meta-extract. But I'd rather stay with defined metadata standards.

Does either of the two approaches above sound better to you?

As a side note, I don't know any specific standards for grant metadata, but I think it's a good idea with name, identifier, and description. Also, it's nice to be explicit when filling it in (regardless of file formats), considering that according to this blog post e.g. the acronym ‘NSF’ yields RORs for six organizations.

Problem 3: Project names are cut short

Yes, this is correct. There is a JavaScript method that only displays the first 30 characters (I think) of a long dataset name. If this is problematic (seems like it could be), this method could be removed, changed, or made part of some configuration (i.e. cut off at a character limit vs. display the full name).

I think 30 is quite short - good for an overview, less ideal if you want a prominent display. I think there is room for more in the page for most window sizes, and having an option to control that would be great, if it's easy to add. TODO?

Problem 4: Subdatasets are listed by folder names

I'm assuming this is in the Files tab, i.e. the file tree view? Yes, datasets would be displayed by their directory names. There's probably some value in displaying either (i.e. by directory name, which would reflect the actual file tree; or by dataset name, which could be more recognisable for humans). Currently, to save resources, the children of a particular node in a file tree are all written to a single file, and that file is fetched (and content rendered) whenever a node is expanded. The question is whether we should then also grab metadata for all children of that expanded node, because that would be needed if we want to display their dataset names (assuming such metadata exists and is already added to the catalog) instead of the directory names. Alternatively, their dataset names could be made part of the metadata in the parent node file, which means this would have to be done during catalog entry generation. We should look into whether/how this is possible. TODO

Looks non-trivial, we'd probably need to store metadata in the subdataset and make it bubble up to the superdataset? I don't have particular ideas, and wouldn't call it a priority.

Question 1: Does configuration affect display, or catalog add?

Only the latter. The only parts of the config that, when changed, will cause changes in the rendered catalog are those that affect styling, such as link colours, the logo, etc. What do you think should be the expected behaviour/functionality here?

I think the current behavior is the expected behavior, and have no suggestions here. I mean my first thought was that maybe the display would change if I changed the config, but since the page is meant to be static it makes sense that we apply the configuration when doing datalad catalog add.

Question 2: What is the meaning of the property_source config option?

(...) This means: if the field value is not already set when a new value arrives through catalog entry generation, and if this source is not specified in the config as the single source of metadata, then the field will take on the value on a first-come, first-served basis. It will only be replaced if a new metadata entry generation process contains a value from the single priority source. We should discuss whether this is expected/ideal behaviour. TODO

I think this is logical, and as stated before, the only missing part could be an explicit explanation in docs / docstrings.

Question 3: How to update metadata entries

Thanks for the explanation.

jsheunis commented 3 months ago

Closing since this issue served its purpose.

datalad / datalad-catalog

Questions: fine-tuning the catalog #211
