IQSS / dataverse

Open source research data repository software
http://dataverse.org

Life Sciences metadata #6359

Closed sallyjtaylor closed 3 years ago

sallyjtaylor commented 4 years ago

The Dataverse North Metadata group has suggestions to improve the Life Sciences metadata block.

  1. Add "Other" to the drop-down menu for "Design Type". (Note: there is "Unspecified" which has a different meaning).
  2. Add "Other Design Type" box.
  3. Add "Other Factor Type" box.
  4. Add "Other Technology Type" box.
  5. Add "Other Technology Platform" box.
  6. For consistency, list Other at the end of all drop-down menus (vs. alphabetical as in Factor Type).

Thank you, Sally

johnhuck commented 3 years ago

We are still interested in seeing this move forward. Here is a bit more about the motivation for the feature request. The intention is to ensure that every drop-down field in the Life Sciences block:

The Organism element is set up like this now.

From a coding perspective, the work involves adding 4 new metadata elements for the free text boxes, an "Other" value in one of the controlled lists, and reordering values in another.

jggautier commented 3 years ago

Hi @johnhuck. Would you or a colleague be interested in editing the biomedical metadatablock and submitting a PR?

The only other thing that I think would need to be done is include the usual metadatablock update instructions in the Dataverse release notes, which I think @djbrooke takes care of. Is that right, @djbrooke?

Also, now that the Dataverse metadata form uses a field component that makes it easier to find a term from a large list of terms (a combination of a search box and dropdown list), do you think there are additional terms that could be added to the vocabularies in the biomedical metadatablock? For example, I took a look at the vocab terms in the "Organism" field and what depositors from 36 Dataverse repositories enter in the "Other Organism" field, and I wonder if a lot more terms could be added to the Organism vocab.

Lastly, maybe this will be addressed in another github issue (maybe one related to better support for controlled vocabs) or addressed by the planned metadata working group, but if the field component let people both choose terms from a controlled vocab and enter their own terms I think it would be a lot better UX for the depositor and better for exporting this metadata in other standards (ISA-TAB?) if that's ever done.

johnhuck commented 3 years ago

Hi @jggautier!

Yes! @amberleahey has indicated that Scholars Portal is interested in creating a pull request for this work, and I was updating this issue partly at her request to set the stage for that. Our Dataverse North metadata group is working on the Life Sciences section of our best practices guide right now, so this is another reason we are thinking about these fields.

Here are a few other thoughts and responses to your questions.

1) Expanding the existing set of terms in the controlled lists is certainly something to consider, although it could/would be a large task. I think it should be a separate process, and maybe something for the new metadata group to tackle. The scope for potential changes is vast, if "life sciences" is interpreted broadly. These changes would add practical flexibility in the meantime.

2) I think larger questions may come up in the process of revising the lists. I think it's true that the terms are taken from a variety of sources (that's what I conclude from the presence of various term identifiers listed in the metadata documentation, e.g., NCBITaxon_6239 for Caenorhabditis elegans). Is that right? If so...

2a) I think the lineage/source of the terms could be more transparent in the metadata itself (so that context travels), like it is for Keyword and Topic Classification. I'm sure there would be multiple approaches to consider for implementing this. I'm thinking of the term identifier (whether IRI or not) and identifying the source vocabulary (especially when the identifier is not an IRI). In general, working with term identifiers is best for reducing ambiguity, but humans need the strings. Handling the pair is always the issue.

2b) We discussed in our group whether the free-text fields for "Other" terms should actually be 2 or 3 fields, again, like for "Keyword" (one for the string, one for the term identifier and one for the term source). And there was support for this idea in our group. I personally think it would be better to set this idea aside for the moment, and consider it later in the broader context of improving term transparency I mention above. If this is tackled later, it keeps the current work relatively focused. But this may be a point that others have differing opinions on, so it's open for discussion.

2c) Judging from the term identifiers, some fields mix and match (e.g., studyDesignType) terms from different sources, which somewhat goes against the principle of controlled vocabularies, which is that they are "controlled" for things like redundancy, conceptual overlap, comprehensive domain coverage, etc., so as to be internally consistent. I wouldn't want to say it's impossible for two or more to be combined, but I think it's worth doing thoughtfully, with some analysis to ensure that goals and needs are being met. This analysis would then provide a basis for making consistent decisions in the future, in the event that the arrangement/implementation needs to be modified. Again, I think this type of question is rather large, and is the type of question to take up in the community metadata group, I think.

3) FWIW In the guide we are working on, we have identified this vocabulary as a recommended source of terms for other cell types: https://www.ebi.ac.uk/ols/ontologies/cl

Cheers!

lubitchv commented 3 years ago

Hi @johnhuck and @jggautier. Scholars Portal is indeed interested in making a pull request for this issue. I am going to do that.

lubitchv commented 3 years ago

Hi @johnhuck and @jggautier

I am adding the "Other" boxes. What should I put as tooltip descriptions for these fields? For example, "Other Organisms" has: "If Other was selected in Organisms, list any other organisms that were used in this dataset. Terms from NCBI Taxonomy are recommended". So, for the "Other Design Type" box, I could put: "If Other was selected in Design Type, list any other design types that were used in this dataset." What should the recommended terms be? The same question applies to the rest of the "Other ..." boxes.

Also, what should the identifier be for the "Other" value in "Design Type"? "OTHER_DESIGN" is already taken by "Not Specified".

jggautier commented 3 years ago

@johnhuck, is it safe to assume that the identifiers for the "Not Specified" value in "Design Type" fields, as well as identifiers of the existing "Other" values in other fields, like "OTHER_FACTOR" and "OTHER_MEASUREMENT", were created by Dataverse and don't come from any published controlled vocabulary? I found this Google Doc listing what looks like potential values for the Study Type field and followed some of the links to the BioPortal's list of ontology classes and couldn't find these "Other" values.

If Dataverse made up the identifiers for these "Other" values, I wonder if (for now) it would be okay to adjust these identifiers in the Life Science metadatablock, such as:

I'm looking into whether Dataverse considers these identifiers when importing/harvesting these fields, and whether changing the existing identifiers would make Dataverse fail to index values whose identifiers differ from the identifiers in the metadatablock of the Dataverse repository doing the importing. I wouldn't think so, since these identifiers don't show up in any metadata exports.

About your question in your comment from last week:

> I think larger questions may come up in the process of revising the lists. I think it's true that the terms are taken from a variety of sources (that's what I conclude from the presence of various term identifiers listed in the metadata documentation, e.g., NCBITaxon_6239 for Caenorhabditis elegans). Is that right? If so...

I think so too. They're taken from a variety of sources. The idea that "the lineage/source of the terms could be more transparent in the metadata itself (so that context travels)" is related to the push for Dataverse to follow linked data principles.

johnhuck commented 3 years ago

Hi @lubitchv and @jggautier ! :-) Glad that we will be working on this together.

As a general comment: I wonder if for now we should set aside any questions of recommending vocabularies for the "Other" fields and just add tooltips like "If Other was selected in Design Type, list any other design types that are used in this dataset."

The bottom line goal for this PR in my mind is to give users the ability to say something that is not in the lists. Users may or may not follow a recommendation anyway (my guess is most of the time they won't, and I don't necessarily see that as a bad thing). A free text field doesn't prevent them from using a term from a vocabulary either.

Recommending sources for terms goes beyond the basic goal for this PR. It is a much bigger task. I would want to put that question to a working group that could investigate, consult, debate options, etc. I think going down that path will inevitably mean looking at the bigger questions I raised last week of where the terms currently in the lists come from in the first place for these fields, the advisability of mixing from multiple vocabularies in the first place, etc. I think it will be easier to recommend vocabularies once those questions are resolved, the original strategy that shaped the formats of these lists is better understood and a current strategy agreed.

Julian, I was hoping that you would know the answers to the questions you ask, lol! (about terms originating with Dataverse itself).

I like your suggestion to change the identifier for "Not Specified" so that it matches, so that the "OTHER_DESIGN" identifier is freed up. As long as the database stores the names/values and not identifiers, this would not affect any existing records that include that value.

Incidentally, regarding BioPortal: our WG has looked at it. The thing about it is that it is not a single vocabulary, it is a registry/database of hundreds of ontologies, some of which you could call vocabularies. Maybe that's a discussion for another day!

jggautier commented 3 years ago

I agree with the tooltip for design type: "If Other was selected in Design Type, list any other design types that are used in this dataset." And that recommending other sources for terms could be addressed another time outside of this issue.

> Julian, I was hoping that you would know the answers to the questions you ask, lol! (about terms originating with Dataverse itself).

Ha, I understand. I couldn't find any past discussion about where these "Other" terms and their identifiers came from, but was hoping you or a colleague or someone more familiar with these vocabularies would know, or know how to navigate BioPortal better than I can to find out. I'll try to find out who was consulted about using these terms and see if they can say for sure.

johnhuck commented 3 years ago

Another thought: @jggautier, if you weren't finding an "other" value in those ontologies, that makes sense to me, because with an ontology, where you are modelling a domain with classes and individuals, I wouldn't think you would create a class for "other". You would probably rely on sub/super-classing to make a more general assertion about a method or whatever. Not a definitive answer to the question, but just an observation.

jggautier commented 3 years ago

I agree. Thanks! All of the classes I've found with a label that includes "Other" include sub-classes (like other acquired skin disease). And there are many other identifiers in the life sciences metadatablock that seem like Dataverse created them. I'm reaching out to others, but I'm inclined to say it should be okay for this issue to change the identifiers for the "Other" terms we've mentioned.

@johnhuck and @lubitchv, does that sound good?

johnhuck commented 3 years ago

It's only studyDesignType where we need to change/add a list value.

Besides that, the display order numbers for the values in studyFactorType need to be renumbered to make "Other" last (i.e., =19).

Otherwise, it's just adding 4 free-text fields (i.e., "boxes").
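To make the renumbering step concrete, here is a minimal Python sketch. It assumes controlled vocabulary rows shaped as field name, value, identifier, display order, and that display order starts at 0. The sample rows and the `EXAMPLE_*` identifiers are hypothetical and not copied from the real block; only `OTHER_FACTOR` comes from this thread.

```python
import csv
import io

# Hypothetical sample rows: field name, value, identifier, display order.
# These are NOT the real studyFactorType entries; they only illustrate the shape.
SAMPLE_TSV = (
    "studyFactorType\tAge\tEXAMPLE_AGE\t0\n"
    "studyFactorType\tOther\tOTHER_FACTOR\t1\n"
    "studyFactorType\tTreatment Type\tEXAMPLE_TREATMENT\t2\n"
)


def move_other_last(rows, other_value="Other", start=0):
    """Return copies of rows with `other_value` moved last and display order renumbered.

    `start` is the first display order number (assumed to be 0 here).
    """
    # sorted() is stable: every non-"Other" row keeps its relative position,
    # and the "Other" row (key True) sorts after all of them (key False).
    ordered = sorted(rows, key=lambda row: row[1] == other_value)
    return [
        [row[0], row[1], row[2], str(start + index)]
        for index, row in enumerate(ordered)
    ]


rows = list(csv.reader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
result = move_other_last(rows)
for row in result:
    print("\t".join(row))
```

Run against the full studyFactorType list, the same function would leave "Other" with the highest display order number.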

johnhuck commented 3 years ago

As long as we can confirm that it's the "title" (i.e., "Not specified") (as it's called in the Dataverse Metadata v4.x spreadsheet) that is recorded in dataset metadata (in the database) and not the "identifier" (i.e., "OTHER_DESIGN"), then I see no problem with changing the identifier associated with "Not specified". I think this is the case.

But if this is not the case, then we shouldn't change the identifier, because it would change the meaning of any metadata records that record it as a value. Does that make sense?

jggautier commented 3 years ago

Yes, makes sense to me. Both the titles (called strvalue in the database) and their identifiers are stored in the database, in the controlledvocabularyvalue table:

[Screenshot of the controlledvocabularyvalue table, 2020-09-17]

If an identifier is changed in a metadatablock TSV, I think in the process of following the usual instructions for updating existing metadatablocks, the identifiers are changed in that database table in the repository's database as well. And I think in the database, a dataset's metadata is linked to the controlledvocabularyvalue.id, and not the controlledvocabularyvalue.identifier, so changing the identifier shouldn't matter.
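A small sketch of that reasoning, using an in-memory SQLite database. The `controlledvocabularyvalue` columns (`id`, `strvalue`, `identifier`) follow the description above. The `dataset_value` table and the replacement identifier `NOT_SPECIFIED_DESIGN` are hypothetical stand-ins, not the real Dataverse schema or an agreed identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE controlledvocabularyvalue (
        id INTEGER PRIMARY KEY,
        strvalue TEXT,
        identifier TEXT
    );
    -- Hypothetical stand-in for however a dataset's field value
    -- references a vocabulary row (by id, not by identifier).
    CREATE TABLE dataset_value (
        dataset TEXT,
        controlledvocabularyvalue_id INTEGER
            REFERENCES controlledvocabularyvalue(id)
    );
    INSERT INTO controlledvocabularyvalue VALUES (1, 'Not Specified', 'OTHER_DESIGN');
    INSERT INTO dataset_value VALUES ('example-dataset', 1);
    """
)


def title_for(dataset):
    """Resolve the displayed title of a dataset's value through the id link."""
    row = conn.execute(
        "SELECT v.strvalue FROM dataset_value d "
        "JOIN controlledvocabularyvalue v ON v.id = d.controlledvocabularyvalue_id "
        "WHERE d.dataset = ?",
        (dataset,),
    ).fetchone()
    return row[0]


before = title_for("example-dataset")
# Change only the identifier, as a metadatablock update might do.
conn.execute(
    "UPDATE controlledvocabularyvalue "
    "SET identifier = 'NOT_SPECIFIED_DESIGN' WHERE id = 1"
)
after = title_for("example-dataset")
print(before, "|", after)
```

If the real link works this way, the dataset still resolves to the same title after the identifier changes, which is the behaviour described above.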

But I'll ping @scolapasta about this. (Ping! =) )

lubitchv commented 3 years ago

@jggautier Yes, I agree with you. I also looked at the database and the code. Both the identifiers and the titles are recorded in the database, but Dataverse displays the title (not the identifier). The JSON export of the metadata also contains the title and not the identifier. So reloading the metadatablock and Solr should be enough in that case.
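For reference, a hedged Python sketch of the metadatablock reload request. It assumes the admin endpoint `/api/admin/datasetfield/load` described in the Dataverse guides for updating metadata blocks and a local installation at `http://localhost:8080`; both should be checked against the guide for your version. The helper name and the placeholder TSV bytes are hypothetical, and the Solr update is a separate step that is not shown.

```python
import urllib.request


def build_metadatablock_load_request(tsv_bytes, base_url="http://localhost:8080"):
    """Build (but do not send) the POST request that uploads a metadatablock TSV.

    The endpoint path and the base URL are assumptions taken from the
    Dataverse guides; adjust them for your own installation.
    """
    return urllib.request.Request(
        url=f"{base_url}/api/admin/datasetfield/load",
        data=tsv_bytes,
        headers={"Content-type": "text/tab-separated-values"},
        method="POST",
    )


# Placeholder bytes; in practice this would be the contents of the edited TSV file.
req = build_metadatablock_load_request(b"#metadataBlock\tname\n")
print(req.get_method(), req.full_url)
# To send it to a running installation: urllib.request.urlopen(req)
```

The request is only built here, not sent, so the sketch can be run without a Dataverse instance.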

johnhuck commented 3 years ago

@jggautier I didn't have the right words to describe the inner workings, but I think you took my meaning.