IQSS / dataverse

Open source research data repository software
http://dataverse.org

Please add Darwin Core support #6243

Open kamil386 opened 4 years ago

kamil386 commented 4 years ago

I downloaded the spreadsheet https://docs.google.com/spreadsheets/d/1P9xvaRLhCKsYmjz9eXXVl0T9d2U34UgynbvxDp-2Bjc/edit#gid=1331272861 as TSV

and ran: curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file /tmp/Comparative\ Zoology\ _\ Darwin\ Core\ Metadata\ -\ Sheet2.tsv but received this error response: {"status":"ERROR","message":"For input string: \"\""}

That custom metadata block for Darwin Core does not work. Comparative Zoology _ Darwin Core Metadata - Sheet2.zip
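One possible reading of that message, offered as a guess rather than a diagnosis: `For input string: ""` is the text Java's `NumberFormatException` produces when an empty string is parsed as a number, so an empty cell in a numeric column is a plausible cause. The sketch below scans a metadata block TSV for such cells. It assumes the usual layout in which section header rows start with `#` (for example `#datasetField`) and that `displayOrder` is the numeric column of interest; the function name and the sample rows are hypothetical.

```python
import csv
import io


def find_empty_numeric_cells(tsv_text, numeric_columns=("displayOrder",)):
    """Return (row_number, column_name) pairs for empty cells in numeric columns.

    Assumption: a row whose first cell starts with '#' is a section header
    that names the columns for the rows that follow it.
    """
    problems = []
    header = None
    # QUOTE_NONE keeps quote characters in descriptions from confusing the parser.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row_number, row in enumerate(reader, start=1):
        if not any(cell.strip() for cell in row):
            continue  # skip blank rows
        if row[0].strip().startswith("#"):
            header = [cell.strip() for cell in row]
            continue
        if header is None:
            continue  # data before any section header: nothing to check against
        for column in numeric_columns:
            if column not in header:
                continue
            index = header.index(column)
            value = row[index].strip() if index < len(row) else ""
            if value == "":
                problems.append((row_number, column))
    return problems


# Hypothetical sample: the second data row has no displayOrder value.
sample = (
    "#datasetField\tname\ttitle\tdisplayOrder\n"
    "\tfoo\tFoo\t0\n"
    "\tbar\tBar\t\n"
)
print(find_empty_numeric_cells(sample))
```

If the list comes back empty, the cause is probably elsewhere, for example in a column this sketch does not check.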

jggautier commented 4 years ago

Hi @kamil386. I don't know what this error message means.

There are a few issues I see with the spreadsheet you linked to, but I don't know if any of them are what's causing that error response:

This GitHub issue title first made me think you were asking if Darwin Core could be added as a metadata block that comes with Dataverse (instead of a custom metadata block that an installation would add). Is that the case? Or are you only asking for help adding the metadata block to a Dataverse installation you're setting up?

djbrooke commented 4 years ago

Thanks @jggautier !

kamil386 commented 4 years ago

I was asking whether Darwin Core could be added as a metadata block that comes with Dataverse, or more precisely whether Dataverse could support Darwin Core natively. Dataverse with Darwin Core support could be adopted much more easily by the many public institutions dealing with biodiversity. It would also be nice to mark one of the planned features as completed :) https://dataverse.org/files/dataverseorg/files/iassistposter2016ecastro.pdf

But it would be a good starting point if adding this custom metadata block to a Dataverse installation worked, even just for testing purposes in the meantime.

pdurbin commented 4 years ago

The list of five metadata blocks that ship with Dataverse at http://guides.dataverse.org/en/4.16/user/appendix.html has been frozen in time for a while. Recently for #3976 we added a sixth to that list, a "Journals" metadata block which has been available for a while but was never included in the list.

The idea was never that the list of metadata blocks that ship with Dataverse would be frozen. The idea was that we'd put a few out there that are important to our original user base (social science) and others that we have some experience with (astronomy, etc.) to show what's possible. Then we would work with the community to add additional metadata blocks as "official" blocks that ship with Dataverse.

@kamil386 my question to you is, is the Darwin Core support that you found in that spreadsheet good enough? Should we ship it? Or do you think it needs more work? Thanks!

jggautier commented 4 years ago

@kamil386, I thought at first that you created that spreadsheet, but @pdurbin made me take a second look and I see now that someone on the Dataverse team did. Sorry about that. I second @pdurbin's interest in learning what you think about the metadata. It seems to be a subset of Darwin Core. Do you think it's an appropriate subset? Are fields missing?

Also, were you trying to add it to your own installation so that you could see what it looks like in the UI?

kamil386 commented 4 years ago

First of all, thank you for your interest in this topic.

I think that the Natural History Museum in London has the reference DarwinCore subset of fields we should follow, as it probably has the largest collection of specimens across the full range of biodiversity (Botany, Entomology, Zoology, Palaeontology, Mineralogy). NHM has chosen 71 fields from DarwinCore, but to my knowledge even they don't use all of them.

The scientists around the Bialowieza Forest have a precious and unique collection of specimens they want to digitize. They have been working on a DC schema for some time and haven't yet chosen a sufficient subset of fields.

@pdurbin @jggautier This DC subset isn't good enough and needs more work. Can we postpone this issue for a while? I think we can collaborate on an appropriate subset. Most importantly, Dataverse's template logic makes it possible to choose a further subset of that subset, better suited to adding customized metadata, e.g. to collections of mushrooms or skulls.

@jggautier I was trying to add it to our installation, but because of this string error none of the fields appeared in the UI. As you noticed, we should probably keep the number of fields in the UI to the necessary minimum. What do you think?

We also have plans to build an additional tool on top of Dataverse that will be useful for biodiversity portals and will also follow the NHM approach. @pdurbin @jggautier Will the fields from the new DarwinCore metadata block be searchable via the API or OAI-PMH?

pdurbin commented 4 years ago

@kamil386 thanks for your continued interest in Darwin Core support! I told @jggautier last week and I meant to tell you that you shouldn't feel like you have to clean up the proposed metadata block alone. I would suggest emailing https://groups.google.com/forum/#!forum/dataverse-community and asking if anyone would like to help you take a look at the spreadsheet you found and help you make that Darwin Core custom metadata block production ready.

Yes, Darwin Core fields will be searchable. No, fields will not be harvestable via OAI-PMH (some future work on the code side would be needed, please feel free to open an issue with a title like "allow fields from custom metadata blocks to be harvestable via OAI-PMH"). Thanks!
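As a rough illustration of what field-level search could look like, the sketch below builds a Search API URL that restricts the query to one metadata field. The field name `dwcScientificName` and the value are hypothetical, and the `field:"value"` query form is an assumption about how a custom block's fields end up indexed. No request is sent; the function only builds the URL.

```python
from urllib.parse import urlencode


def build_search_url(base_url, field_name, value, result_type="dataset"):
    """Build a Search API URL that matches `value` in a single metadata field.

    Assumption: the installation exposes /api/search and the custom field
    is indexed under the name given in the metadata block TSV.
    """
    query = f'{field_name}:"{value}"'
    params = urlencode({"q": query, "type": result_type})
    return f"{base_url.rstrip('/')}/api/search?{params}"


# Hypothetical field name and value, for illustration only.
url = build_search_url("http://localhost:8080", "dwcScientificName", "Bison bonasus")
print(url)
```

The resulting URL can be passed to curl or any HTTP client once the block has been loaded and the Solr schema updated.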

pdurbin commented 4 years ago

@kamil386 hi! I'm just checking in. If you're willing, I still think it would be good for someone to email the dataverse-community list to try to get some discussion going.

I did mention Darwin Core in a recent post I made to that list called "custom metadata blocks now easier to spin up and evaluate" at https://groups.google.com/d/msg/dataverse-community/uKretKox_io/4FyPVAMYBgAJ

kamil386 commented 4 years ago

Olga Kurek from the Mammal Research Institute in Bialowieza (MRIPAS) created a DarwinCore schema, and after some tweaking, testing and multiple DB rollbacks I hope it is now production ready :)

Version without groups: https://docs.google.com/spreadsheets/d/1P2y8Kz9pDJlZhPiZiT5EoJAeZGvUOPR1gZh8T3gIMw0/edit?usp=sharing

Version with groups (parent): https://docs.google.com/spreadsheets/d/1p_myNEdbV-afBaF7D-I__-3CyybY8oiR3zP2VsyrKqU/edit?usp=sharing

Why isn't it possible to deselect single fields within a group of fields (fields with the same parent)? If this is by design in Dataverse, we would like to contribute the DwC version without groups. Could you ship the new metadata block in a future release of Dataverse?

@jggautier And here is a screenshot of how the DwC schema looks in the Dataverse UI: image

jggautier commented 4 years ago

Great news! Thanks @kamil386 and Olga!

Why isn't it possible to deselect single fields within a group of fields (fields with the same parent)?

Could you write more about what you mean?

I wonder if some of the fields are duplications of existing fields, such as License and Rights Holder, and if this duplication will confuse depositors and lessen the amount of metadata that Dataverse exports in other metadata standards (or make mapping to those standards a little more complicated).

kamil386 commented 4 years ago

I mean that you can't deselect a single subfield from the "group" (strictly speaking, the parent field), as shown in the screenshot. I can select or deselect the whole "Geographic Coverage" group with all the subfields belonging to it, but I can't deselect, e.g., the "Geographic Coverage Other" subfield alone. There is no way to select only the few necessary subfields from a group of, say, 50 subfields. The workaround is to use the DwC schema without groups/parents if Dataverse can't handle this case.

image

Yes, License and Rights Holder and a few others duplicate or resemble existing fields (not technically, since names are globally unique in Solr), but we think the block should be compatible and consistent with the original DarwinCore schema. Of course, thanks to this great feature of Dataverse, users will be able to create their own subset of fields for their datasets according to their needs. We are open to discussion and to finding the best solution.

jggautier commented 4 years ago

Thanks for the screenshot. I see what you mean now about not being able to deselect a subfield. So in the example you gave, you imagine that a dataverse admin might want to hide/deselect the City field, so that dataset depositors don't see that field in the Geographic Coverage "parent" or compound field.

Screen Shot 2020-01-21 at 10 50 51 AM

You're right - it's not possible right now to deselect or hide a subfield of a parent field, and I couldn't find a GitHub issue that requests this functionality, so perhaps we could open an issue? I could see why it would be important for this Darwin Core metadata block, since one parent field has 21 subfields, and another has 44, and a depositor could be overwhelmed by the number of fields that she may not need to be concerned about.

But the Coverage group (or compound field) is a good example to talk about how Dataverse knows when each of these subfields are part of the same parent field, in this case Coverage. If each subfield is instead its own parent field, currently Dataverse won't know that a given Country/Nation and State/Province is part of the same Coverage. So the structure of that metadata should look something like:

But instead will be flat, like:

(This structure is actually lost in some of Dataverse's metadata exports, which I consider a bug and I think is reported in other GitHub issues.)
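The difference between a grouped and a flat structure can be sketched in a few lines. This is an illustration only, not Dataverse's actual export format: the field names and values are hypothetical, and `flatten` stands in for any export that drops the parent grouping.

```python
# Two coverages entered as compound (parent) values: each dict keeps the
# subfields of one coverage together. Names and values are hypothetical.
nested_coverages = [
    {"country": "Poland", "state": "Podlaskie"},
    {"country": "Belarus"},  # no state entered for this coverage
]


def flatten(compound_values):
    """Collect subfield values per subfield name, discarding the grouping."""
    flat = {}
    for group in compound_values:
        for subfield, value in group.items():
            flat.setdefault(subfield, []).append(value)
    return flat


flat_coverages = flatten(nested_coverages)
print(flat_coverages)
```

In the flat form there are two countries and one state, and nothing records which country the state belongs to. That pairing is the information the parent field carries.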

The tsv with the groups or compound fields includes a parent field called Occurrence that has 21 subfields and another parent field called Location with 44 subfields, and multiple values are not allowed for any of those parent fields or their subfields (allowMultiples is set to FALSE), so maybe this won't be a problem for that metadata. That is, maybe losing the relationship between parent and subfields won't be an issue if each dataset is only ever describing one "Occurrence" or one "Location." Does that make sense?

Regarding duplicate fields, I think it's optimistic to think that dataverse owners will know that duplicate fields exist, especially in self-curated repositories, so for example it's optimistic to think that dataverse owners would know that the License field in the Darwin Core metadata block can practically store the same information (or conflicting information) as the CC0 Waiver or Terms of Use fields in the Terms metadata tab, and that they will take that into account when customizing their dataverse's metadata fields and giving their depositors instructions. But this is just my hunch after seeing how metadata is entered in a repository with a lot of "self-curated" datasets.

A more solid problem is that anything entered in the new Darwin Core fields isn't mapped to fields in the metadata that Dataverse exports. For example, right now anything entered in the DWC License field won't be included in the related fields in exports like Dublin Core, Schema.org and DataCite. How Dataverse metadata are mapped to fields in other standards would need to be adjusted, and then we would need to decide how to handle cases where someone uses CC-BY in the DWC License field and keeps CC0 in the Terms tab, whose fields Dataverse admins aren't able to hide/deselect.

jggautier commented 4 years ago

Could you write more about the metadata block being compatible and consistent with the original DarwinCore schema?

kamil386 commented 4 years ago

You're right - it's not possible right now to deselect or hide a subfield of a parent field, and I couldn't find a GitHub issue that requests this functionality, so perhaps we could open an issue?

It would be a great feature so I'll open an issue.

I could see why it would be important for this Darwin Core metadata block, since one parent field has 21 subfields, and another has 44, and a depositor could be overwhelmed by the number of fields that she may not need to be concerned about.

I couldn't describe it more clearly; that's why we need that functionality. We even created the DwC schema without groups and are considering using it as a workaround. What's more, even the scientists and researchers who will create the metadata haven't yet settled on the full range of DwC fields they will use. It seems to me that this is not an easy task with biodiversity data. I think that different data will need a different scope of DwC fields in dataverses. This can be seen, for example, across the many objects in the Darwin Core view of the NHM Data Portal.

(This structure is actually lost in some of Dataverse's metadata exports, which I consider a bug and I think is reported in other GitHub issues.)

Yes, it's lost, except in the JSON, OAI_ORE and Schema.org JSON-LD exports.

That is, maybe losing the relationship between parent and subfields won't be an issue if each dataset is only ever describing one "Occurrence" or one "Location." Does that make sense?

The schema with groups is necessary, because then we can easily add another group of fields with one click, and this hierarchy will be reflected in the metadata exports (that's why we set the workaround version without groups to not allow multiple values). We also need some changes in the UI, because right now Dataverse displays the fields of groups "flattened", e.g.: Geographic Coverage Country Province City Country City

But this is just my hunch after seeing how metadata is entered in a repository with a lot of "self-curated" datasets.

You're absolutely right, thanks for pointing that out. The DwC schema needs more work to handle this case, and any others that come up; we need to take care of the details. It's true that instructions for depositors won't be enough, and there would be a mismatch between these fields.

A more solid problem is that anything entered in the new Darwin Core fields isn't mapped to fields in the metadata that Dataverse exports.

The new DwC metadata block is included only in the JSON and OAI_ORE exports. Currently that's enough for us to build an additional tool on top of Dataverse, similar to NHM's. In the future it would be nice if DwC (which is an extension of Dublin Core) were included in the Dublin Core metadata export and others, especially for Google and other machines, which will predictably use that metadata in the future.
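For anyone who wants to check which exports carry the DwC fields, the sketch below builds URLs for the dataset export endpoint so the outputs can be fetched and compared. The exporter names `dataverse_json` and `OAI_ORE` and the `/api/datasets/export` path are assumptions about the native API of the installed version, and the persistent identifier is a made-up placeholder. Nothing is fetched here.

```python
from urllib.parse import urlencode


def build_export_url(base_url, exporter, persistent_id):
    """Build a metadata export URL for one dataset and one export format.

    Assumption: the installation serves exports at /api/datasets/export
    with `exporter` and `persistentId` query parameters.
    """
    params = urlencode({"exporter": exporter, "persistentId": persistent_id})
    return f"{base_url.rstrip('/')}/api/datasets/export?{params}"


# Placeholder DOI, for illustration only.
pid = "doi:10.5072/FK2/EXAMPLE"
export_urls = [
    build_export_url("http://localhost:8080", name, pid)
    for name in ("dataverse_json", "OAI_ORE")
]
for export_url in export_urls:
    print(export_url)
```

Fetching each URL and searching the response for a DwC field name is a quick way to confirm whether a given export includes the block.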

Could you write more about the metadata block being compatible and consistent with the original DarwinCore schema?

Olga Kurek copied all the fields from the DarwinCore schema (https://dwc.tdwg.org/) to an Excel spreadsheet, so it's a 1:1 match. Categories/classes are mapped as groups/parents in the TSV (in the "DwC schema with groups"). The only difference is that the dwc namespace is written without the colon character, because it's a restricted character in Solr.

jggautier commented 4 years ago

Thanks @kamil386! Is it right to say that you're proposing:

I think I would be okay with both of these things if this is planned to be a metadata block that's added to Dataverse's "standard" metadata blocks.

When you mentioned that the metadata was tested, did that mean tested to make sure it worked technically and didn't cause bugs or tested with depositors or both?

kamil386 commented 4 years ago

@jggautier Thanks for a good summary.

Yes, I'm proposing both, and I hope the second proposal won't break any compatibility in the future, but it's not a blocker for us and can wait. It will also require at least #6588 and preferably #6589; without these we'll need to revert to DwC without groups. Jim Myers recently told me a lot about metadata, and we need to change the DwC schema according to his instructions: https://groups.google.com/forum/#!topic/dataverse-community/uKretKox_io

BTW this probably won't be a problem anymore:

Only dwc namespace is without colon char, because it's restricted char in SOLR.

I tested it to make sure it worked technically and didn't cause bugs, but I can arrange some tests by scientists with real data.

pdurbin commented 1 year ago

@kamil386 we now list "experimental metadata" in the appendix of the user guide like this: https://guides.dataverse.org/en/5.12/user/appendix.html#experimental-metadata

Screen Shot 2022-10-10 at 10 39 49 AM

Are you interested in advertising the Darwin Core metadata block here? If so, would you like to make a pull request? Thanks.