gbif / ipt

GBIF Integrated Publishing Toolkit (IPT)
https://www.gbif.org/ipt
Apache License 2.0

Make the metadata projectID field multivalue #1927

Open marcos-lg opened 1 year ago

marcos-lg commented 1 year ago

This suggestion comes from this issue: https://github.com/gbif/pipelines/issues/836#issuecomment-1373986330

This change implies modifying the dataset API in the registry to return a list of projects instead of the single project it returns now. We should plan it in advance and let users know. Pipelines also has to be modified, otherwise it will break. We should also check whether it breaks other projects that use the dataset API.
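To illustrate the compatibility concern, here is a minimal sketch of a client-side reader that tolerates both shapes during a transition: a single project object (the current behaviour) or a list of projects (the proposed one). The field names `project` and `identifier` and the function name `extract_project_ids` are assumptions for illustration, not a confirmed description of the registry response.

```python
from typing import Any, Dict, List


def extract_project_ids(dataset: Dict[str, Any]) -> List[str]:
    """Return project identifiers from a dataset record.

    Accepts either a single project object or a list of them.
    The keys "project" and "identifier" are assumed for illustration.
    """
    project = dataset.get("project")
    if project is None:
        return []
    # Normalise the single-object shape to a one-element list.
    projects = project if isinstance(project, list) else [project]
    ids: List[str] = []
    for entry in projects:
        identifier = entry.get("identifier") if isinstance(entry, dict) else None
        if identifier:
            ids.append(identifier)
    return ids


old_shape = {"project": {"identifier": "EXAMPLE-1"}}
new_shape = {"project": [{"identifier": "EXAMPLE-1"}, {"identifier": "EXAMPLE-2"}]}
old_ids = extract_project_ids(old_shape)
new_ids = extract_project_ids(new_shape)
```

A consumer written this way would keep working before and after the change, which may reduce the coordination needed between the registry and downstream projects.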

camiplata commented 1 year ago

@marcos-lg Thank you for creating this issue. Some context about this need:

We have many publishers that update their datasets almost yearly, in particular biological collections and monitoring projects. Data from these sources have been financed under multiple projects across the years, but due to the nature of the data, keeping only one dataset is desirable. Thus, we came across the need for a better way of tracking these multiple financial sources and projects in the metadata.

One idea that came from working with collections under the BID projects was to have a multivalue projectID at the occurrence level, an idea that can be extended to the metadata level.
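As a rough sketch of what a multivalue projectID at the occurrence level could look like to a consumer, the snippet below splits one delimited string into separate identifiers. It assumes the values are separated by a vertical bar, following the common Darwin Core recommendation for list-valued terms; the function name and the example IDs are hypothetical.

```python
from typing import List, Optional


def split_project_ids(raw: Optional[str], separator: str = "|") -> List[str]:
    """Split a delimited projectID string into a de-duplicated, ordered list."""
    if not raw:
        return []
    ids: List[str] = []
    for part in raw.split(separator):
        value = part.strip()
        # Skip empty fragments and repeated identifiers.
        if value and value not in ids:
            ids.append(value)
    return ids


ids = split_project_ids("EXAMPLE-1 | EXAMPLE-2 | EXAMPLE-1")
```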

Nevertheless, after thinking further about the actual need at the metadata level, a possible solution could be a repeatable project data section, like taxonomic coverage, where you can add multiple coverages. Adding multiple projects can be more transparent, as it allows documenting all the details of any given project, including its specific ID.

I would like to know your opinions about this.

ahahn-gbif commented 1 year ago

Just a point for consideration: we are somewhat limited by the options that the EML metadata schema allows us (https://eml.ecoinformatics.org/schema/index.html, https://eml.ecoinformatics.org/eml-schema.html). I may be a bit rusty in reading this, but it does not seem as if the <project> module is repeatable. Unless I am misreading this?

EML 2.2.0 did add, amongst other things, an <award> element to support structured funding information within the <project> module. Combining use of that with a multivalue projectID does sound risky though.

MattBlissett commented 11 months ago

The project element is limited to one, though it includes a relatedProject element, described as "This field is a recursive link to another project. This allows projects to be nested under one another for the case where one project spawns another.", and you can have as many of those as you like. Do you think that would be appropriate?

https://eml.ecoinformatics.org/schema/eml-dataset_xsd.html#DatasetType_project
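A minimal sketch of how a consumer might read nested projects: it walks a `project` element and its `relatedProject` children recursively and collects their IDs. The XML is a simplified, non-schema-valid fragment (no namespaces, no `personnel`), the IDs are made up, and carrying the project ID in an `id` attribute is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified fragment; real EML has namespaces and more required children.
EML_SNIPPET = """
<dataset>
  <project id="EXAMPLE-MAIN">
    <title>Main project</title>
    <relatedProject id="EXAMPLE-FUNDER-A">
      <title>Earlier funding round</title>
    </relatedProject>
    <relatedProject id="EXAMPLE-FUNDER-B">
      <title>Workshop project</title>
      <relatedProject id="EXAMPLE-NESTED">
        <title>Nested under the workshop project</title>
      </relatedProject>
    </relatedProject>
  </project>
</dataset>
"""


def collect_project_ids(project):
    """Return the ID of this project followed by those of all nested relatedProject elements (depth-first)."""
    ids = []
    project_id = project.get("id")
    if project_id:
        ids.append(project_id)
    for related in project.findall("relatedProject"):
        ids.extend(collect_project_ids(related))
    return ids


root = ET.fromstring(EML_SNIPPET)
top_project = root.find("project")
all_ids = collect_project_ids(top_project) if top_project is not None else []
```

Note that the traversal flattens the hierarchy, so any policy on which projectID takes precedence would still have to be decided separately.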

Relevant for https://github.com/gbif/eml-profile/issues/5

dbloom commented 11 months ago

@MattBlissett No solutions here, just some practical input. This addition to the IPT would be most welcome. As a BID trainer/mentor I run into this issue in every workshop, and occasionally in the standard publishing pipeline as well. The most recent case was a resource from Vietnam published during the recent Asia Mobilization Workshop: https://cloud.gbif.org/asia/resource?r=vietnam200endemicorendangeredplantspecies

You will see that the projectID is actually a funder ID of some sort (ĐTĐL.CN-58/19), as this publisher was obliged to recognize that funder. The best solution the Asia Helpdesk could suggest was to add the other projectID (I assume the BID or workshop projectID) to the metadata as an alternative identifier. Not a bad solution, but certainly not ideal: I do not know whether these IDs will be searchable in all locations. The BID/workshop ID was given by LARussell and was something like EV-AsiaMobilization2023. I don't remember exactly, I don't see it in the resource metadata, and I am not an admin on the Asia Cloud IPT, so I can't look.

In any case, if the IDs can be nested so that everyone with a stake in the project can be recognized, I think that would be a good solution. We would need a GBIF policy/recommendation that describes which projectID should take precedence for the top slot and how nesting should be prioritized. Or, perhaps the "award" element could be applied in some manner, per @ahahn-gbif's thinking above. Regardless, I will welcome this update.

ahahn-gbif commented 11 months ago

At present, GBIF uses the projectID internally for projects with GBIF-mediated funding (BID, BIFA & Co). In these cases, the projectID is required to link the deliverable (dataset) to the project page, e.g. here. Project pages are only ever generated for projects with GBIF-mediated funding. For such datasets, we make it mandatory to list the BID/BIFA/other projectID as the one and only entry in that field. Apart from this, a facet filter allows a general search for projectIDs, including outside of the funded-project context.

Using the relatedProject nesting could be a pragmatic solution for most cases, even though real-life cases will not necessarily fall within the "spawned other" definition that @MattBlissett cites above. There will inevitably be datasets that received funding through different projects, or through more than one agency in parallel, which makes any hierarchy contestable. This decision will also need to involve UI considerations on the side of GBIF.

If we do want to allow crediting different funders, I still think adding the awardType complex type might be the better choice.
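For comparison, a minimal sketch of crediting several funders through repeated `award` elements inside a single `project`. The child element names (`funderName`, `awardNumber`, `title`) reflect a reading of EML 2.2.0 and should be checked against the schema; the fragment omits other required `project` children such as `personnel`, and all names and numbers are placeholders.

```python
import xml.etree.ElementTree as ET


def build_project(title, awards):
    """Build a simplified project element with one award child per funding source."""
    project = ET.Element("project")
    ET.SubElement(project, "title").text = title
    for entry in awards:
        award = ET.SubElement(project, "award")
        ET.SubElement(award, "funderName").text = entry["funder_name"]
        # The award number is treated as optional in this sketch.
        if entry.get("award_number"):
            ET.SubElement(award, "awardNumber").text = entry["award_number"]
        ET.SubElement(award, "title").text = entry["title"]
    return project


project = build_project(
    "Example collection digitisation",
    [
        {"funder_name": "Example Funder A", "award_number": "EXAMPLE-001", "title": "First grant"},
        {"funder_name": "Example Funder B", "title": "Second grant"},
    ],
)
xml_text = ET.tostring(project, encoding="unicode")
```

Under this approach the single projectID field stays untouched, and each funding source is recorded as its own award rather than competing for a position in a project hierarchy.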

dbloom commented 11 months ago

@ahahn-gbif I think I agree. I regularly get questions about how to include funders. I know it has been a concern for funders such as the JRS Foundation for quite a few years. I realize these cases exist in only a small subset of all the datasets published to GBIF, but I think the flexibility is important.

@MattBlissett I hope all of this is helpful in some way. If you would like other perspectives, I could probably rally some.

RicardoOrtizG commented 11 months ago

I agree with @dbloom, it's a constant demand from publishers that became more evident during the BID projects, especially in collections. We implemented a partial solution involving a metadata-only resource, which generates confusion and makes other publishers more inclined to publish only metadata and no data.

I think the idea from @MattBlissett covers part of the problem but, as @ahahn-gbif says, the most common issue is that collections can be funded by many projects. We have also thought about the possibility of a "DwC project extension" that points to a "projectID" element in the occurrence core.