gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Sum the number of citations from former datasets #4102

Open sylvain-morin opened 2 years ago

sylvain-morin commented 2 years ago

In some cases, we have to replace an existing dataset with another one.

We can keep track of this change, thanks to the "Deleted" and "Replaced by" annotations by GBIF.

For example, this one has been deleted https://www.gbif.org/dataset/7732b10e-7845-434a-b1a7-173e10f21d9a, and replaced by this one https://www.gbif.org/dataset/eed49c61-2085-46c4-8d4f-d2cda50fd404.

The former citations will lead to the "Deleted" dataset page, which is preserved, and we have the link to the new dataset page ("Replaced by"). So, former citations will still be valid - they don't lead to a 404.

However, one thing is missing for us: the number of citations is reset for the new dataset page.

For sure, we still have the previous number of citations on the "Deleted" dataset page.

What about summing the number of citations directly on the new dataset page?

GBIF already seems able to do a search like this (citations from the two datasets, the deleted one and the new one): https://www.gbif.org/resource/search?contentType=literature&gbifDatasetKey=7732b10e-7845-434a-b1a7-173e10f21d9a&gbifDatasetKey=eed49c61-2085-46c4-8d4f-d2cda50fd404
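The same multi-dataset query can presumably be issued against the public literature search API (`api.gbif.org/v1/literature/search`, which I assume backs the website search above; the endpoint path is an assumption, not confirmed in this thread). A minimal sketch that only builds the request URL with the repeated `gbifDatasetKey` parameter:

```python
# Sketch: build a literature-search query that covers both the deleted
# dataset and its replacement, so citations of either are returned.
# The endpoint path is an assumption mirroring the website search above.
from urllib.parse import urlencode

BASE = "https://api.gbif.org/v1/literature/search"

def literature_search_url(dataset_keys, limit=0):
    """Repeat gbifDatasetKey once per dataset; limit=0 asks for counts only."""
    params = [("gbifDatasetKey", k) for k in dataset_keys] + [("limit", limit)]
    return BASE + "?" + urlencode(params)

url = literature_search_url([
    "7732b10e-7845-434a-b1a7-173e10f21d9a",  # deleted dataset
    "eed49c61-2085-46c4-8d4f-d2cda50fd404",  # replacement dataset
])
print(url)
```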

Thanks.

MortenHofft commented 2 years ago

@ManonGros and @dnoesgaard will it always make sense to just look at the sum of citations when there is a duplicateOfDatasetKey? If so then we could do so in the literature API during processing?

dnoesgaard commented 2 years ago

Good question. I think I'd need to better understand how/when duplicateOfDatasetKey is used...

ahahn-gbif commented 2 years ago

Thanks for highlighting this!

We had discussed options when the reference to previous, deleted datasets was first established, and decided against simply merging the citation counts: an earlier version of a dataset that was then deleted and replaced can be quite different in content from a later one, and the citation was clearly on the earlier version. I think there is a related GitHub issue on this discussion, but I will still have to try and find it.

sylvain-morin commented 2 years ago

@ahahn-gbif probably this one: https://github.com/gbif/portal-feedback/issues/1506 (from 2018)

ahahn-gbif commented 2 years ago

That's the one - thanks!

duplicateOfDatasetKey is an attribute of a "dataset" in the registry that holds the UUID of another dataset entity that is maintained, while the one marked as duplicate-of is deindexed and virtually deleted. Several deleted datasets can reference the same retained one, but not the other way around (as in: dataset mergers or multiple competing deleted versions are supported, while dataset splits are not).
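The many-to-one shape described above can be illustrated with a hypothetical in-memory mapping; the reverse lookup (retained dataset → its deleted predecessors) is what a dataset page would need in order to find prior copies. All keys below are made up for illustration:

```python
# Sketch of the many-to-one constraint: several deleted datasets may point
# at one retained dataset via duplicateOfDatasetKey, never the reverse.
# The mapping is hypothetical sample data, not real registry content.
from collections import defaultdict

# deleted dataset key -> retained dataset key (duplicateOfDatasetKey)
duplicate_of = {
    "deleted-A": "retained-X",
    "deleted-B": "retained-X",  # merger: two deleted datasets, one survivor
    "deleted-C": "retained-Y",
}

def predecessors(retained_key, duplicate_of):
    """Invert the mapping: every deleted dataset replaced by retained_key."""
    index = defaultdict(list)
    for deleted, retained in duplicate_of.items():
        index[retained].append(deleted)
    return sorted(index[retained_key])

print(predecessors("retained-X", duplicate_of))  # ['deleted-A', 'deleted-B']
```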

Rather than fully merging citations of old and new versions of datasets, which I still think would be factually suspect, I would like to go back to exploring how to make citations of previous versions more easily accessible from the retained/present version, while maintaining transparency about which versions they concern.

MattBlissett commented 2 years ago

Also https://github.com/gbif/portal-feedback/issues/2834

ahahn-gbif commented 2 years ago

@dnoesgaard, what is your take on merging citations for different versions (including earlier subsets, or substantially reworked data publications) from the data citation viewpoint, please?

Each citation would hopefully still refer to the originally existing dataset and come with a date; only the scope of the data cited could be substantially different between versions. On the other hand, the same can be true for a single dataset that adds substantial amounts of records or widens its scope. Can we maintain reference accuracy, while making it easier for users (here: mostly dataset owners) to access all relevant citations of their data?

dnoesgaard commented 2 years ago

I don't have a complete overview of the scope or consequences of doing this. I'm obviously a bit reluctant but if it can be done in a way that maintains reference accuracy (as you put it), I'm not against it.

How often is duplicateOfDatasetKey used?

sylvain-morin commented 2 years ago

Let me clarify what I had in mind when I proposed to "sum the number of citations".


Here we have this deleted dataset: https://www.gbif.org/dataset/7732b10e-7845-434a-b1a7-173e10f21d9a

It has 26 citations: [screenshot]

I propose to keep this as it is: the 26 citations remain linked to this (deleted) dataset.


Here is the new dataset (linked through duplicateOfDatasetKey): https://www.gbif.org/dataset/eed49c61-2085-46c4-8d4f-d2cda50fd404

We can see it has 52 citations: [screenshot]


My proposal is not to merge the citations, just to add the 26 former citations to the 52 new ones in the top-right counter, so that it displays 78 citations instead: [screenshot]

When clicking the citation counter, we would still arrive at the same screen, but with the filter configured for both the old and the new dataset, like this: [screenshot]

On the left, we can see that 2 datasets are selected, so I think it is quite clear that each citation relates to one dataset (the former or the new one); they are not merged: [screenshot]

sylvain-morin commented 2 years ago

Another UX solution could be to display the cumulative count of citations separately: [screenshot]

The first button would open the list of citations related to the new dataset only, and the second button would display the citations of both datasets (former and new).

(As for the term "cumulated citations", I'm sure we can find something better. :)

CecSve commented 2 years ago

@MortenHofft what do you think of this suggestion? Should we perhaps add a label to keep the suggestion in mind for the new portal layout?

MortenHofft commented 2 years ago

I'm not a big fan of just adding another button. It can be quite cramped in the header already.

[screenshot: 2022-06-09 at 06:58:21]

I do think it makes sense to have in mind when revisiting dataset pages.

If this accumulated number is what we consider to be the new main number, then it should also be reflected in the activity tab on the citation card, and probably in the publisher counts as well. Which of course means more changes and more work.

Overall I think that if we consider the accumulated count to be the new main one we want to show, then it should be done on an API level (the API could still retain the pure counts of this exact dataset as well). Then it will also be reflected on publisher pages and reports.

If, on the other hand, we think of it more as a little extra information on dataset pages, then doing it in the UI with a few extra calls makes sense. But then I would prefer it presented as such: more like a comment/small note. I'm not sure how to do either visually.


But I also have a lot of questions:

  1. But then what about publishers? Should those counts accumulate as well? We currently have no way to do so.
  2. How to do it? Even for datasets there is no API to do so. We have no way to navigate from the remaining dataset to its previous version, so I suspect we need a new API for that, or rather a new field on datasets that lists the previous versions.
  3. Are the APIs functional? https://registry-api.gbif.org/dataset/duplicate I do not see either of the datasets the question is about in that list: https://www.gbif.org/dataset/7732b10e-7845-434a-b1a7-173e10f21d9a | https://www.gbif.org/dataset/eed49c61-2085-46c4-8d4f-d2cda50fd404
  4. Is the data maintained? This one is circular: A is a duplicate of B and B is a duplicate of A.
  5. What happens when dataset A is replaced by B, which is in turn replaced by C? In those cases I guess we should accumulate A, B and C?
ahahn-gbif commented 2 years ago

The current situation: When a dataset is identified as a duplicate of another, we, in coordination with the publisher(s), identify the one of the two that should be kept, and mark the other one as the duplicate-of. This relationship is established in the registry, and the duplicate dataset will be automatically deleted.

As I understand it, we have two main goals here:

I would pose that citations reported for a given dataset with its own DOI should really relate to that dataset alone. I would prefer finding a way to give access to the deleted datasets or their citation counts, possibly not on the main tab to prevent cluttering, but e.g. on the metrics tab under a "prior copies of this dataset" section or similar.

  1. But then what about publishers? Should those counts accumulate as well? We currently have no way to do so.

Good point. For dataset incarnation-chains, we should then also accumulate at the publisher level, to avoid inconsistencies in counts. For publishers who withdraw a dataset altogether, though, we do not do this, and we warn that citation links need to be maintained by them. Those non-replaced dataset citations should then also not figure in the per-publisher count.

  2. How to do it? Even for datasets there is no API to do so. We have no way to navigate from the remaining dataset to its previous version, so I suspect we need a new API for that, or rather a new field on datasets that lists the previous versions.

(skipping)

  3. Are the APIs functional? https://registry-api.gbif.org/dataset/duplicate I do not see either of the datasets the question is about in that list: https://www.gbif.org/dataset/7732b10e-7845-434a-b1a7-173e10f21d9a | https://www.gbif.org/dataset/eed49c61-2085-46c4-8d4f-d2cda50fd404

I cannot find any of the UUIDs either, and yet the deleted duplicate https://www.gbif.org/dataset/7732b10e-7845-434a-b1a7-173e10f21d9a shows, in the header, the link to the surviving dataset ("Replaced by"). I would assume that this comes from an API call, but I don't know where to look, sorry.

  4. Is the data maintained? This one is circular: A is a duplicate of B and B is a duplicate of A.

Depends on what you mean by maintained. The relationship is typically established internally through the registry UI, which enforces deletion of the duplicate record and prevents further editing, so that circular relationships should not be possible. It might need checking whether establishing this relationship is also accessible to authorized API users though, and whether in this case deletion and the block of further editing is likewise enforced.

  5. What happens when dataset A is replaced by B, which is in turn replaced by C? In those cases I guess we should accumulate A, B and C?

I guess the same, yes.
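An A → B → C chain could then be accumulated by walking the duplicateOfDatasetKey links backwards from the survivor, with a guard against the circular case raised in question 4. A sketch, assuming both mappings would in practice come from the registry and literature APIs; the data below is made up, except that 26 and 52 mirror the counts of the example datasets in this thread:

```python
# Sketch: accumulate citation counts along a replacement chain A -> B -> C,
# guarding against circular references. Both dicts are hypothetical
# in-memory stand-ins for registry and literature API lookups.
duplicate_of = {"A": "B", "B": "C"}   # deleted -> replacement
citations = {"A": 26, "B": 10, "C": 52}

def accumulated_citations(key, duplicate_of, citations):
    """Sum citations of `key` and every dataset it transitively replaced."""
    # invert deleted->replacement so we can walk backwards from the survivor
    replaced_by_me = {}
    for deleted, retained in duplicate_of.items():
        replaced_by_me.setdefault(retained, []).append(deleted)
    total, seen, stack = 0, set(), [key]
    while stack:
        k = stack.pop()
        if k in seen:          # cycle guard: skip already-counted datasets
            continue
        seen.add(k)
        total += citations.get(k, 0)
        stack.extend(replaced_by_me.get(k, []))
    return total

print(accumulated_citations("C", duplicate_of, citations))  # 88 (52 + 10 + 26)
```

With a single-step chain and the thread's real numbers, the survivor would show 26 + 52 = 78, matching the counter proposed earlier.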

CecSve commented 2 years ago

and the duplicate dataset will be automatically deleted.

Or rather, we manually delete the duplicate after we have registered it as a duplicate.

MattBlissett commented 2 years ago

3. Are the APIs functional? https://registry-api.gbif.org/dataset/duplicate I do not see either of the datasets the question is about in that list: https://www.gbif.org/dataset/7732b10e-7845-434a-b1a7-173e10f21d9a | https://www.gbif.org/dataset/eed49c61-2085-46c4-8d4f-d2cda50fd404

I cannot find any of the UUID either, and yet, the deleted duplicate https://www.gbif.org/dataset/7732b10e-7845-434a-b1a7-173e10f21d9a, in the header, shows the link to the surviving dataset ("Replaced By"). I would assume that this is an API call, but don't know enough where to look, sorry.

https://api.gbif.org/v1/dataset/7732b10e-7845-434a-b1a7-173e10f21d9a has duplicateOfDatasetKey as the second property.

https://api.gbif.org/v1/dataset/duplicate / https://registry.gbif.org/dataset/search?type=duplicate shows non-deleted datasets with duplicateOfDatasetKey set. At least https://www.gbif.org/dataset/85fe49f4-f762-11e1-a439-00145eb45e9a / https://www.gbif.org/dataset/ca4ef53b-20ea-4667-a2da-a9f1635efc67 is a case where the relationship has been made the wrong way around.
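That wrong-way-around case suggests a simple consistency check: any dataset that carries duplicateOfDatasetKey but is not itself deleted has the relationship inverted. A sketch over hypothetical, trimmed-down records of the general shape the dataset API returns (field names beyond `key`, `duplicateOfDatasetKey` and `deleted` are omitted):

```python
# Sketch: flag datasets marked as a duplicate of another yet not deleted,
# i.e. relationships made the wrong way around. The records are made-up
# stand-ins for trimmed responses from api.gbif.org/v1/dataset/{key}.
def misdirected_duplicates(records):
    """Return keys of live records that carry duplicateOfDatasetKey."""
    return [r["key"] for r in records
            if r.get("duplicateOfDatasetKey") and not r.get("deleted")]

records = [
    {"key": "dead-ok", "duplicateOfDatasetKey": "live", "deleted": "2020-01-01"},
    {"key": "wrong-way", "duplicateOfDatasetKey": "other"},  # live, yet marked
    {"key": "plain-live"},
]
print(misdirected_duplicates(records))  # ['wrong-way']
```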