gbif / registry

GBIF Registry
Apache License 2.0

Matching occurrences to inactive entities #448

Open MortenHofft opened 2 years ago

MortenHofft commented 2 years ago

https://hp-nhc.gbif-staging.org/institution/e87408d2-ac77-41d0-9547-bb81da3cb0e7

200K records are matched to an inactive institution that were transferred and merged into the zoological collections of the Finnish Museum of Natural History (MZH) in 2008.

So it should really be matched to https://hp-nhc.gbif-staging.org/institution/9e87321e-8dbc-460a-b553-9a68c2858b1d

Should we add a flag for cases where we match to inactive entities?

marcos-lg commented 2 years ago

I was checking some of the records and it matches to that institution because the code matches.

For example, this record https://www.gbif.org/occurrence/49639236 provides the code of the University of Helsinki, Department of Applied Biology. There is only one match for that code, so we don't check whether it's active. The lookup returns it as a doubtful match, since code matches aren't considered very trustworthy: https://api.gbif.org/v1/grscicoll/lookup?institutionCode=DABUH&verbose=true
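To reproduce that lookup from a script, something like the following should work. It is only a minimal sketch: the endpoint and the `institutionCode` and `verbose` parameters are taken from the URL above, while the helper names `build_lookup_url` and `fetch_lookup` are made up for illustration and the shape of the JSON response is not assumed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the lookup URL quoted in this thread.
LOOKUP_ENDPOINT = "https://api.gbif.org/v1/grscicoll/lookup"


def build_lookup_url(institution_code: str, verbose: bool = True) -> str:
    """Build the lookup URL for one institution code (hypothetical helper)."""
    params = {
        "institutionCode": institution_code,
        "verbose": str(verbose).lower(),  # the API URL above uses "true"
    }
    return f"{LOOKUP_ENDPOINT}?{urlencode(params)}"


def fetch_lookup(institution_code: str, timeout: float = 10.0) -> dict:
    """Call the lookup service and return the parsed JSON body as-is."""
    with urlopen(build_lookup_url(institution_code), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_lookup("DABUH")` needs network access and returns whatever the service sends back; `build_lookup_url("DABUH")` only rebuilds the URL shown above.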

If there were more than one match, the lookup would choose the active institution to disambiguate.

But maybe this is not the example you were talking about? If not, you can send me the record and I can take a closer look.
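Roughly, the behaviour described above can be sketched like this. This is not the registry's actual code: the `Institution` class, its fields and `match_by_code` are hypothetical names, and returning `None` when the active flag does not single out one candidate is an assumption, since that case isn't covered here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Institution:
    # Hypothetical, simplified stand-in for a registry institution.
    key: str
    code: str
    active: bool


def match_by_code(code: str, candidates: Iterable[Institution]) -> Optional[Institution]:
    """Pick an institution by code, using `active` only to break ties."""
    matches = [c for c in candidates if c.code == code]
    if not matches:
        return None
    if len(matches) == 1:
        # Single match: returned without looking at the active flag,
        # which is how an inactive institution can end up linked.
        return matches[0]
    # Several matches: prefer the active one to disambiguate.
    active = [m for m in matches if m.active]
    # Assumption: if zero or several are active, give up instead of guessing.
    return active[0] if len(active) == 1 else None
```

With a single inactive candidate for `DABUH`, the function returns that inactive institution, which is the situation reported in this issue.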

marcos-lg commented 2 years ago

What we probably should do in these cases is to merge the institutions.

MortenHofft commented 2 years ago

I agree. It looks like they should be merged. I just wondered if we should have a flag for cases where we link to inactive entities, simply because such a link suggests that something is wrong. But I'm not sure that is the case, because I'm not sure I understand how inactive is used.

So yes, I was imprecise. What I meant was:

ManonGros commented 2 years ago

I don't think the inactive should be deleted every time. Sometimes there is a case for an inactive collection or institution. For example, you might have records for specimens of collections that have since been lost or destroyed. Or you might want to emphasise the historical collections (even if they have since then been moved to bigger collections). We should leave it for the institutions to decide.

It would be best to avoid adding new flags. I find that publishers get very confused by the number of flags we already have. It is hard enough to explain to them how to link occurrences in the first place. Could there be a way to show an institution is inactive without more flags?

In any case, I have merged the institutions mentioned in the original issue. Thanks for the suggestion @MortenHofft!