DataONE results have duplicates because I'm not showing only the latest version of a data set

POLDER-Crew / polder-federated-search

A federated search project for POLDER.

BSD 3-Clause "New" or "Revised" License


DataONE results have duplicates because I'm not showing only the latest version of a data set #176

Open yemoski opened 1 year ago

yemoski commented 1 year ago

Search for 'temperature' to see what I mean.

yemoski commented 1 year ago

I'm not sure what, if anything, we can do about this.

yemoski commented 1 year ago

There's the 'obsoletes' field, but none of the DOIs that show up in there are in this result set, which means that the obsoletedBy clause is doing its job.

yemoski commented 1 year ago

Raw query for this: https://search.dataone.org/cn/v2/query/solr/?start=0&fq=(northBoundCoord:[50%20TO%20*]%20OR%20southBoundCoord:[*%20TO%20-50])%20AND%20-obsoletedBy:*&q=temperature&wt=json&fl=*,score
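
For reference, the same query can be assembled programmatically. This is a minimal sketch using only the Python standard library; `build_query_url` is a hypothetical helper name (not something from this repo), and the endpoint and parameters are copied from the URL above.

```python
from urllib.parse import urlencode, quote

# Endpoint copied from the raw query above.
DATAONE_SOLR = "https://search.dataone.org/cn/v2/query/solr/"

# Polar bounding filter taken from the raw query.
POLAR_FILTER = "(northBoundCoord:[50 TO *] OR southBoundCoord:[* TO -50])"


def build_query_url(text, start=0, latest_only=True):
    """Hypothetical helper: build the Solr query URL for a search term."""
    fq = POLAR_FILTER
    if latest_only:
        # Exclude any record that has been obsoleted by a newer version.
        fq += " AND -obsoletedBy:*"
    params = {"start": start, "fq": fq, "q": text, "wt": "json", "fl": "*,score"}
    # quote_via=quote encodes spaces as %20; other reserved characters are
    # percent-encoded as well, which decodes to the same query string.
    return DATAONE_SOLR + "?" + urlencode(params, quote_via=quote)


print(build_query_url("temperature"))
```

Passing `latest_only=False` drops the `-obsoletedBy:*` clause, which makes it easy to compare the two result sets side by side.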

yemoski commented 1 year ago

The three results with the title "Historic air temperatures in Alaska for 1901-2015, with spatial subsetting by region" are a good example of this.

They have three separate DOIs, but on their landing pages, two of them point to the most recent one with the text "A newer version of this dataset exists. View it now."
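
One way to spot cases like this in a result set is to group the returned documents by title and look for titles that map to more than one identifier. This is a rough sketch, assuming each result is a dict with `id` and `title` keys as in the Solr JSON response; `find_title_collisions` is a hypothetical name, and the sample identifiers and titles are made up for illustration. A shared title is only a hint, not proof that two records are versions of the same dataset.

```python
from collections import defaultdict


def find_title_collisions(docs):
    """Return {normalized title: [ids]} for titles shared by several ids."""
    by_title = defaultdict(list)
    for doc in docs:
        # Normalize whitespace and case so near-identical titles match.
        key = " ".join(doc.get("title", "").split()).lower()
        by_title[key].append(doc["id"])
    return {title: ids for title, ids in by_title.items() if len(ids) > 1}


# Made-up sample documents, not real DataONE records.
sample = [
    {"id": "doi:10.0000/example.1", "title": "Historic air temperatures in Alaska"},
    {"id": "doi:10.0000/example.2", "title": " Historic air temperatures in Alaska"},
    {"id": "doi:10.0000/example.3", "title": "Sea ice extent"},
]
print(find_title_collisions(sample))
```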

yemoski commented 1 year ago

The way DataONE solves this is to distinguish the Persistent Identifier (PID), which maps to a specific content-immutable version of a file or package, from the Series Identifier (SID), which maps to the most recent version in a chain of versions. There are more details in the DataONE API docs.

When they harvest from a schema.org (SO) provider, they use a checksum of the canonicalized JSON-LD as the PID, and the provided dc:identifier as the SID. When the repository modifies a record, that results in a new checksum (and so a new PID), and they then update the SID to point at that most recent version. This allows them to maintain the version history of all objects from the schema.org harvests, while also directing search results to only the most recent published version.

(via Matt Jones)
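
To make the PID/SID idea concrete, here is a toy sketch of the bookkeeping described above. It is not DataONE's implementation: sorted-key JSON stands in for their JSON-LD canonicalization, SHA-256 is an assumed checksum algorithm, the `identifier` key stands in for the provided dc:identifier, and `compute_pid` and `SeriesIndex` are hypothetical names.

```python
import hashlib
import json


def compute_pid(record):
    """Checksum a canonical serialization of the record to use as the PID.

    Assumption: sorted-key, compact JSON stands in for whatever JSON-LD
    canonicalization DataONE actually applies.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SeriesIndex:
    """Hypothetical in-memory index: SID -> ordered list of PIDs."""

    def __init__(self):
        self.versions = {}

    def harvest(self, record):
        sid = record["identifier"]  # stands in for the provided dc:identifier
        pid = compute_pid(record)
        chain = self.versions.setdefault(sid, [])
        if not chain or chain[-1] != pid:
            # Content changed: record the new PID; the SID now resolves to it.
            chain.append(pid)
        return pid

    def latest(self, sid):
        """Return the PID the SID currently points at."""
        return self.versions[sid][-1]
```

Harvesting an unchanged record leaves the chain alone, while harvesting a modified record appends a new PID, so `latest(sid)` always returns the most recent version and the earlier PIDs remain available as history.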

yemoski commented 1 year ago

That solution from Matt went a long way but didn't fix 100% of the cases.

yemoski commented 1 year ago

From Matt Jones: "one easy way to do this is to add +AND+-obsoletedBy:* to your solr query"

yemoski commented 1 year ago

Another search, from someone who wrote in:

When I enter the search term “Andrill” I’m returned 122 results. The first page shows the first 50 results, but I can’t see the option to tab to the next page of results.

I’ve also noticed that there are duplicate results being returned. I expected this search to return 91 results (my understanding is that all records from this project are held at Pangaea, but I could be wrong).
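
On the paging point: Solr exposes `start` and `rows` query parameters, so a result set of 122 at 50 per page needs three requests. A small sketch of that arithmetic; `page_starts` is a hypothetical helper, and the page size of 50 is taken from the report above.

```python
def page_starts(total, rows=50):
    """Return the `start` offset for each page of a result set."""
    return list(range(0, total, rows))


print(page_starts(122))  # [0, 50, 100]
```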