POLDER-Crew / polder-federated-search

A federated search project for POLDER.
BSD 3-Clause "New" or "Revised" License

Avoid/identify/minimise duplicates #147

Open yemoski opened 1 year ago

yemoski commented 1 year ago

This is REALLY HARD TO DO.

yemoski commented 1 year ago

Here's an interesting conundrum: because licensing info isn't available through DataONE yet, we actually get more information by indexing g-e-m ourselves than we will once it comes through DataONE. What do we do with that? Let them index it anyway and be happy when licensing makes it in? Or somehow exclude g-e-m from DataONE queries?

yemoski commented 1 year ago

This is also a problem within the DataONE data itself, because they have duplicates too! Should I consider records with the same DOI to be the same thing and collapse them? I think, for the purposes of this tool, that might be a good move.
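A rough sketch of what collapsing by DOI could look like, assuming each record is a dict with an optional `doi` key. The record shape, the helper names `normalize_doi` and `collapse_by_doi`, and the sample DOIs are all made up for illustration and are not taken from the project. DOIs are case-insensitive and show up with different prefixes, so the sketch normalizes them before comparing.

```python
import re

# Hypothetical helper: prefixes commonly seen in front of a DOI.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:)\s*", re.IGNORECASE)


def normalize_doi(raw):
    """Return a lowercased bare DOI, or None if the value is empty."""
    if not raw:
        return None
    doi = _DOI_PREFIX.sub("", raw.strip()).lower()
    return doi or None


def collapse_by_doi(records):
    """Keep the first record seen for each DOI.

    Records without a DOI are all kept, since there is nothing to match on.
    """
    seen = set()
    collapsed = []
    for record in records:
        doi = normalize_doi(record.get("doi"))
        if doi is None:
            collapsed.append(record)
            continue
        if doi not in seen:
            seen.add(doi)
            collapsed.append(record)
    return collapsed


# Made-up sample data: the first two entries are the same DOI written two ways.
records = [
    {"title": "Sea ice thickness", "doi": "doi:10.1234/ABC"},
    {"title": "Sea ice thickness (copy)", "doi": "https://doi.org/10.1234/abc"},
    {"title": "No DOI dataset", "doi": None},
]
unique = collapse_by_doi(records)
```

Keeping the first record is an arbitrary choice here; a real version might want to keep whichever copy has the most metadata.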

yemoski commented 1 year ago

I've started by removing duplicates in individual queries, but the broader problem of removing duplicates across the federated search remains.

yemoski commented 1 year ago

We can go some way towards this by making a SearchResult class that's hashable, so that results from different sources can be compared and deduplicated with sets or dict keys.
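A minimal sketch of that idea, assuming the DOI is the identity of a result and that it has already been normalized. The field names, the `merge_results` helper, and the sample values are assumptions for illustration, not the project's actual class. Only `doi` takes part in equality and hashing, so two results for the same dataset from different sources count as the same thing. When duplicates collide, this sketch keeps the copy that has licensing info, which is one possible answer to the g-e-m vs. DataONE question above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SearchResult:
    # Only `doi` is used for __eq__ and __hash__; the other fields
    # are excluded with compare=False.
    doi: str
    title: str = field(compare=False, default="")
    source: str = field(compare=False, default="")
    license: Optional[str] = field(compare=False, default=None)


def merge_results(*result_lists):
    """Merge result lists, keeping one SearchResult per DOI.

    Order follows first appearance. If a later duplicate has a license
    and the stored one does not, the later one replaces it.
    """
    merged = {}
    for results in result_lists:
        for result in results:
            existing = merged.get(result)
            if existing is None or (
                existing.license is None and result.license is not None
            ):
                merged[result] = result
    return list(merged.values())


# Made-up sample data: the same dataset seen through two sources.
dataone = [SearchResult(doi="10.1234/abc", title="Snow depth", source="DataONE")]
gem = [
    SearchResult(
        doi="10.1234/abc", title="Snow depth", source="g-e-m", license="CC-BY-4.0"
    )
]
merged = merge_results(dataone, gem)
```

Results without a DOI would need some other identity (for example a source-specific identifier), which this sketch doesn't try to handle.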