clarin-eric / VLO

Virtual Language Observatory
GNU General Public License v3.0

Preferring original resources in presentation of record duplicates #316

Open teckart opened 3 years ago

teckart commented 3 years ago

In cases where the VLO importer identifies record duplicates (currently based on name and language), the record presented on the search page might not be the one from the resource owner, but another record provided by an external catalogue. Ways to reduce this behaviour have to be evaluated and implemented.

Example: "Arabic Speech Corpus" OTA vs. ELRA

Helpful links:

teckart commented 3 years ago

The Solr collapsing mechanism provides min/max/sort parameters to select a group's head. We could create an (optional) index field that indicates the preference for a specific resource based on its origin and use it in the query, but it is still unclear what information we would base that on. We could, for example, maintain a list of endpoints that are mostly "aggregators" (of external resources) and downvote them, but this would mean additional configuration and maintenance and would be somewhat arbitrary in some cases (like LINDAT's "LRT inventory"). The same might apply when preferring one dataProvider over others.
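To make the idea concrete, here is a minimal sketch of how such a filter query could be assembled. It only builds the query string for Solr's collapsing query parser; the field names `duplicate_signature` and `origin_preference` are hypothetical placeholders and not the fields VLO actually uses.

```python
from urllib.parse import urlencode


def collapse_filter(group_field, preference_field=None):
    """Build a filter query for Solr's collapsing query parser.

    Both field names are passed in by the caller; the ones used in the
    example below are hypothetical and not taken from the VLO schema.
    """
    if preference_field is None:
        # Without min/max/sort, the highest-scoring document heads each group.
        return "{!collapse field=%s}" % group_field
    # 'max' makes the document with the largest field value the group head.
    return "{!collapse field=%s max=%s}" % (group_field, preference_field)


# Example request parameters (hypothetical field names).
params = urlencode({
    "q": "arabic speech corpus",
    "fq": collapse_filter("duplicate_signature", "origin_preference"),
})
```

Note that selecting the head with `max` alone replaces the score-based selection, so relevance no longer influences which duplicate is shown.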

twagoo commented 3 years ago

Something to keep in mind: we already have boosts in place for things like availability, presence of description, position in hierarchy (see solrconfig.xml) that now help determine the group's head. By default the selection takes into account relevance with respect to the query as well.

We will have to carefully decide whether we want to add logic 'on top' of this, or have a completely separate policy for the selection of the head. I don't have a clear preference right now but we have to make sure that we don't inadvertently discard a useful ranking mechanism.
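The difference between the two options can be illustrated with a small simulation outside of Solr. This is only a sketch under assumptions: the `origin_preference` flag, the boost factor of 1.2 and the scores are made up for illustration, and the real boosts in solrconfig.xml are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Record:
    provider: str           # hypothetical label, not a real endpoint
    score: float            # relevance score including the existing boosts
    origin_preference: int  # assumed flag: 1 = resource owner, 0 = aggregator


def head_separate_policy(group):
    """Separate policy: origin decides, score is only a tie-breaker."""
    return max(group, key=lambda r: (r.origin_preference, r.score))


def head_on_top(group, origin_boost=1.2):
    """'On top' policy: origin acts as one more boost on the existing score."""
    def boosted(r):
        return r.score * (origin_boost if r.origin_preference else 1.0)
    return max(group, key=boosted)


# Two duplicates of one record with made-up scores.
group = [
    Record("aggregator", score=9.0, origin_preference=0),
    Record("owner", score=5.0, origin_preference=1),
]

separate = head_separate_policy(group)
on_top = head_on_top(group)
```

With these made-up numbers the separate policy always picks the owner's record, while the boost-based variant still picks the aggregator's record because its existing score outweighs the origin boost. That is the trade-off in a nutshell: a separate policy guarantees the preference but ignores the current ranking, whereas a boost keeps the ranking but gives no guarantee.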