Avoid duplicates in search interfaces

matthiaspalmer commented 10 months ago

In situations where data portals choose to make datasets and data services searchable in the same UI (since keeping them separate may prove to be bad UX), there is a risk that the search results appear to contain duplicates.

The reason for such perceived duplicates is that some data services serve a single dataset and therefore have the same (or at least a very similar) title. In addition, such data services often have limited metadata, as they are perceived as a technical construction that should be viewed in the context of the dataset they serve. For instance, such a data service may lack a publisher, a good description, license information, etc.

We suggest introducing the terminology "dependent" and "independent" data service, where a dependent data service can be detected based on any of the following criteria:

1. Referred to from a single dataset in the same catalog (via its distributions)
2. Has a single dcat:servesDataset relation
3. No relation is expressed from the surrounding catalog (i.e. missing dcat:service)
4. No explicit publisher is given on the data service

Today the Swedish data portal filters out dependent data services based on criteria 1 and 4. The data portal of Sachsen does the same but uses only criterion 1.

Note that both data portals still provide a way to see the dependent data services by following links from the datasets they serve (and also provide some information already on the dataset page). It is only from the search interface that the dependent data services have been excluded.
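To make the four criteria concrete, here is a minimal sketch of how a portal might evaluate them. Everything in it is an assumption for illustration: the `DataService`, `Dataset` and `Catalog` classes are a simplified in-memory stand-in for the RDF metadata, the `ex:` URIs are invented, and the choice to combine the selected criteria with `all` (every selected criterion must hold) is only one reading of "based on criteria 1 and 4"; pass `combine=any` for the other reading. It is not the actual implementation of either portal.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Set


@dataclass
class DataService:
    uri: str
    serves_datasets: List[str] = field(default_factory=list)  # dcat:servesDataset targets
    publisher: Optional[str] = None                           # publisher, if explicitly given


@dataclass
class Dataset:
    uri: str
    # Services reachable through the dataset's distributions.
    access_services: Set[str] = field(default_factory=set)


@dataclass
class Catalog:
    datasets: List[Dataset] = field(default_factory=list)
    listed_services: Set[str] = field(default_factory=set)    # dcat:service targets


def referred_from_single_dataset(s: DataService, c: Catalog) -> bool:
    """Criterion 1: exactly one dataset in the catalog points at the service."""
    return sum(s.uri in d.access_services for d in c.datasets) == 1


def serves_single_dataset(s: DataService, c: Catalog) -> bool:
    """Criterion 2: the service has a single dcat:servesDataset relation."""
    return len(s.serves_datasets) == 1


def not_listed_in_catalog(s: DataService, c: Catalog) -> bool:
    """Criterion 3: the catalog has no dcat:service relation to the service."""
    return s.uri not in c.listed_services


def has_no_publisher(s: DataService, c: Catalog) -> bool:
    """Criterion 4: no explicit publisher is given on the service."""
    return s.publisher is None


CRITERIA = {
    1: referred_from_single_dataset,
    2: serves_single_dataset,
    3: not_listed_in_catalog,
    4: has_no_publisher,
}


def is_dependent(
    service: DataService,
    catalog: Catalog,
    criteria: Iterable[int],
    combine: Callable[[Iterable[bool]], bool] = all,  # assumption: all selected criteria must hold
) -> bool:
    return combine(CRITERIA[n](service, catalog) for n in criteria)


# Hypothetical catalog: one service tied to a single dataset, one shared service.
catalog = Catalog(
    datasets=[
        Dataset("ex:ds1", {"ex:api1"}),
        Dataset("ex:ds2", {"ex:shared"}),
        Dataset("ex:ds3", {"ex:shared"}),
    ],
    listed_services={"ex:shared"},
)
api1 = DataService("ex:api1", serves_datasets=["ex:ds1"])
shared = DataService("ex:shared", ["ex:ds2", "ex:ds3"], publisher="ex:agency")

# Keep only independent services in the search index, using criteria 1 and 4.
searchable = [s.uri for s in (api1, shared) if not is_dependent(s, catalog, criteria=(1, 4))]
```

Keeping the criteria as separate named checks is deliberate: two portals that select different subsets (or combine them differently) will classify the same service differently, which is exactly the kind of deviation a shared definition would remove.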

We think that it makes sense to include this categorization in the specification, as it is likely that other data portals will want to implement similar behavior, and slight deviations in what constitutes a "dependent" data service may cause confusion with regard to what shows up and what is suppressed. This is especially relevant since it also affects catalog providers' ambitions with regard to how much metadata is suitable to express on "dependent" data services.

jakubklimek commented 10 months ago

Do you suggest just introducing the terminology, or using it somewhere as well?

To be honest, to me this seems like an implementation choice for data portal implementers, and therefore out of scope of DCAT-AP. Maybe this could be in some kind of separate "DCAT-AP best practices" document?

matthiaspalmer commented 10 months ago

Chapter 14 is called "Usage guide on Datasets, Distributions and Data Services". It could fit in there. I do not think there is a best practices document at the moment, and I fear it would not be visible enough.

I think guidance is especially important to portal developers as decisions taken there affect many others. Clearly the specification should not have an opinion on exactly how the portals should be implemented. But providing good terminology to make it easier for portal developers to agree on principles like which data services it is acceptable to suppress in a search interface seems important.

jakubklimek commented 10 months ago

Ah, yes, Chapter 14 seems like a good place to have this, if the terminology was to be used further.

But there are two things: establishing the terminology (OK), and then using it to formulate some recommendations, e.g. for search interfaces. I still think that giving opinions on what should be suppressed in a presumed search interface is too much for the spec. For example, someone might come up with an innovative search interface where results do not have to be suppressed, and they would unnecessarily be in conflict with the recommendation.

matthiaspalmer commented 10 months ago

Great, my main focus is on establishing the terminology (or similar terminology if someone suggests a better wording than dependent / independent).

I agree that we should be careful not to presume too much in such a recommendation. Better to just say that dependent data services are assumed to be shown in close proximity to the dataset they support. Hence, the metadata to be provided on dependent data services may be less extensive, e.g. providing a separate title is not as important, since you have the title on both the dataset and potentially on a distribution to rely on.

Suppressing dependent data services in a search interface could perhaps be listed as an example. (Another example could be that a data portal groups dependent data services together with the dataset they serve in search results.)
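The two examples could look roughly like the sketch below. It assumes search hits are already tagged with the dataset they depend on; the `Hit` type, its `dependent_on` field and the `ex:` URIs are hypothetical and only serve the illustration. Keeping a dependent service as its own entry when its dataset is not among the hits is also an assumption, made so that grouping never silently drops a match.

```python
from collections import defaultdict
from typing import List, NamedTuple, Optional, Tuple


class Hit(NamedTuple):
    uri: str
    title: str
    kind: str                           # "dataset" or "service"
    dependent_on: Optional[str] = None  # dataset URI if this is a dependent data service


def suppress_dependent(hits: List[Hit]) -> List[Hit]:
    """Example 1: drop dependent data services from the result list."""
    return [h for h in hits if h.dependent_on is None]


def group_dependent(hits: List[Hit]) -> List[Tuple[Hit, List[Hit]]]:
    """Example 2: attach dependent data services to the dataset they serve."""
    by_dataset = defaultdict(list)
    for h in hits:
        if h.dependent_on is not None:
            by_dataset[h.dependent_on].append(h)

    datasets_in_results = {h.uri for h in hits if h.kind == "dataset"}
    grouped = []
    for h in hits:
        if h.dependent_on is None:
            # Independent entry, with any dependent services nested under it.
            grouped.append((h, by_dataset.get(h.uri, [])))
        elif h.dependent_on not in datasets_in_results:
            # Assumption: the dataset itself did not match, so keep the service visible.
            grouped.append((h, []))
    return grouped


hits = [
    Hit("ex:ds1", "Air quality", "dataset"),
    Hit("ex:api1", "Air quality", "service", dependent_on="ex:ds1"),
    Hit("ex:shared", "Geocoding API", "service"),
]

flat = suppress_dependent(hits)   # the duplicate-looking "Air quality" service is gone
nested = group_dependent(hits)    # the service is listed under its dataset instead
```

Both variants remove the perceived duplicate from the top-level list; the difference is only whether the dependent service stays reachable directly from the search results.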

SEMICeu / DCAT-AP

Avoid duplicates in search interfaces #274