De-duplication - Githubissues

twagoo commented 6 years ago

Solr offers some features allowing for de-duplication, i.e. hiding or collapsing records with very high similarity. This might be useful, e.g. to initially collapse similar records in search results to provide a better overview and easier navigation for the user. Records probably shouldn't be excluded altogether (this easily becomes opaque), but using the 'fuzzy hash' to collapse similar results seems worth looking into.

twagoo commented 6 years ago

Result grouping can perhaps be used here, combined with the fuzzy hash/signature feature (see de-duplication link).

The client (solrj) will have to process groups instead of a single result list. We will have to see how this interacts with the ranking.

dietervu commented 6 years ago

At the meeting of the NCF today, the need for such a collapsing feature was stressed. Would be good to give this a higher priority.

teckart commented 6 years ago

After playing around with Solr's de-duplication mechanisms, the preliminary result is that we probably can use the default implementation. I am still struggling with the problem that partial updates overwrite signatures with empty values. Looks a lot like the problem mentioned in this (closed) ticket.

For querying results we should use the "collapse and expand" approach (instead of grouping), for two reasons:

- it was specifically designed to improve performance when the number of distinct groups gets very high (which is the case for the VLO)
- the SolrJ interface is the same as the one we are using right now: QueryResponse.getResults() will return the "best" document of each group, and QueryResponse.getExpandedResults() contains the other "collapsed" documents of the selected group(s). This should make it rather easy to modify the webapp code. The ranking of the results (both the groups and the documents in a group) may still be problematic and will require further testing, but there are parameters to tweak those.
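As a rough illustration of the collapse-and-expand request described above, here is a minimal sketch that only assembles the request parameters, so it runs without a Solr instance. The `fq={!collapse field=...}`, `expand` and `expand.rows` parameters are Solr's own; the class name, the field name `signature` and the row count are assumptions made for this example, not the VLO configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CollapseExpandParams {

    /**
     * Builds the request parameters for collapsing results on one field
     * and returning the collapsed documents in the "expanded" section.
     * The field name is passed in because the real one is not known here.
     */
    static Map<String, String> build(String userQuery, String signatureField, int expandRows) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userQuery);
        // Collapse filter: keeps one representative document per field value.
        params.put("fq", "{!collapse field=" + signatureField + "}");
        // Expand component: returns the other members of each collapsed group.
        params.put("expand", "true");
        params.put("expand.rows", Integer.toString(expandRows));
        return params;
    }

    public static void main(String[] args) {
        // "signature" is a hypothetical field name used for illustration only.
        build("talkbank", "signature", 5)
                .forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

With SolrJ the same parameters would presumably be set on a SolrQuery (setQuery, addFilterQuery, set) before reading getResults() and getExpandedResults() from the response. The collapse parser also accepts a nullPolicy option, which may matter for records whose signature is empty.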

What remains an open question is: what do we want to group, and how aggressively do we want to collapse results? For testing purposes, I am currently creating signatures on the fields languageCode, dataProviderName and description. This already helps a lot for near-duplicates like those from Talkbank (example query).

If there is a demand for this feature, there may be concrete wishes as to where grouping is expected to work. Are there other examples we should think of? For example, I would avoid grouping results from different providers to avoid provoking discussions, and I am unsure if using fuzzy signatures is really a good idea, as it may lead to opaque results for both users and providers. But of course there may be other views on this issue...

twagoo commented 6 years ago

Great news! Sounds like a good approach that could be integrated into the front end relatively easily.

I agree with @teckart that we should only group the really 'obvious' duplicates, i.e. those records with (nearly) identical names and descriptions from the same provider. Fuzziness is cool but indeed requires a lot of caution.

Perhaps hierarchy can also be encoded in the signature; that is, records in the same subcollection are perhaps better candidates to be grouped together than those that aren't. I'm thinking, for example, of newspaper editions from Europeana (e.g. Luxemburger Wort), which currently do not have hierarchy information encoded, but the intention is to add this in the near future. Can multiple strategies be used somehow?

One more question: does the query affect the determination of the groups?

dietervu commented 6 years ago

Thanks Thomas, this sounds very good. I would also suggest including Collection, since that fits naturally with sets that can correctly be collapsed. What we would need to experiment with is whether we want this collapsing to happen only in case of a high number of results (say >100) or always. The way Google does it (always collapsing, even when there are not so many hits) might be the most consistent.

I also agree on avoiding grouping records that come from different providers, this would confuse people.

teckart commented 6 years ago

> Perhaps hierarchy can also be encoded in the signature

Grouping/collapsing always uses a hash based on (a subset of) the other record fields. We could use the hierarchy root as part of the hash inputs, but this would have the effect that only records with the same hierarchy root are grouped together. This is a very strong restriction and probably not helpful for other use cases. There seems to be no reasonable support for collapsing/grouping by multiple fields in Solr (except of course by distributing the task over multiple queries). We could create multiple hash values, but then the complexity of deciding which one to use has to be tackled in the webapp. This could be a UI switch between "Group results by origin" and "Group results by similar content". The advantage would be that the Solr query mechanism stays the same (i.e. collapsing by a field) and the UI stays consistent: some "inline" list that is only shown after "Click here for X more similar records" in both cases.

> One more question: does the query affect the determination of the groups?

No, the groups are only determined by the hash value based on record data that is available during import time.

> I would also suggest to include Collection - since that fits naturally with sets that can correctly be collapsed.

Sure. I will add it to the configuration.

Regarding the aforementioned problem with atomic/partial updates: I asked for help on the official Solr support mailing list. If there is no solution in a reasonable amount of time, I will do the signature generation in the importer. That might be a suboptimal solution, especially if we decide to use fuzzy signatures in the future. However, Solr's fuzzy signature approach didn't look very promising to me either (at least when using the out-of-the-box configuration).

teckart commented 6 years ago

As there was no feedback from the Solr mailing list, signature creation is now handled as part of the importer (63e1556032f94ca17a00dc646eeee043f64b1bb5), using MD5 hashes based on the following concatenated document fields (if available): languageCode, dataProvider, description and collection.
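For readers who want a picture of what such a signature amounts to, here is a minimal sketch, assuming the available field values are simply concatenated in the order listed above and hashed with the JDK's MessageDigest. It is not the code from the referenced commit; the class name, the field order and the lack of a separator are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.stream.Collectors;

final class RecordSignature {

    // Field names as listed in the comment above; the order is an assumption.
    static final List<String> SIGNATURE_FIELDS =
            List.of("languageCode", "dataProvider", "description", "collection");

    /** Returns the hex MD5 of the concatenated field values that are present. */
    static String signature(Map<String, String> record) {
        // Missing fields are skipped ("if available").
        String joined = SIGNATURE_FIELDS.stream()
                .map(record::get)
                .filter(Objects::nonNull)
                .collect(Collectors.joining());
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

One design point to keep in mind with plain concatenation: without a separator, different field splits can yield the same input string ("ab" + "c" versus "a" + "bc"), so a delimiter between fields may be worth considering.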

dietervu commented 6 years ago

Thanks for the update! Does that mean that only records where those fields are exactly the same will be grouped? (If this is the case we can probably leave out dataProvider, since I assume it will be constant over a single collection)

teckart commented 6 years ago

Yes, those fields have to be identical to group two records. If we want to use a more "fuzzy" approach, it would be easy to reimplement Solr's solution for that (TextProfileSignature) which may have the drawbacks already mentioned above.

I am not sure about removing dataProvider from the list. For records where //MdCollectionDisplayName is empty and there is no description either (we have this combination for 14K CLARIN records), it would group all records that have the same language information (but maybe that is what we want?). Keeping dataProvider limits this effect to records within a single endpoint.
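The effect can be illustrated with a small sketch that groups records on a plain concatenated key instead of a hash. The record contents and provider names below are made up for the example, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class GroupingKeyDemo {

    /** Joins the chosen field values; a missing field contributes an empty string. */
    static String key(Map<String, String> record, List<String> fields) {
        return fields.stream()
                .map(f -> record.getOrDefault(f, ""))
                .collect(Collectors.joining("|"));
    }

    /** Returns the number of records per grouping key. */
    static Map<String, Long> groupSizes(List<Map<String, String>> records, List<String> fields) {
        return records.stream()
                .collect(Collectors.groupingBy(r -> key(r, fields), Collectors.counting()));
    }

    public static void main(String[] args) {
        // Two invented records: same language, no description, no collection.
        List<Map<String, String>> records = List.of(
                Map.of("languageCode", "nld", "dataProvider", "Provider A"),
                Map.of("languageCode", "nld", "dataProvider", "Provider B"));

        List<String> withProvider =
                List.of("languageCode", "dataProvider", "description", "collection");
        List<String> withoutProvider =
                List.of("languageCode", "description", "collection");

        System.out.println("groups with dataProvider:    " + groupSizes(records, withProvider).size());
        System.out.println("groups without dataProvider: " + groupSizes(records, withoutProvider).size());
    }
}
```

With dataProvider in the key the two records stay in separate groups; without it they fall into one group, because language is the only field left that has a value.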

twagoo commented 6 years ago

It might be worth investigating if the hashing logic implemented in 63e1556032f94ca17a00dc646eeee043f64b1bb5 could be improved by one or more of the following:

- Using a different hashing algorithm - I believe some are out there that perform better than md5
- Stream based approach (filter field names -> get string -> collect into joined string or directly into digest method)
- Find a hash implementation (md5 or other) that takes a CharSequence as an input rather than a String so that the concatenated values don't have to be materialised into a String object

My gut feeling is that performance could be optimised compared to the current implementation. Of course we would need to measure the actual performance of any proposed improvement (there should be a significant difference on ~100k operations).
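To make the "directly into digest method" idea concrete, here is a minimal sketch that compares the two variants: joining the values into one String first, and feeding each value into the digest as it comes. It assumes MD5 via the JDK's MessageDigest and non-null, well-formed string values; the class and method names are hypothetical, and it makes no claim about which variant is faster.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

final class StreamingSignature {

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Variant 1: materialise the joined String, then hash it in one call. */
    static byte[] concatenated(List<String> values) {
        String joined = String.join("", values);
        return newMd5().digest(joined.getBytes(StandardCharsets.UTF_8));
    }

    /** Variant 2: update the digest value by value, without a joined String. */
    static byte[] streamed(List<String> values) {
        MessageDigest md5 = newMd5();
        values.forEach(v -> md5.update(v.getBytes(StandardCharsets.UTF_8)));
        return md5.digest();
    }
}
```

Both variants hash the same byte sequence, so existing signatures would stay valid if the implementation were switched. Whether the streamed variant is measurably faster would still need a benchmark, as noted above.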

clarin-eric / VLO

De-duplication #113