Include more information in the clustering

Daniel-Mietchen commented 2 years ago

As far as I can tell, the clusters are currently determined based on co-authors (either known as an item or a string), which works well for multi-author works but less so for works, authors, or fields where single-author works are common. It is also often not sufficient when dealing with very common name strings, such as latinized Chinese names, especially when given names are reduced to initials.

It thus seems useful to enhance the clustering by using additional information, be it on the work items themselves (e.g. venue, topic, uses, cites) or on items pointing to them via claims (e.g. cites, described by source) or references (mainly stated in).

The citation information, while providing high signal-to-noise ratios for the purpose of author disambiguation, also comes with a significant performance overhead, so it should probably be enabled only via a dedicated checkbox (similar to the one for fuzzy search) and be disabled by default.

arthurpsmith commented 2 years ago

Actually, venue and topic are already part of the clustering algorithm, though each such match counts only the same as one matching coauthor name string. Citation relations are an interesting thing to consider for matching, and I could see how that would work: if one article cites two others, then those two should cluster together? Do you envision "described by source" or references working similarly, so that if a particular Wikidata item refers to several different articles that way, they should cluster together?
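
To make the equal weighting described above concrete, here is a minimal sketch in Python. It is not the tool's actual code: the `Work` record, its field names, and `match_score` are hypothetical, and the only thing taken from the comment is that a shared venue or topic counts the same as one shared coauthor name string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Work:
    """Hypothetical record for one work; not the tool's real data model."""
    coauthors: frozenset = frozenset()
    venues: frozenset = frozenset()
    topics: frozenset = frozenset()


def match_score(a: Work, b: Work) -> int:
    """Count shared features between two works.

    Assumption: every shared venue or topic adds 1, the same weight as
    one shared coauthor name string.
    """
    return (
        len(a.coauthors & b.coauthors)
        + len(a.venues & b.venues)
        + len(a.topics & b.topics)
    )


w1 = Work(
    coauthors=frozenset({"J. Smith", "L. Wang"}),
    venues=frozenset({"venue-A"}),
    topics=frozenset({"topic-X"}),
)
w2 = Work(
    coauthors=frozenset({"L. Wang"}),
    venues=frozenset({"venue-A"}),
    topics=frozenset({"topic-Y"}),
)
print(match_score(w1, w2))  # 2: one shared coauthor string plus one shared venue
```

With this weighting, two single-author works can only score on venue and topic, which is why extra signals such as citations could help in those cases.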

Daniel-Mietchen commented 2 years ago

Yes, works cited from a target work should cluster together, and works citing the target works should cluster together too. Same for the other properties mentioned above.
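
A rough sketch of how that rule could work, assuming citation data is available as simple (citing, cited) pairs. The function name `cluster_by_citations`, the input shape, and the use of union-find are illustrative assumptions, not part of the existing tool. Candidate works cited from the same work are merged, and candidate works citing the same work are merged too.

```python
from collections import defaultdict


def cluster_by_citations(cites, candidates):
    """Group candidate works linked by a shared citing or cited work.

    cites: iterable of (citing, cited) pairs (hypothetical input shape).
    candidates: the works whose authorship is being disambiguated.
    """
    candidates = set(candidates)
    parent = {w: w for w in candidates}

    def find(w):
        # Follow parent links to the cluster root, shortening paths on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    cited_from = defaultdict(set)  # citing work -> candidates it cites
    citers_of = defaultdict(set)   # cited work -> candidates citing it
    for src, dst in cites:
        if dst in candidates:
            cited_from[src].add(dst)
        if src in candidates:
            citers_of[dst].add(src)

    # Merge every group of candidates that share a citing or a cited work.
    for group in list(cited_from.values()) + list(citers_of.values()):
        members = sorted(group)
        for other in members[1:]:
            union(members[0], other)

    clusters = defaultdict(set)
    for w in candidates:
        clusters[find(w)].add(w)
    return sorted(sorted(c) for c in clusters.values())


example_cites = [("X", "A"), ("X", "B"), ("C", "Y"), ("D", "Y")]
print(cluster_by_citations(example_cites, ["A", "B", "C", "D", "E"]))
# [['A', 'B'], ['C', 'D'], ['E']]
```

The same function could be reused for "described by source" or "stated in" by passing pairs built from those properties instead of citations. In practice a single shared citation is probably too weak to force a merge on its own, so it may be better treated as one more match signal alongside coauthors, venue, and topic.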

arthurpsmith / author-disambiguator

Include more information in the clustering #177