Open Daniel-Mietchen opened 2 years ago
Actually, venue and topic are already part of the clustering algorithm, though each such match counts only as much as one matching coauthor name string. Citation relations are an interesting signal to consider for matching, and I could see how that would work: if one article cites two others, should those two cluster together? Do you envision "described by source" or references working similarly, so that if a particular Wikidata item refers to several different articles that way, they should cluster together?
Yes, works cited by the same target work should cluster together, and works citing the same target work should cluster together too. The same applies to the other properties mentioned above.
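To make the two citation rules concrete, here is a minimal sketch in Python. It is not taken from the tool's code: the function name `citation_pairs`, the `cites` mapping (work ID to the list of work IDs it cites), and the item IDs are all hypothetical. It only shows which candidate works would be linked by a shared citing work or a shared cited work.

```python
from collections import defaultdict
from itertools import combinations


def citation_pairs(cites, candidates):
    """Return pairs of candidate works linked through a shared citation.

    cites: dict mapping a work ID to the list of work IDs it cites
           (hypothetical input format, assumed for this sketch).
    candidates: work IDs that carry the ambiguous author name string.
    """
    candidates = set(candidates)

    # Invert the mapping: work -> set of works that cite it.
    cited_by = defaultdict(set)
    for source, targets in cites.items():
        for target in targets:
            cited_by[target].add(source)

    pairs = set()

    # Rule 1: two candidates cited by the same work belong together.
    for source, targets in cites.items():
        hits = sorted(candidates & set(targets))
        pairs.update(combinations(hits, 2))

    # Rule 2: two candidates citing the same work belong together.
    for target, sources in cited_by.items():
        hits = sorted(candidates & sources)
        pairs.update(combinations(hits, 2))

    return pairs
```

For example, if `Q1` cites `Q2` and `Q3`, and both `Q4` and `Q5` cite `Q9`, then `citation_pairs({"Q1": ["Q2", "Q3"], "Q4": ["Q9"], "Q5": ["Q9"]}, ["Q2", "Q3", "Q4", "Q5", "Q6"])` yields the pairs `("Q2", "Q3")` and `("Q4", "Q5")`, while `Q6` stays unlinked. The same shape would presumably work for "described by source" or "stated in" by swapping in a different mapping.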
As far as I can tell, the clusters are currently determined based on co-authors (either known as an item or as a string), which works well for multi-author works but less so for works, authors, or fields where single-author works are common. It is also often not sufficient when dealing with very common name strings, such as latinized Chinese names, especially when given names are reduced to initials.
It thus seems useful to enhance the clustering by using additional information, be it on the work items themselves (e.g. venue, topic, uses, cites) or on items pointing to them via claims (e.g. cites, described by source) or references (mainly stated in).
The citation information, while providing a high signal-to-noise ratio for the purpose of author disambiguation, also comes with a significant performance overhead, so it should probably only be enabled via a dedicated checkbox (similar to the one for fuzzy search) and be disabled by default.
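As a rough illustration of the opt-in behaviour, the sketch below scores a pair of works with citations off by default. Everything here is an assumption for discussion, not the tool's actual code: the function name `match_score`, the dictionary keys, the `use_citations` flag (standing in for the proposed checkbox), and the weight of one point per shared cited work. The equal weighting of venue, topic, and coauthor matches follows the description given in this thread.

```python
def match_score(a, b, use_citations=False):
    """Score how strongly two works appear to share an author.

    a, b: dicts with optional keys "coauthors", "venue", "topics", "cites"
          (hypothetical record format, assumed for this sketch).
    use_citations: stands in for the proposed checkbox; off by default
                   because gathering citation data is expensive.
    """
    # One point per shared coauthor (item ID or name string).
    score = len(set(a.get("coauthors", ())) & set(b.get("coauthors", ())))

    # A shared venue counts the same as one shared coauthor.
    if a.get("venue") and a.get("venue") == b.get("venue"):
        score += 1

    # Each shared topic also counts the same as one shared coauthor.
    score += len(set(a.get("topics", ())) & set(b.get("topics", ())))

    # Citation overlap is only consulted when explicitly enabled.
    if use_citations:
        score += len(set(a.get("cites", ())) & set(b.get("cites", ())))

    return score
```

With two single-author works that share a venue and one cited work but no coauthors, the default score is 1 and the score with `use_citations=True` is 2, which is the kind of extra signal that would help for single-author fields and common name strings.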