Open · jnothman opened this issue 5 years ago

Is your feature request related to a problem? Please describe.
Calculation of the CEAF coreference evaluation metric can be very slow, especially when the contingency matrix is sparse (likely in long documents, and especially in entity linking).

Describe the solution you'd like
Enormous performance improvements can be made by solving the linear assignment in CEAF separately for each connected component of the sparse contingency graph (and by avoiding initialising the linear assignment solver altogether for connected components of size 1, i.e. clusters where there is no ambiguity in assignment).

Additional context
I noted this issue while looking at the code related to #2945. I have implemented coreference evaluation metrics for entity linking at https://github.com/wikilinks/neleval/blob/master/neleval/coref_metrics.py. That code is also Apache licensed, so feel free to copy and attribute. Or feel free to ignore if you think this isn't an issue.
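For illustration, a minimal sketch of that decomposition, assuming a SciPy sparse similarity matrix between gold and system clusters (nonzero wherever two clusters could be matched, e.g. a CEAF phi score) and SciPy >= 1.4 for maximize=True; the name assignment_total and the bmat-based graph construction are illustrative, not taken from any existing implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.sparse import bmat, csr_matrix
from scipy.sparse.csgraph import connected_components

def assignment_total(sim):
    """Total similarity of the optimal 1:1 gold-system cluster alignment,
    solved per connected component instead of over the full matrix.

    sim: sparse (n_gold, n_sys) matrix with sim[g, s] > 0 iff gold
    cluster g and system cluster s share a mention (e.g. a CEAF phi
    similarity); zeros mean "no possible confusion".
    """
    sim = csr_matrix(sim)
    n_gold, n_sys = sim.shape
    # Adjacency of the bipartite graph: gold nodes are 0..n_gold-1,
    # system nodes are n_gold..n_gold+n_sys-1.
    adj = bmat([[None, sim], [sim.T, None]])
    n_comp, labels = connected_components(adj, directed=False)
    total = 0.0
    for comp in range(n_comp):
        rows = np.flatnonzero(labels[:n_gold] == comp)
        cols = np.flatnonzero(labels[n_gold:] == comp)
        if not len(rows) or not len(cols):
            continue  # a cluster matching nothing contributes 0
        if len(rows) == 1 or len(cols) == 1:
            # Fast path: in a 1 x k or k x 1 component the optimal
            # assignment is just the single largest entry, so the
            # Kuhn-Munkres solver never needs to be initialised.
            total += sim[rows][:, cols].max()
        else:
            sub = sim[rows][:, cols].toarray()
            r, c = linear_sum_assignment(sub, maximize=True)
            total += sub[r, c].sum()
    return total
```

CEAF's precision and recall then divide this total by the self-similarity of the system and gold clusterings respectively, so the assignment is the only expensive step.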
Our coref metrics are just copied from somewhere else; if there's a faster way to do it, we'd be happy to have it. PR welcome!
Well, apart from anything else, all implementations of these metrics that I've seen (including my own) recompute sufficient statistics separately inside each metric implementation. What should happen instead is that a sparse contingency matrix (including an extra vector for missing and spurious mentions) is constructed once and reused by all metrics. Most metrics should then be a simple application of numpy/scipy operations, as in the sketch below. But that's an ideal I've never got around to implementing since realising it was the right solution...
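Concretely, a minimal sketch of that design, assuming for brevity that gold and system annotate the same mention set (so the extra missing/spurious vector is omitted) and that each clustering is given as a dict mapping mention id to cluster id; the helper names contingency and b_cubed are illustrative:

```python
import numpy as np
from scipy.sparse import coo_matrix

def contingency(gold, system):
    """Sparse contingency matrix: C[g, s] counts the mentions shared
    by gold cluster g and system cluster s.

    gold, system: dicts mapping mention id -> cluster id, assumed here
    to cover the same mentions; a full version would also carry an
    extra vector for missing and spurious mentions.
    """
    gold_ids = {c: i for i, c in enumerate(set(gold.values()))}
    sys_ids = {c: i for i, c in enumerate(set(system.values()))}
    rows = [gold_ids[gold[m]] for m in gold]
    cols = [sys_ids[system[m]] for m in gold]
    data = np.ones(len(gold))
    return coo_matrix((data, (rows, cols)),
                      shape=(len(gold_ids), len(sys_ids))).tocsr()

def b_cubed(C):
    """B^3 precision and recall as plain numpy/scipy ops on C."""
    n = C.sum()  # total number of mentions
    gold_sizes = np.asarray(C.sum(axis=1))  # column vector, (n_gold, 1)
    sys_sizes = np.asarray(C.sum(axis=0))   # row vector, (1, n_sys)
    sq = C.multiply(C)  # elementwise square, stays sparse
    precision = sq.multiply(1.0 / sys_sizes).sum() / n
    recall = sq.multiply(1.0 / gold_sizes).sum() / n
    return precision, recall
```

Other metrics fall out of the same matrix: for instance, under the same-mentions assumption, MUC's partition counts are just the number of nonzero cells per row or column.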
Couldn't that be a function in scipy? Something like sparse_linear_assignment? It would probably be useful for other things.
Yeah, I've considered that. But it's only really applicable if the graph is sparsely connected: it's not easy to implement the Kuhn-Munkres algorithm directly on a sparse matrix, IIRC.
Or it could be an option to the existing function for when we know that the bipartite graph is sparsely connected. Also, what kind of speedups do you see with this method? I don't find it totally intuitive that, e.g., connected components of size 1 add a lot of overhead to KM in practice.
It could be. Without time to benchmark, I'm going to ask you to trust me ;)
The point is actually that with entity disambiguation there's a small set of likely confusions for each name across a corpus, and name tokens have a power-law distribution a la Zipf. So most names are infrequent, and hence their mentions are relatively unambiguous. So for any gold-system pair of clusterings, you're likely to have many clusters with perfect agreement. It's the frequency of such clusters that makes it worth creating a fast path for 1×k or k×1 connected components.