allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0
11.76k stars 2.25k forks source link

CEAF calculation performance can be vastly improved #2946

Open jnothman opened 5 years ago

jnothman commented 5 years ago

Is your feature request related to a problem? Please describe. Calculation of the CEAF evaluation metric for coreference can be very slow, especially when there is a sparse contingency matrix (likely in long documents and especially entity linking).

Describe the solution you'd like Enormous performance improvements can be made by calculating the linear assignment in CEAF separately for each connected component in the sparse contingency graph (and avoiding initialising the linear assignment solver for connected components of size 1, i.e. clusters where there is no ambiguity in assignment).

Additional context I noted this issue while looking at the code related to #2945. I have implemented coreference evaluation metrics for Entity Linking at https://github.com/wikilinks/neleval/blob/master/neleval/coref_metrics.py. That code is also Apache licensed, so feel free to copy and attribute. Or feel free to ignore if you think this isn't an issue.

matt-gardner commented 5 years ago

Our coref metrics are just copied from somewhere else; if there's a faster way to do it, we'd be happy to have it. PR welcome!

jnothman commented 5 years ago

Well apart from anything, all implementations of these metrics that I've seen (including my own) recompute sufficient statistics in each metric implementation. What should be happening is that a sparse contingency matrix (including an extra vector for missing and spurious mentions) is constructed and reused by all metrics. Most metrics should then be a simple application of numpy/scipy operations. But that's an ideal that I've never got around to implementing since realising it was the right solution...

LoicGrobol commented 5 years ago

Couldn't that be a function in scipy? Something like sparse_linear_assignment? It would probably be useful for other things.

jnothman commented 5 years ago

Yeah, I've considered that. But it's only really applicable if it's sparsely connected: it's not easy to implement the kuhn-munkres algorithm directly on a sparse matrix iirc

LoicGrobol commented 5 years ago

Or it could be an option to the existing function for when we know that the bipartite graph is sparsely connected. Also what kind of speedups do you see with this method? I don't find it totally intuitive that eg connected components of size 1 add a lot of overhead to KM in practive.

jnothman commented 5 years ago

It could be. Without time to benchmark, I'm going to ask you to trust me ;)

The point is actually that with entity disambiguation there's a small set of likely confusions for each name across a corpus, and names tokens have a power-law distributed a la Zipf. So most names are infrequent hence their mentions are relatively unambiguous. So for any gold-system pair, you're likely to have many clusters with perfect agreement. It's the frequency of such clusters that makes it worth creating a fast path for the 1k or k1 connected component.