HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License

CandidateExtractor doesn't scale for larger relations #546

Open robbieculkin opened 3 years ago

robbieculkin commented 3 years ago

Hello, thanks for providing this framework. My group has run into a bit of a snag:

For context, we've successfully completed candidate extraction & labeling for binary relations, with reasonable runtimes. With parallelism = 6, candidate extraction takes ~2 minutes per document.

We've since moved on to a 3-ary relation that is very similar to the binary one: it shares some mentions with the binary relation and uses a very similar candidate extractor. We profiled the 3-ary throttler and found its per-call runtime to be nearly identical to the binary throttler's, yet candidate extraction now takes ~4 hours per document. The slowdown comes from how temporary candidates are formed: they are the Cartesian product of the mention sets, so their number grows multiplicatively with each entity added to a relation, i.e., exponentially in the relation's arity.
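For concreteness, here's a minimal sketch of the kind of setup we're describing, in the style of the Fonduer tutorials (the mention classes A, B, C and the throttler body are illustrative placeholders, not our actual code, and `session`/`docs` come from the usual Fonduer setup):

```python
from fonduer.candidates import CandidateExtractor
from fonduer.candidates.models import candidate_subclass, mention_subclass
from fonduer.utils.data_model_utils import same_page

# Illustrative stand-ins for our real mention classes.
A = mention_subclass("A")
B = mention_subclass("B")
C = mention_subclass("C")

ABC = candidate_subclass("ABC", [A, B, C])  # the 3-ary relation

def abc_throttler(c):
    # Placeholder predicate; our real throttler's per-call cost is nearly
    # identical to the binary throttler's, so per-call cost isn't the issue.
    return same_page(c)

extractor = CandidateExtractor(session, [ABC], throttlers=[abc_throttler])
extractor.apply(docs, split=0, parallelism=6)
```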

Here are some numbers from our use case:

If our relation only includes (A, B), we have a total of 800 × 140 = 112,000 temporary candidates to evaluate with our throttler. Adding mention C to form the relation (A, B, C) grows that total to 800 × 140 × 150 = 16.8 million. We're unable to narrow our mention matchers further without excluding true positives.
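As a rough sanity check that the growth in throttler calls alone explains the slowdown (assuming extraction time is dominated by throttler calls):

```python
n_a, n_b, n_c = 800, 140, 150

binary = n_a * n_b       # 112,000 temporary candidates
ternary = binary * n_c   # 16,800,000 temporary candidates

print(ternary / binary)  # 150.0x more throttler calls
# Observed: ~2 min -> ~4 h per document is a ~120x slowdown, roughly
# consistent with the 150x growth in temporary candidates.
```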

This slowdown makes the Fonduer framework effectively unusable for any large-scale use case that requires relations with more than 2 entities. Can you provide guidance to address this issue?

robbieculkin commented 3 years ago

The first workaround that comes to mind is to override the CandidateExtractorUDF.apply method, where temporary candidates are formed from the product of a document's mentions: https://github.com/HazyResearch/fonduer/blob/c9fd6b91998cd708ab95aeee3dfaf47b9e549ffd/src/fonduer/candidates/candidates.py#L263-L274
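For reference, a simplified picture of what that region does (a paraphrase, not the exact source): temporary candidates are the full Cartesian product of the document's per-class mention lists, and the throttler only runs after each tuple has already been materialized.

```python
from itertools import product

def extract_naive(mention_lists, throttler=None):
    # Paraphrase of the linked loop: every combination is formed first,
    # then the throttler discards tuples one at a time.
    for args in product(*mention_lists):
        if throttler is None or throttler(args):
            yield args
```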

If we group mentions by their contexts and form products only within those groups, we can reduce the number of temporary candidates passed to the throttler. This restriction could be applied at any level of Fonduer's data model hierarchy (Paragraph, Sentence, Table, ...).

For example, we can (see the sketch below):

  1. select all mentions within a given page,
  2. take their Cartesian product within that page, and
  3. union the results across all pages.
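A minimal sketch of steps 1-3, assuming a key function that maps each mention to its page number (the helper and the key function are hypothetical, not Fonduer API):

```python
from collections import defaultdict
from itertools import product

def grouped_product(mention_lists, key):
    # Bucket each mention class's mentions by a shared context key
    # (e.g. page number), then form products only within each bucket.
    buckets = []
    for mentions in mention_lists:
        by_key = defaultdict(list)
        for m in mentions:
            by_key[key(m)].append(m)
        buckets.append(by_key)
    # Only keys present in every mention class can yield candidates.
    shared = set(buckets[0]).intersection(*buckets[1:])
    for k in shared:
        yield from product(*(b[k] for b in buckets))
```

With our mention counts spread roughly evenly over, say, 40 pages, this forms about 40 × (20 × 3.5 × 3.75) ≈ 10,500 tuples instead of 16.8 million.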

This way, we eliminate all temporary candidates that span different pages (when that is what the programmer indicates). It is preferable to a same_page throttler because it cuts the computational work significantly: cross-page tuples are never formed in the first place, rather than formed and then filtered. Ultimately, the programmer could pass a shared_context parameter to the .apply method to indicate the right level of granularity.
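If that parameter existed, usage might look like this (shared_context is the proposed parameter, not an existing Fonduer API; "page" is one possible granularity):

```python
# Proposed API, not an existing one: only form temporary candidates from
# mentions that share a page, instead of filtering cross-page tuples later.
extractor.apply(docs, split=0, parallelism=6, shared_context="page")
```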