HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License
409 stars 77 forks source link

Add get_bbox() to Sentence and (Temporary)SpanMention, and deprecate bbox_from_span and bbox_from_sentence #429

Closed HiromuHota closed 4 years ago

HiromuHota commented 4 years ago

This PR has two benefits:

  1. Type hints become short, hence easier to understand: Tuple[int, int, int, int, int] to Bbox.
  2. No need to import bbox_from_sentence and bbox_from_span. Just use sentence.get_bbox() and span.get_bbox(). This is actually the benefit from using OOP.

P.S. There are many more spots where OOP (object-oriented programming) is more suited.

HiromuHota commented 4 years ago

I think I fixed the issue.

codecov-commenter commented 4 years ago

Codecov Report

Merging #429 into master will decrease coverage by 0.15%. The diff coverage is 75.60%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #429      +/-   ##
==========================================
- Coverage   82.59%   82.44%   -0.16%     
==========================================
  Files          86       86              
  Lines        4366     4385      +19     
  Branches      810      812       +2     
==========================================
+ Hits         3606     3615       +9     
- Misses        572      582      +10     
  Partials      188      188              
Flag Coverage Δ
#unittests 82.44% <75.60%> (-0.16%) :arrow_down:
Impacted Files Coverage Δ
src/fonduer/candidates/models/span_mention.py 74.76% <60.00%> (-0.73%) :arrow_down:
src/fonduer/parser/models/sentence.py 93.38% <60.00%> (-1.44%) :arrow_down:
src/fonduer/utils/utils_visual.py 58.33% <71.42%> (-6.67%) :arrow_down:
src/fonduer/utils/data_model_utils/visual.py 88.23% <77.77%> (ø)
src/fonduer/parser/visual_linker.py 83.57% <100.00%> (+0.07%) :arrow_up:
src/fonduer/utils/visualizer.py 79.16% <100.00%> (ø)
senwu commented 4 years ago

Thanks for this PR!

What do you think about having a fonduer.typing module to store all Fonduer specific typings? Bbox is a good example here.

HiromuHota commented 4 years ago

Having a fonduer.typing module is a good idea. IMO, this will have type aliases to make codes more readable. For example, the following type hints are very lengthy and hard to read.

https://github.com/HazyResearch/fonduer/blob/bd79688230ba11f5ed97e58cc78548de14b2c3db/src/fonduer/parser/visual_linker.py#L31-L33

By defining type aliases like below:

Alias1 = List[Tuple[Tuple[int, int], str]]
Alias2 = OrderedDict[Tuple[str, int], Tuple[int, int]]

This could become

self.pdf_word_list: Optional[Alias1] = None
self.html_word_list: Optional[Alias1] = None 
self.links: Optional[Alias2] = None 

However, Bbox is not an alias, hence IMO it is not suited to be placed fonduer.typing. Moreover, I'll be adding methods to Bbox like Bbox.horz_aligned (superseding bbox_horz_aligned) and Bbox.vert_aligned (superseding bbox_vert_aligned), which makes Bbox less suitable in fonduer.typing.

HiromuHota commented 4 years ago

A good example would be: alias for Throtter

Currently, https://github.com/HazyResearch/fonduer/blob/bd79688230ba11f5ed97e58cc78548de14b2c3db/src/fonduer/candidates/candidates.py#L65

with

Throttler=Callable[[Tuple[Mention, ...]], bool]

to

throttlers: Optional[List[Throttler]] = None,