HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License
409 stars 77 forks source link

Fix the non-deterministic behavior in VisualLinker #458

Closed HiromuHota closed 4 years ago

HiromuHota commented 4 years ago

Description of the problems or issues

Is your pull request related to a problem? Please describe.

See #412.

Does your pull request fix any issue.

Fix #412.

Description of the proposed changes

A clear and concise description of what you propose.

Based on my findings reported at #412, the non-deterministic behaviour in Featurization boils down to a non-deterministic behaviour in VisualLinker, which is caused by the fact that the order of doc.sentences is not deterministic unless order_by=True in the backref argument below. https://github.com/HazyResearch/fonduer/blob/246a92dd05884e3a706ba216e6177456d537de13/src/fonduer/parser/models/sentence.py#L248-L252 One trivial way to fix this issue is to add order_by=True to the backref, however, I just sort sentences (by position) at the beginning of VisualLinker.link and I think this is good enough to fix #412.

Test plan

A clear and concise description of how you test the new changes.

Ideally, I'd like to test if my change prevents #412 from happening. However, this is not feasible as the issue is non-deterministic and does not happen every test run. Instead, I'd test if the result of visual linker is affected by the order of sentences.

Checklist

codecov-commenter commented 4 years ago

Codecov Report

Merging #458 into master will increase coverage by 0.00%. The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #458   +/-   ##
=======================================
  Coverage   83.20%   83.20%           
=======================================
  Files          88       88           
  Lines        4554     4555    +1     
  Branches      837      837           
=======================================
+ Hits         3789     3790    +1     
  Misses        572      572           
  Partials      193      193           
Flag Coverage Δ
#unittests 83.20% <100.00%> (+<0.01%) :arrow_up:
Impacted Files Coverage Δ
src/fonduer/parser/visual_linker.py 83.65% <100.00%> (+0.07%) :arrow_up: