HazyResearch / fonduer

A knowledge base construction engine for richly formatted data
https://fonduer.readthedocs.io/
MIT License
409 stars, 77 forks

Add multiline Japanese strings support to HocrVisualParser() to fix #534 and redo #537 #542

Closed YasushiMiyata closed 3 years ago

YasushiMiyata commented 3 years ago

Description of the problems or issues

Is your pull request related to a problem? Please describe. See #534. This request redoes #537, which first required a fix for #538 (fixed by #539).

Does your pull request fix any issue? See #534.

Description of the proposed changes

For a multi-line Japanese string such as 'AAAA\nBBBB', spacy[ja] sometimes generates tokens that straddle the line break, e.g. ['AAA', 'AB', 'B', 'BB']. This proposal defines the bbox of 'AB' as that of a multi-line word: left is the minimum left of ['A', 'B'], top is the top of 'A', right is the maximum right of ['A', 'B'], and bottom is the bottom of 'B'.
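The bbox rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the code in `HocrVisualParser`: the names `Bbox` and `merge_multiline_bbox` are hypothetical, and the coordinates are made up for the example.

```python
from typing import List, NamedTuple


class Bbox(NamedTuple):
    """Hypothetical bounding box in page coordinates (not a fonduer type)."""

    left: int
    top: int
    right: int
    bottom: int


def merge_multiline_bbox(char_bboxes: List[Bbox]) -> Bbox:
    """Merge per-character boxes, given in reading order, into one word box.

    left/right span every character; top comes from the first character
    and bottom from the last, so a token that wraps across lines gets a
    box covering both lines.
    """
    if not char_bboxes:
        raise ValueError("need at least one character bbox")
    first, last = char_bboxes[0], char_bboxes[-1]
    return Bbox(
        left=min(b.left for b in char_bboxes),
        top=first.top,
        right=max(b.right for b in char_bboxes),
        bottom=last.bottom,
    )


# 'A' is the last character of line 1, 'B' the first character of line 2.
a = Bbox(left=300, top=100, right=340, bottom=140)
b = Bbox(left=100, top=150, right=140, bottom=190)
merged = merge_multiline_bbox([a, b])
print(merged)  # Bbox(left=100, top=100, right=340, bottom=190)
```

For a token that stays on one line, the same function reduces to the usual union of its character boxes, so single-line words are unaffected by the rule.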

Test plan

This problem is caused by Japanese morphological analysis, so I have added Japanese test data to 'tests/data/hocr_simple/japan.hocr' and test code to 'tests/parser/test_parser.py::test_parse_hocr'.

codecov-commenter commented 3 years ago

Codecov Report

Merging #542 (d76bd76) into master (5ab8e9c) will increase coverage by 0.04%. The diff coverage is 100.00%.

:exclamation: Current head d76bd76 differs from pull request most recent head 5ca10e8. Consider uploading reports for the commit 5ca10e8 to get more accurate results.

@@            Coverage Diff             @@
##           master     #542      +/-   ##
==========================================
+ Coverage   86.02%   86.07%   +0.04%     
==========================================
  Files          92       92              
  Lines        4773     4775       +2     
  Branches      899      899              
==========================================
+ Hits         4106     4110       +4     
+ Misses        476      475       -1     
+ Partials      191      190       -1     
| Flag | Coverage Δ |
|---|---|
| unittests | 86.07% <100.00%> (+0.04%) :arrow_up: |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| ...fonduer/parser/visual_parser/hocr_visual_parser.py | 97.64% <100.00%> (+2.46%) :arrow_up: |

lukehsiao commented 3 years ago

To avoid the need for you to rebase this again, I'm going to squash these commits and merge. But please update your other PR to be based off master if possible.