Closed YasushiMiyata closed 3 years ago
Merging #542 (d76bd76) into master (5ab8e9c) will increase coverage by 0.04%. The diff coverage is 100.00%.
:exclamation: Current head d76bd76 differs from pull request most recent head 5ca10e8. Consider uploading reports for the commit 5ca10e8 to get more accurate results.
@@ Coverage Diff @@
## master #542 +/- ##
==========================================
+ Coverage 86.02% 86.07% +0.04%
==========================================
Files 92 92
Lines 4773 4775 +2
Branches 899 899
==========================================
+ Hits 4106 4110 +4
+ Misses 476 475 -1
+ Partials 191 190 -1
Flag | Coverage Δ | |
---|---|---|
unittests | 86.07% <100.00%> (+0.04%) | :arrow_up: |
Flags with carried forward coverage won't be shown.
Impacted Files | Coverage Δ | |
---|---|---|
...fonduer/parser/visual_parser/hocr_visual_parser.py | 97.64% <100.00%> (+2.46%) | :arrow_up: |
To avoid the need for you to rebase this again, I'm going to squash these commits and merge. But please update your other PR to be based off master if possible.
Description of the problems or issues
Is your pull request related to a problem? Please describe. See #534. This request redoes #537, which first required a fix for #538 (fixed by #539).
Does your pull request fix any issue? See #534.
Description of the proposed changes
For a multi-line Japanese string such as 'AAAA\nBBBB', spacy[ja] sometimes generates tokens that span the line break, e.g. ['AAA', 'AB', 'B', 'BB']. This PR defines the bbox of such a token ('AB') as a multi-line word: left is the min left of ['A', 'B'], top is the top of 'A', right is the max right of ['A', 'B'], and bottom is the bottom of 'B'.
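The bbox rule for a token that spans a line break can be sketched as follows. This is only an illustration, not the code in hocr_visual_parser.py: the `Box` type and the `merge_multiline_bbox` helper are hypothetical names, and the sketch assumes the per-character boxes are given in reading order with pixel coordinates whose origin is at the top-left.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    """Hypothetical bounding box; origin assumed at the top-left of the page."""
    left: int
    top: int
    right: int
    bottom: int


def merge_multiline_bbox(char_boxes: Sequence[Box]) -> Box:
    """Merge per-character boxes (in reading order) into one token bbox.

    left/right span every character, top comes from the first character,
    and bottom comes from the last character.
    """
    if not char_boxes:
        raise ValueError("char_boxes must not be empty")
    return Box(
        left=min(b.left for b in char_boxes),
        top=char_boxes[0].top,
        right=max(b.right for b in char_boxes),
        bottom=char_boxes[-1].bottom,
    )


# 'A' is the last character of the first line, 'B' the first of the second.
a_box = Box(left=300, top=10, right=400, bottom=50)
b_box = Box(left=0, top=60, right=100, bottom=100)
ab_box = merge_multiline_bbox([a_box, b_box])
```

With these made-up coordinates the token 'AB' gets a box that covers the end of the first line and the start of the second, so it is wider and taller than either character alone.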
Test plan
This behavior is caused by Japanese morphological analysis, so I have added Japanese test data to 'tests/data/hocr_simple/japan.hocr' and test code to 'tests/parser/test_parser.py::test_parse_hocr'.
Checklist