HazyResearch / pdftotree

:evergreen_tree: A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License

Treat non-breaking space as a white space at get_word_boundaries() #98

Closed HiromuHota closed 3 years ago

HiromuHota commented 3 years ago

Description of the problems or issues

Is your pull request related to a problem? Please describe.

A non-breaking space (\xa0) causes "Out of order" warnings when converting tests/input/md.pdf:

$ pip list | grep pdftotree
pdftotree                     0.5.0
$ pdftotree tests/input/md.pdf -v > /dev/null
[INFO] pdftotree.core - Digitized PDF detected, building tree structure...
[INFO] pdftotree.core - Tree structure built, creating html...
[WARNING] pdftotree.TreeExtract - Out of order (Markdown,  )
[WARNING] pdftotree.TreeExtract - Out of order (Markdown, M)
[WARNING] pdftotree.TreeExtract - Out of order (Markdown, a)
[WARNING] pdftotree.TreeExtract - Out of order (Markdown, r)
[WARNING] pdftotree.TreeExtract - Out of order (Markdown, k)
[WARNING] pdftotree.TreeExtract - Out of order (Markdown, d)
[WARNING] pdftotree.TreeExtract - Out of order (Markdown, o)
[WARNING] pdftotree.TreeExtract - Out of order (Markdown, w)

Does your pull request fix any issue?

N/A

Description of the proposed changes

Treat a non-breaking space (\xa0) as whitespace in get_word_boundaries().
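
To illustrate the idea, here is a minimal sketch of word-boundary detection that counts \xa0 as a separator alongside the ASCII space. It is not the actual code of get_word_boundaries(): the helper name `word_spans`, its `separators` parameter, and the sample string are all assumptions made up for this example.

```python
# Hypothetical helper for illustration only; this is NOT pdftotree's
# get_word_boundaries(), whose real signature is not shown in this thread.
NBSP = "\xa0"  # non-breaking space
WORD_SEPARATORS = {" ", NBSP}


def word_spans(text, separators=WORD_SEPARATORS):
    """Return (start, end) index pairs (end exclusive) for each word in text."""
    spans = []
    start = None  # index where the current word began, or None between words
    for i, ch in enumerate(text):
        if ch in separators:
            if start is not None:
                # A separator closes the word that is currently open.
                spans.append((start, i))
                start = None
        elif start is None:
            # First non-separator character opens a new word.
            start = i
    if start is not None:
        # Close a word that runs to the end of the text.
        spans.append((start, len(text)))
    return spans


line = "Markdown" + NBSP + "Markdown"  # made-up sample input
print(word_spans(line, separators={" "}))  # [(0, 17)]: NBSP glues the words
print(word_spans(line))                    # [(0, 8), (9, 17)]: two words
```

When only the ASCII space is a separator, the two words come back as a single span; adding \xa0 to the separator set splits them as expected.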


codecov-io commented 3 years ago

Codecov Report

Merging #98 into master will increase coverage by 0.03%. The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master      #98      +/-   ##
==========================================
+ Coverage   66.18%   66.22%   +0.03%     
==========================================
  Files          22       22              
  Lines        2552     2555       +3     
==========================================
+ Hits         1689     1692       +3     
  Misses        863      863              
| Flag | Coverage Δ |
|---|---|
| #unittests | 66.22% <100.00%> (+0.03%) :arrow_up: |


| Impacted Files | Coverage Δ |
|---|---|
| pdftotree/TreeExtract.py | 88.85% <100.00%> (ø) |
| tests/test_basic.py | 96.00% <100.00%> (+0.16%) :arrow_up: |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 3bdc6b9...d099d77.