HazyResearch / pdftotree

:evergreen_tree: A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License

Do not escape text twice #100

Closed HiromuHota closed 3 years ago

HiromuHota commented 3 years ago

Description of the problems or issues

Is your pull request related to a problem? Please describe.

Currently, text is escaped twice. As a result, & becomes &amp;amp; instead of &amp;.
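A minimal sketch of the symptom, using Python's standard `html.escape` as a stand-in. The actual escaping calls inside pdftotree are not shown in this description, so `escape_twice` is a hypothetical helper for illustration only.

```python
from html import escape


def escape_twice(text: str) -> str:
    """Hypothetical reproduction: escape text that was already escaped."""
    return escape(escape(text))


once = escape("Q&A")         # the ampersand is encoded a single time
twice = escape_twice("Q&A")  # the ampersand of the first entity is encoded again

print(once)
print(twice)
```

The second pass treats the `&` that starts the first entity as a literal ampersand, which is why the output ends up with a nested entity.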

Does your pull request fix any issue?

N/A

Description of the proposed changes

Escape text only once

Test plan

Add tests on extracted text to make sure text is escaped properly.
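One way such a check could look, as a sketch rather than the tests actually added in this pull request: text is escaped exactly once if unescaping it a single time recovers the raw string. `is_escaped_once` and the sample strings are assumptions made for illustration, and `html.escape` stands in for whatever pdftotree uses internally.

```python
from html import escape, unescape


def is_escaped_once(raw: str, emitted: str) -> bool:
    """Return True if unescaping `emitted` exactly once recovers `raw`."""
    return unescape(emitted) == raw


# Hypothetical samples; each contains a character that must be escaped.
samples = ["Q&A", "a < b", 'say "hi"']

for raw in samples:
    # Correct behaviour: a single escape round-trips.
    assert is_escaped_once(raw, escape(raw))
    # Regression guard: double-escaped text does not round-trip.
    assert not is_escaped_once(raw, escape(escape(raw)))
```

The regression guard is only meaningful for samples that contain special characters, since plain text is unchanged by any number of escapes.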

Checklist