allenai / vila

Incorporating VIsual LAyout Structures for Scientific Text Classification
Apache License 2.0

Replace special tokens with normal text before passing into models #33

Closed lolipopshock closed 1 year ago

lolipopshock commented 1 year ago

When a paper contains special tokens verbatim (e.g., [SEP] or [BLK]), the current code cannot handle them properly after #29. One interesting example, as reported in #31: parsing our own VILA paper fails on page 2, where the text [BLK] occurs multiple times. This PR proposes a simple fix: remove the square brackets [ and ] from the text.
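
For illustration, here is a minimal sketch of the idea, not the code in this PR. The function name `strip_special_token_brackets` and the contents of `SPECIAL_TOKENS` are assumptions made for the example; the real token list depends on the tokenizer in use. The sketch only strips brackets around known special tokens, so ordinary bracketed text such as citation markers is left alone.

```python
from typing import Iterable, List, Sequence

# Assumed list of special tokens (hypothetical); adjust to match the tokenizer.
SPECIAL_TOKENS = ("[SEP]", "[BLK]", "[CLS]", "[PAD]", "[UNK]", "[MASK]")


def strip_special_token_brackets(
    words: Iterable[str],
    special_tokens: Sequence[str] = SPECIAL_TOKENS,
) -> List[str]:
    """Return the words with brackets removed from any verbatim special token."""
    cleaned = []
    for word in words:
        for token in special_tokens:
            if token in word:
                # "[BLK]" -> "BLK": drop the leading "[" and trailing "]".
                word = word.replace(token, token[1:-1])
        cleaned.append(word)
    return cleaned


words = ["the", "[BLK]", "token", "([SEP])", "[12]"]
print(strip_special_token_brackets(words))
# ['the', 'BLK', 'token', '(SEP)', '[12]']
```

Running this on the page's words before the special tokens are inserted keeps text from the paper from being confused with the markers the model relies on.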