Replace Certain Unicode Characters for the input

The HF tokenizer replaces certain Unicode characters with a space ' ', so words made up of such characters produce no tokens. As a result, the token-level predictions become shorter than the input, which can cause mismatched sequences. This PR addresses the issue by adding an option to replace such Unicode characters with the [UNK] token. We replace Unicode characters in certain "categories", namely ["Cc", "Cf", "Co", "Cs", "Mn", "Zl", "Zp", "Zs"], as specified by the rules in the corresponding HF tokenizer.
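The category-based replacement can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper name `replace_empty_unicode_words` and the rule "replace a word only if every character falls in one of the listed categories" are assumptions made for the example. It relies only on the standard-library `unicodedata.category` function.

```python
import unicodedata

# Unicode general categories listed in the PR description.
REPLACED_CATEGORIES = {"Cc", "Cf", "Co", "Cs", "Mn", "Zl", "Zp", "Zs"}
UNK_TOKEN = "[UNK]"


def replace_empty_unicode_words(words, unk_token=UNK_TOKEN):
    """Return a copy of `words` in which every non-empty word made up entirely
    of characters from REPLACED_CATEGORIES is swapped for `unk_token`.

    Hypothetical helper: the whole-word rule is an assumption for this sketch.
    """
    cleaned = []
    for word in words:
        # Empty strings are left untouched; all() over an empty word would be True.
        if word and all(unicodedata.category(ch) in REPLACED_CATEGORIES for ch in word):
            cleaned.append(unk_token)
        else:
            cleaned.append(word)
    return cleaned


# ZERO WIDTH SPACE (Cf), NO-BREAK SPACE (Zs), COMBINING ACUTE ACCENT (Mn)
example_words = ["Hello", "\u200b", "\u00a0", "world", "\u0301"]
cleaned_words = replace_empty_unicode_words(example_words)
```

Because each affected word is replaced one-for-one rather than dropped, the output list has the same length as the input, which is the property needed to keep predictions aligned with the input words.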

Usage:

df_predictor.predict(pdf_data, page_size, replace_empty_unicode=False)

A future update could instead replace only the Unicode characters listed in the cached file examples/find-empty-unicode-chars/zero-length-unicode-chars.txt, which we have tested and confirmed to have zero tokenization length.
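A list of zero-length characters like the one in that file could be derived along these lines. This is a hedged sketch only: `find_zero_length_chars` is a hypothetical helper, the `tokenize` argument is any callable that maps a string to a list of tokens, and `str.split` is used here purely as a stand-in so the example runs without extra dependencies. With a real HF tokenizer, one would presumably pass its `tokenize` method instead, and the results would differ.

```python
def find_zero_length_chars(candidates, tokenize):
    """Return the candidate characters for which `tokenize` yields no tokens.

    Hypothetical helper for illustration; `tokenize` is any str -> list callable.
    """
    return [ch for ch in candidates if len(tokenize(ch)) == 0]


# Stand-in tokenizer (an assumption for this sketch): whitespace splitting.
candidate_chars = ["a", " ", "\u00a0", "\u200b"]
zero_length = find_zero_length_chars(candidate_chars, str.split)
```

With the whitespace stand-in, only the plain space and the no-break space come back as zero-length; a subword tokenizer that also strips control or format characters would flag more of them.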

allenai/vila, pull request #23