allenai / vila

Incorporating VIsual LAyout Structures for Scientific Text Classification
Apache License 2.0

end2end-sci-pdf-parsing doesn't work #31

Closed dennissoftman closed 1 year ago

dennissoftman commented 1 year ago

Running the example code, following the README line by line, produces this error:

[image: screenshot of the error output]

Any ideas on how to fix that?

lolipopshock commented 1 year ago

Thank you @dennissoftman for reporting this issue! May I ask for a few more details to help me identify the problem:

  1. Is it the direct output of the used docker container?
  2. Does it only happen to one specific paper, or for every paper you've passed in?
    • If it is the former, could you provide a link/URL to the original paper so I can check what's going on?

dennissoftman commented 1 year ago

  1. Yes, this is the output from Docker; it's what I saw in the terminal after uploading the document.
  2. It works for some documents, so I assume the problem is with certain encodings. I used the document from the example: "https://arxiv.org/pdf/2106.00676.pdf". I was just following this example step by step: https://github.com/allenai/vila/tree/main/examples/end2end-sci-pdf-parsing

lolipopshock commented 1 year ago

Thank you very much @dennissoftman! Yes, it is caused by some unusual Unicode characters that the Hugging Face tokenizers cannot handle properly. Let me take a look; I should be able to get back to you by the end of this week!
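
For anyone hitting the same error in the meantime, here is a minimal sketch of one possible workaround. It is not necessarily what the eventual fix does. It assumes the failure comes from PDF tokens made up entirely of characters the tokenizer drops (private-use glyphs, zero-width or control characters), which would leave the words and their bounding boxes out of sync. The helper names `sanitize_token` and `sanitize_tokens` and the `"[UNK]"` placeholder are hypothetical and not part of the vila API.

```python
import unicodedata

# Hypothetical placeholder; a real pipeline would more likely use the
# tokenizer's own unknown-token string.
PLACEHOLDER = "[UNK]"


def sanitize_token(token: str, placeholder: str = PLACEHOLDER) -> str:
    """Normalize one PDF token and strip characters a tokenizer may discard."""
    # NFKC folds compatibility forms such as ligatures into plain letters.
    normalized = unicodedata.normalize("NFKC", token)
    # Drop "Other" category characters (control, format, private use,
    # surrogate, unassigned) and any whitespace left inside the token.
    kept = "".join(
        ch
        for ch in normalized
        if unicodedata.category(ch)[0] != "C" and not ch.isspace()
    )
    # Never return an empty string, so the token count stays unchanged.
    return kept if kept else placeholder


def sanitize_tokens(tokens):
    """Sanitize a list of tokens while keeping its length the same."""
    return [sanitize_token(t) for t in tokens]


if __name__ == "__main__":
    print(sanitize_tokens(["Layout", "\uf0b7", "\ufb01eld", "\u200b"]))
```

Tokens that end up empty are replaced with a placeholder instead of being removed, so each token still lines up with its bounding box.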

lmessinger commented 1 year ago

Hi

Any updates on this? Thanks!

lolipopshock commented 1 year ago

Hey @lmessinger and @dennissoftman, sorry for the delay (it is indeed an interesting bug...), but this is fixed in #33. Let me know if there are other issues. Thanks!