atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Colored text can not be extracted #383

Open apache135 opened 4 years ago

apache135 commented 4 years ago

Hello, thanks for this great lib, which has been very convenient for me.

I want to report two problems I ran into with it.

  1. When a table cell contains blue text with no background, Camelot can't extract the content of that cell.

  2. I have a table with 3 rows and 3 columns. The last row is ['is it a word?', 'yes', '']. After extraction, the last row comes back as ['is it a word?is it a word?', 'yes yes', '']; the text of each cell is repeated. The parameters I pass to read_pdf are line_scale=30, split_text=True, and the table regions.

Sorry that I can't upload the PDF file. If possible, could you offer some tips for troubleshooting?
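Until the cause of the repeated text in problem 2 is known, a post-processing workaround might look like the sketch below. It only collapses a cell whose text is exactly the same string twice, either back to back or separated by one whitespace character. The helper names `collapse_repeated` and `clean_rows` are made up for illustration and are not part of Camelot. Note that it would also collapse cells that are legitimately doubled (for example '1 1'), so it is a stopgap rather than a fix.

```python
def collapse_repeated(cell: str) -> str:
    """Return the cell with an exact doubling removed, else the cell unchanged.

    Hypothetical helper: handles 'abcabc' -> 'abc' and 'yes yes' -> 'yes'.
    """
    text = cell.strip()
    n = len(text)
    half = n // 2
    # Case 1: the two copies are directly adjacent, e.g. 'abcabc'.
    if n > 0 and n % 2 == 0 and text[:half] == text[half:]:
        return text[:half].strip()
    # Case 2: the two copies are separated by one whitespace character.
    if n % 2 == 1 and text[half].isspace() and text[:half] == text[half + 1:]:
        return text[:half].strip()
    return cell


def clean_rows(rows):
    """Apply collapse_repeated to every cell of a list-of-lists table."""
    return [[collapse_repeated(cell) for cell in row] for row in rows]


if __name__ == "__main__":
    last_row = ["is it a word?is it a word?", "yes yes", ""]
    print(clean_rows([last_row]))
```

The same function could be applied cell by cell to the DataFrame that Camelot exposes as `table.df`.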

anakin87 commented 4 years ago

Without the file, it is difficult to help you. A one-page PDF showing the issue would help.

apache135 commented 4 years ago

Sorry, the file is encrypted. I will find a similar one and upload it. Thanks.

apache135 commented 4 years ago

I found that the colored text is a text annotation, which seems to have been added on top of the original PDF file with a PDF editor. Does Camelot support extracting text annotations? I can't find the relevant information in the docs.
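In case the annotation text has to be read separately from the table, here is a minimal sketch that uses the third-party pypdf package (an assumption, it is not something Camelot provides). It assumes the colored text is stored as a /FreeText or /Text annotation with a /Contents entry, which I have not verified against the actual file. The function names `annotation_texts` and `annotations_by_page` are made up for illustration.

```python
def annotation_texts(page, subtypes=("/FreeText", "/Text")):
    """Collect the /Contents strings of matching annotations on one page.

    `page` is expected to behave like a pypdf page object (a dict-like
    object that may contain an "/Annots" array).
    """
    texts = []
    if "/Annots" not in page:
        return texts
    for ref in page["/Annots"]:
        # Entries are usually indirect references; resolve them if needed.
        annot = ref.get_object() if hasattr(ref, "get_object") else ref
        if annot.get("/Subtype") in subtypes and "/Contents" in annot:
            texts.append(str(annot["/Contents"]))
    return texts


def annotations_by_page(path):
    """Map 1-based page numbers to the annotation texts found on each page."""
    from pypdf import PdfReader  # third-party; assumed to be installed

    reader = PdfReader(path)
    return {
        number: annotation_texts(page)
        for number, page in enumerate(reader.pages, start=1)
    }
```

Calling `annotations_by_page` with the path to the PDF would show whether the missing cell text is present as an annotation. Placing that text back into the right table cell would still need the annotation's /Rect coordinates, which this sketch does not handle.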