atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Lattice with html output truncates text longer than 46 characters #279

Closed pablobarria closed 4 years ago

pablobarria commented 5 years ago

Version: 0.7.1

Basically what the title says: using lattice with HTML output, text longer than 46 characters is truncated in the resulting files, with an ellipsis appended to indicate the missing text. I'm not sure whether this can be turned off; it doesn't appear to be an option.

anakin87 commented 5 years ago

Since the table is treated as a pandas DataFrame, this behaviour depends on the pandas configuration.

Try with the following:

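import pandas as pd

# Lift the column-width limit so cell text is not cut off.
# pandas >= 1.0 uses None for "no limit"; older releases used -1.
pd.set_option('display.max_colwidth', None)

# tables is the TableList returned by camelot.read_pdf(...)
tables[0].to_html('table_0.htm')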

pablobarria commented 5 years ago

Thank you. Is there a way to address this for the CLI version?

vinayak-mehta commented 5 years ago

There isn't a way to do this from the CLI right now. Would you like to open a PR?

pablobarria commented 5 years ago

Don't really know what that entails.

anakin87 commented 5 years ago

@vinayak-mehta: Should no truncation be the default behaviour, or only an option?

vinayak-mehta commented 5 years ago

Since the default user expectation is to get all the text out of a PDF, we could have this as a default behavior when output format is HTML. What do you think?

anakin87 commented 5 years ago

When the output is CSV or another non-HTML format, don't we want to extract all the text anyway?
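
For reference, the pandas display option only affects rendered output such as HTML; to_csv writes the raw values regardless. The sketch below illustrates this with a plain DataFrame and made-up data, assuming pandas >= 1.0. The helper to_html_untruncated is hypothetical and is not part of Camelot or its CLI; it only shows one way an HTML-only default could be scoped.

```python
import pandas as pd


def to_html_untruncated(df: pd.DataFrame) -> str:
    """Hypothetical helper: render HTML with the column-width limit lifted."""
    with pd.option_context("display.max_colwidth", None):
        return df.to_html()


long_text = "y" * 120
df = pd.DataFrame({"cell": [long_text]})

# Even with a very small display limit, to_csv keeps the full cell text,
# because CSV export does not consult the display options.
with pd.option_context("display.max_colwidth", 20):
    csv_text = df.to_csv(index=False)

html_text = to_html_untruncated(df)

print(long_text in csv_text)
print(long_text in html_text)
```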