Closed HiromuHota closed 4 years ago
The output from Tabula:
df = tabula.read_pdf(self.pdf_file, pages=page_num, area=table, output_format="dataframe")
print(df)
[ ame Lunch order Spicy Unnamed: 1 Owe
0 oan saag paneer medium NaN $1
1 ally vindaloo mild NaN $1
2 rin lamb madras HOT NaN $]
When 1 point of horizontal margin is added to the table area:
df = tabula.read_pdf(self.pdf_file, pages=page_num, area=[table[0], table[1] - 1, table[2], table[3] + 1], output_format="dataframe")
print(df)
[ Name Lunch order Spicy Unnamed: 1 Owes
0 Joan saag paneer medium NaN $11
1 Sally vindaloo mild NaN $14
2 Erin lamb madras HOT NaN $5]
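The margin workaround above can be wrapped in a small helper. This is only a sketch: `pad_area` is a hypothetical name, not part of tabula-py, and it assumes the box is given in tabula-py's `area` order of (top, left, bottom, right) in points, as in the calls above.

```python
def pad_area(area, horizontal=1.0, vertical=0.0):
    """Return a copy of `area` grown by the given margins (in points).

    `area` is assumed to be (top, left, bottom, right), the order that
    tabula-py expects for its `area` argument.
    """
    top, left, bottom, right = area
    return [
        max(0.0, top - vertical),     # never move above the page origin
        max(0.0, left - horizontal),  # never move left of the page origin
        bottom + vertical,
        right + horizontal,
    ]


# Equivalent to the manual adjustment shown above:
# tabula.read_pdf(pdf_file, pages=page_num, area=pad_area(table),
#                 output_format="dataframe")
```

Whether 1 point is enough probably depends on how tightly the detected box hugs the glyphs, so the margin is left as a parameter.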
Describe the bug
Cell values are missing from a table.
tests/input/md.pdf contains a table like the one below:
Here is the extracted table:
https://github.com/HazyResearch/fonduer/blob/master/tests/data/hocr_simple/md.hocr
"lamb madras" and "HOT" are missing.
To Reproduce
Steps to reproduce the behavior:
pdftotree tests/input/md.pdf -o md.hocr
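After running the command, one rough way to confirm which words are absent from the generated file is to strip the markup and compare against the words expected from the PDF. The sketch below uses only the Python standard library; `missing_words` is a hypothetical helper, and it assumes the hOCR output is HTML-like markup whose text nodes carry the recognized words.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects every text node found in the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def missing_words(hocr_markup, expected_words):
    """Return the expected words that do not appear in the hOCR text."""
    collector = _TextCollector()
    collector.feed(hocr_markup)
    collector.close()
    # Split on whitespace so each recognized word is compared on its own.
    found = set(" ".join(collector.chunks).split())
    return [word for word in expected_words if word not in found]


# Usage (assumes md.hocr was produced by the command above):
# with open("md.hocr", encoding="utf-8") as f:
#     print(missing_words(f.read(), ["lamb", "madras", "HOT"]))
```

The comparison is per word and case-sensitive, so "lamb madras" is checked as two separate words.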
Expected behavior
"lamb madras" and "HOT" are not missing.
Error Logs/Screenshots
No error log.
Environment (please complete the following information):
pdftotree
Version: v0.5.1+dev (bc658f70d289d38b41377be62ea51135cb723a8c)
pdfminer.six
Version: 20200726