HazyResearch / pdftotree

A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License

Cell values are missing from a table #96

Status: Closed (closed by HiromuHota 4 years ago)

HiromuHota commented 4 years ago

Describe the bug

Cell values are missing from a table.

tests/input/md.pdf contains a table like below:

[image: the table as it appears in md.pdf]

Here is the extracted table:

<table class="ocr_table" title="bbox 37 309 209 370">
    <tr>
        <td title="bbox 37 311 136 318">
            <span class="ocrx_line" title="bbox 37 309 135 321">
                <span class="ocrx_word" title="bbox 37 309 67 321">Name</span>
                <span class="ocrx_word" title="bbox 70 309 103 321">Lunch</span>
                <span class="ocrx_word" title="bbox 106 309 135 321">order</span>
            </span>
        </td>
        <td title="bbox 144 311 173 318">
            <span class="ocrx_line" title="bbox 144 309 172 321">
                <span class="ocrx_word" title="bbox 144 309 172 321">Spicy</span>
            </span>
        </td>
        <td title="bbox 181 311 205 318">
            <span class="ocrx_line" title="bbox 181 309 209 321">
                <span class="ocrx_word" title="bbox 181 309 209 321">Owes</span>
            </span>
        </td>
    </tr>
    <tr>
        <td title="bbox 37 328 128 334">
            <span class="ocrx_line" title="bbox 37 325 59 337">
                <span class="ocrx_word" title="bbox 37 325 59 337">Joan</span>
            </span>
            <span class="ocrx_line" title="bbox 70 325 126 337">
                <span class="ocrx_word" title="bbox 70 325 91 337">saag</span>
                <span class="ocrx_word" title="bbox 94 325 126 337">paneer</span>
            </span>
        </td>
        <td title="bbox 138 328 179 334">
            <span class="ocrx_line" title="bbox 138 325 177 337">
                <span class="ocrx_word" title="bbox 138 325 177 337">medium</span>
            </span>
        </td>
        <td title="bbox 192 328 209 334">
            <span class="ocrx_line" title="bbox 192 325 209 337">
                <span class="ocrx_word" title="bbox 192 325 209 337">$11</span>
            </span>
        </td>
    </tr>
    <tr>
        <td title="bbox 37 344 113 351">
            <span class="ocrx_line" title="bbox 37 342 61 354">
                <span class="ocrx_word" title="bbox 37 342 61 354">Sally</span>
            </span>
            <span class="ocrx_line" title="bbox 70 342 112 354">
                <span class="ocrx_word" title="bbox 70 342 112 354">vindaloo</span>
            </span>
        </td>
        <td title="bbox 138 344 162 351">
            <span class="ocrx_line" title="bbox 138 342 160 354">
                <span class="ocrx_word" title="bbox 138 342 160 354">mild</span>
            </span>
        </td>
        <td title="bbox 191 344 209 351">
            <span class="ocrx_line" title="bbox 191 342 209 354">
                <span class="ocrx_word" title="bbox 191 342 209 354">$14</span>
            </span>
        </td>
    </tr>
    <tr>
        <td title="bbox 37 361 133 367">
            <span class="ocrx_line" title="bbox 37 358 57 370">
                <span class="ocrx_word" title="bbox 37 358 57 370">Erin</span>
            </span>
        </td>
        <td/>
        <td title="bbox 197 361 209 367">
            <span class="ocrx_line" title="bbox 197 358 209 370">
                <span class="ocrx_word" title="bbox 197 358 209 370">$5</span>
            </span>
        </td>
    </tr>
</table>

https://github.com/HazyResearch/fonduer/blob/master/tests/data/hocr_simple/md.hocr

"lamb madras" and "HOT" are missing from the last row: the first cell contains only "Erin", and the second cell is an empty <td/>.
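One way to spot this kind of gap automatically is to walk the extracted table and list the cells that contain no words. The following is only a minimal sketch using the Python standard library; `cell_texts` is a hypothetical helper (not part of pdftotree), and the sample is an abridged copy of the last row above with the bbox titles omitted.

```python
import xml.etree.ElementTree as ET


def cell_texts(table_xml):
    """Return the table as a list of rows, each a list of cell strings.

    Hypothetical helper for illustration; it assumes the table fragment
    is well-formed XML, as in the hOCR excerpt above.
    """
    table = ET.fromstring(table_xml)
    rows = []
    for tr in table.iter("tr"):
        row = []
        for td in tr.iter("td"):
            # Collect only the word-level spans inside this cell.
            words = [
                span.text or ""
                for span in td.iter("span")
                if span.get("class") == "ocrx_word"
            ]
            row.append(" ".join(words))
        rows.append(row)
    return rows


# Abridged copy of the last row of the extracted table (bbox titles omitted).
sample = """<table class="ocr_table">
    <tr>
        <td><span class="ocrx_line"><span class="ocrx_word">Erin</span></span></td>
        <td/>
        <td><span class="ocrx_line"><span class="ocrx_word">$5</span></span></td>
    </tr>
</table>"""

rows = cell_texts(sample)
# (row index, column index) of every cell with no text at all.
empty = [(r, c) for r, row in enumerate(rows) for c, text in enumerate(row) if not text]
print(rows)
print(empty)
```

A check like this only finds cells that are completely empty (the missing "HOT"); a cell that lost part of its content (the missing "lamb madras") still has to be compared against the expected text.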

To Reproduce

Steps to reproduce the behavior:

  1. Install pdftotree from the latest commit (bc658f70d289d38b41377be62ea51135cb723a8c)
  2. Execute pdftotree tests/input/md.pdf -o md.hocr

Expected behavior

"lamb madras" and "HOT" appear in the extracted table, in the same row as "Erin".

Error Logs/Screenshots

No error log.

HiromuHota commented 4 years ago

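The output from Tabula for the detected table area. The first and last characters of each row are clipped ("ame" for "Name", "Owe" for "Owes", "$1" for "$11"):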

df = tabula.read_pdf(self.pdf_file, pages=page_num, area=table, output_format="dataframe")
print(df)
[    ame Lunch order Spicy  Unnamed: 1 Owe
0  oan saag paneer medium         NaN  $1
1      ally vindaloo mild         NaN  $1
2     rin lamb madras HOT         NaN   $]

When a horizontal margin of 1 point is added to each side of the table area, the full values are returned:

df = tabula.read_pdf(self.pdf_file, pages=page_num, area=[table[0], table[1] - 1, table[2], table[3] + 1], output_format="dataframe")
print(df)
[    Name Lunch order Spicy  Unnamed: 1 Owes
0  Joan saag paneer medium         NaN  $11
1      Sally vindaloo mild         NaN  $14
2     Erin lamb madras HOT         NaN   $5]
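The margin adjustment above can be pulled into a small helper. This is a sketch, not the fix that was applied in pdftotree: `pad_horizontally` is a hypothetical name, and it assumes the area is ordered (top, left, bottom, right), which is the order tabula-py's `area` argument takes and which the indices in the call above imply. The coordinates in the example are illustrative only.

```python
def pad_horizontally(area, margin=1.0):
    """Widen a (top, left, bottom, right) area by `margin` points per side.

    Hypothetical helper: only the left and right edges move, so the rows
    captured by the area stay the same.
    """
    top, left, bottom, right = area
    return [top, left - margin, bottom, right + margin]


# Illustrative coordinates only; in the call above the area comes from `table`.
table = [309, 37, 370, 209]
padded = pad_horizontally(table)
print(padded)
```

The padded list can then be passed as `area=` to `tabula.read_pdf` in place of the inline expression. Whether 1 point is enough for other documents is not established here, so the margin is left as a parameter.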