atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Yet another PDF datasheet with many tables that doesn't really work #459

Open yeus opened 3 years ago

yeus commented 3 years ago

https://www.mg-solar-shop.de/media/pdf/62/2e/3d/112768_Datenblatt_Trina_TSM_DE06M.08(II).pdf

The tables are very "graphic", and I cannot give any hints about where they are, as each PDF in my dataset is quite different.

Camelot correctly identifies lines using the "Lattice" method, but then it gets lost. Among other things, it doesn't seem to recognize joints correctly. I have played around with the parameters but didn't get anything useful.

I would appreciate any help on this.

giampaolo44 commented 3 years ago

Can you say more on what you are trying to accomplish?

I am asking this because by giving the appropriate coordinates of the first table of page 2 I could get a decent representation of it using Stream with default parameters (with Lattice it was throwing an error). It could be further improved of course (coords were strangely not precise enough to skip the two notes), but first it would be wise to understand the above.

These were the coordinates I passed in:

--coords {'x': 310, 'y': 110, 'w': 535, 'h': 196} <-- these come from the page converted to PNG and are then recalculated to the PDF coordinates that Camelot expects

--table_areas: 206,768,563,637 <-- these should be the PDF coordinates, passed to Camelot
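The recalculation from PNG pixels to PDF coordinates can be sketched as follows. This is only an illustration: `pixel_box_to_table_area` is a hypothetical helper, not part of Camelot, and the 108 DPI rendering resolution and the A4 page height of 841.89 pt are assumptions inferred from the numbers above, not values stated in this thread.

```python
def pixel_box_to_table_area(box, page_height_pt, dpi):
    """Convert a pixel box (top-left origin) into an "x1,y1,x2,y2" string.

    The result uses PDF points with a bottom-left origin, where (x1, y1)
    is the top-left corner and (x2, y2) the bottom-right corner of the area.
    """
    scale = 72.0 / dpi  # PDF points per image pixel (assumed resolution)
    x1 = box["x"] * scale
    x2 = (box["x"] + box["w"]) * scale
    # Flip the vertical axis: images count from the top, PDFs from the bottom.
    y1 = page_height_pt - box["y"] * scale
    y2 = page_height_pt - (box["y"] + box["h"]) * scale
    return ",".join(str(int(v)) for v in (x1, y1, x2, y2))


coords = {"x": 310, "y": 110, "w": 535, "h": 196}
# Assumed: A4 page height (841.89 pt) and a 108 DPI rendering.
print(pixel_box_to_table_area(coords, page_height_pt=841.89, dpi=108))
# prints 206,768,563,637
```

Under those assumptions the helper reproduces the table_areas value quoted above; for a different page size or resolution, both numbers would need to be adjusted.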

I used these parameters:

--kwargs: {'strip_text': '', 'split_text': True, 'flag_size': True, 'flag_size_sup': True, 'row_tol': 2, 'col_tol': 0}

which translate into strip_text=,split_text=True,flag_size=True,flag_size_sup=True,row_tol=2,col_tol=0 <-- parameters for the Camelot command, which in Python was:

    import camelot

    # kwargs and table_areas hold the values listed above
    table = camelot.read_pdf(
        pdf_file,
        flavor='stream',
        table_areas=[table_areas],
        pages=page,
        **kwargs,
    )

    # Convert the first extracted table to my desired output, i.e. HTML
    t_html = table[0].df.to_html(header=False, index=False)

This is the resulting HTML:

<table class="dataframe" border="1">
  <tbody>
    <tr>
      <td></td>
      <td>TSM-330</td>
      <td>TSM-335</td>
      <td>TSM-340</td>
    </tr>
    <tr>
      <td>ELEKTRISCHE DATEN @ STC</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td></td>
      <td>DE06M.08(II)</td>
      <td>DE06M.08(II)</td>
      <td>DE06M.08(II)</td>
    </tr>
    <tr>
      <td>Nominalleistung-P&lt;s&gt;MAX&lt;/s&gt; (Wp)*</td>
      <td>330</td>
      <td>335</td>
      <td>340</td>
    </tr>
    <tr>
      <td>Leistungstoleranz-P&lt;s&gt;MAX&lt;/s&gt; (W)</td>
      <td>0/+5</td>
      <td>0/+5</td>
      <td>0/+5</td>
    </tr>
    <tr>
      <td>Spannung im MPP-U&lt;s&gt;MPP&lt;/s&gt; (V)</td>
      <td>33,8</td>
      <td>34,0</td>
      <td>34,2</td>
    </tr>
    <tr>
      <td>Strom im MPP-I&lt;s&gt;MPP&lt;/s&gt; (A)</td>
      <td>9,76</td>
      <td>9,85</td>
      <td>9,94</td>
    </tr>
    <tr>
      <td>Leerlaufspannung-U&lt;s&gt;OC&lt;/s&gt; (V)</td>
      <td>40,6</td>
      <td>40,7</td>
      <td>41,1</td>
    </tr>
    <tr>
      <td>Kurzschlusstrom-I&lt;s&gt;SC&lt;/s&gt; (A)</td>
      <td>10,4</td>
      <td>10,5</td>
      <td>10,6</td>
    </tr>
    <tr>
      <td>Modulwirkungsgrad η&lt;s&gt;m&lt;/s&gt; (%)</td>
      <td>19,4</td>
      <td>19,7</td>
      <td>19,9</td>
    </tr>
    <tr>
      <td>STC: Einstrahlung 1000 W/m², Zelltemperatur 25 °C, Sp</td>
      <td>ektrale Verteilung von AM1,5</td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>*Messtoleranz: ±3%</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
  </tbody>
</table>

yeus commented 3 years ago

@giampaolo44, thanks for trying the coordinates. At least with those, the table gets extracted. I am trying to extract all the tables from page 2, which is apparently quite hard. I have several similar PDFs from the same company, but the tables are all over the place, so it isn't possible (or at least it is hard to do automatically) to give specific coordinates like you did.

I am surprised that Camelot can't recognize the tables on page 2 using Lattice, as they look very "clear" to my human eye.

I would also be fine with getting a lot of "false positives", as long as the right tables are in the list as well.

Can I somehow get access to the PNG? Maybe that way I could try to detect the table areas myself with a custom function, and then pass a list of table areas as hints to the Stream method.

giampaolo44 commented 3 years ago

@giampaolo44, thanks for trying the coordinates. At least with those, the table gets extracted. I am trying to extract all the tables from page 2, which is apparently quite hard.

Sure. I doubt it is much consolation, but everything in PDF conversion and extraction is quite hard. I was able to try the coordinates because I wrote a custom piece of software to select areas from PDFs and apply Camelot as well as OCR tools to them, and it has been quite a challenge. Sorry, I can't share the code, as it belongs to my company.

Can I somehow get access to the PNG? Maybe that way I could try to detect the table areas myself with a custom function, and then pass a list of table areas as hints to the Stream method.

You can convert PDF pages into PNGs with many tools. On the command line you can try ImageMagick's convert utility; in Python you might want to try convert_from_path from pdf2image. Remember to take PDF's reversed coordinates into account: start counting from the bottom left rather than the top left, as we tend to do with images.
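A minimal sketch of the pdf2image route, assuming pdf2image and its Poppler dependency are installed. The helper names (`render_page`, `page_size_in_points`, `flip_y`), the file name and the 108 DPI default are illustrative assumptions, not part of Camelot or pdf2image.

```python
def render_page(pdf_path, page_number, dpi=108):
    """Render one PDF page (1-based) to a PIL image with pdf2image."""
    # Imported lazily so the pure helpers below work without pdf2image.
    from pdf2image import convert_from_path

    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return images[0]


def page_size_in_points(image_size, dpi):
    """Recover the page size in PDF points from the rendered pixel size."""
    width_px, height_px = image_size
    return width_px * 72.0 / dpi, height_px * 72.0 / dpi


def flip_y(y_px, image_height_px, dpi):
    """Map a pixel row counted from the top to PDF points from the bottom."""
    return (image_height_px - y_px) * 72.0 / dpi


# Hypothetical usage (file name is a placeholder):
# image = render_page("datasheet.pdf", page_number=2)
# image.save("page2.png")
# page_w_pt, page_h_pt = page_size_in_points(image.size, dpi=108)
```

Any table areas detected on the PNG would then go through `flip_y` (plus the same 72/dpi scaling for x) before being passed to Camelot as table_areas.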

yeus commented 3 years ago

No worries ;). I guess I'll just have to do some experiments on my own as well. I'll try to remember to post my results here once I have done a bit more in this regard.