Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/Inferred Table Data -- info in text_as_html far less than text (cropped?) #2478

Closed hschmied closed 5 months ago

hschmied commented 9 months ago

Describe the bug I am parsing a PDF, which contains text and tables. It's in German, has a complex layout of many smaller tables, uses Umlauts (ä, ö, ü), and so on.

I am inferring tables and noticed that in the returned elements (type: Table) the information in "text_as_html" is sometimes far less than in "text" or in the original PDF.

I wonder if this example/case is just too complex to be parsed well or if it would be possible with some prior preprocessing/transcoding, different configuration or use of another model (other than the default-hi_res_model).

Any feedback or pointers on what I can do to improve the result would be appreciated. Thanks!

To Reproduce The way I call my unstructured service (hosted on Azure) is, I think, straightforward...

        from unstructured.partition.api import partition_via_api

        elements = partition_via_api(
            api_url="http://***/general/v0/general",
            api_key="***",
            file=file,
            metadata_filename=file_name,
            strategy="hi_res",
            pdf_infer_table_structure=True,
            skip_infer_table_types="[]",
            chunking_strategy="by_title",
            max_characters="4000",
            new_after_n_chars="3800",
        )

Here's one of the extracted elements, which is faulty...

        {
            "element_id": "0da0d11164d4c4876aa721503d395782",
            "metadata": {
                "filename": "lampe02.pdf",
                "filetype": "application/pdf",
                "page_number": 1,
                "text_as_html": "<table><tr><td>e \u2014</td><td></td><td>E</td><td></td></tr><tr><td></td><td></td><td>E==\u2014\u20141</td><td></td></tr><tr><td>Transportkarton/Abmessun gen</td><td>L=562 B=531 H=245 mm</td><td>Enthalt verbaute LED.</td><td>nein</td></tr><tr><td></td><td></td><td>1P-Schutzart</td><td>20</td></tr><tr><td rowspan=\"2\">Triman-Kennzeichen</td><td></td><td>Kabelende</td><td>Direktanschiuf</td></tr><tr><td></td><td></td><td>Schutzklasse</td><td></td></tr></table>"
            },
            "text": "Artikel Elektrische Daten Lichttechnische Daten Produktma\u00dfe + Gewicht Artikelvariante 20100302C Dimmbar mit externem Dimmer nein Farbkonsistenz initial < 5 L\u00e4nge/Tiefe 507 mm Barcode Verpackungseinheit 4004894534999 Farbwertanteil X 0,459 Breite 40 mm Elektrischer Leistungsfaktor > 0,90 Farbwertanteil Y 0,413 H\u00f6he 60 mm Hersteller M\u00fcller-Licht Energieeffizienzklasse enthaltene Lichtquelle Lichtquelle mit EEK: F Farbtemperatur 2700 K Gewicht 301,00 g Zolltarifnummer 94051040900 Farbwiedergabeeigenschaft R9 \u22655 Produktdaten Gewichteter Verbrauch 8 kWh/1000h Verpackung Austausch stromlos nein Lebensdauer Nominalwert 25000 h Colorbox/Barcode 1 4004894534999 Farbwiedergabeeigenschaft Ra \u226580 Beleuchtungstechnologie LED Leistungsaufnahme Nominalwert 8 W Colorbox/Inhalt (St\u00fcck) 1 Bel\u00fcftung erforderlich nein Lichtausbeute Nominalwert 88 lm/W Colorbox/Gewicht 79,00 g Nom. Stromst\u00e4rke 70 mA Frostung Chemisch Lichtfarbe warmwhite Colorbox/Abmessungen L=65 B=43 H=545 mm Spannung Nominalwert 230 V Modell (Technisch) LED-R\u00f6hre Lichtstrom enthaltene Lichtquelle 700 lm Innerbox/Barcode 1 4004894852802 Stromart AC Nicht in Reflektoren betreiben nein Innerbox/Inhalt (St\u00fcck) 2 Frequenz Nominalwert 50/60 Hz Stroboskopeffekt 0,9 Innerbox/Gewicht 85,00 g Sockel S14s Verschiebungsfaktor (cos \u03c6) 0,62 Spektrumbild Innerbox/Abmessungen L=550 B=102 H=75 mm Inverkehrbringer M\u00fcller-Licht Transportkarton/Barcode 1 4004894852819 Marke M\u00fcller-Licht Transportkarton/Inhalt (St\u00fcck) 30 CE Kennzeichnung ja Garantieprodukt 5 Jahre Garantiebedingungen Transportkarton/Gewicht 1050,00 g Leuchtendaten Transportkarton/Abmessun gen L=562 B=531 H=245 mm Enth\u00e4lt verbaute LED nein IP-Schutzart 20 Umwelteigenschaften Kabelende Direktanschlu\u00df Triman-Kennzeichen Schutzklasse II",
            "type": "Table"
        },

Expected behavior More of the data from the PDF ending up in text_as_html.
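To put a number on how much of "text" is missing from "text_as_html", the two fields can be compared token by token. The following is a minimal sketch using only the Python standard library. It assumes elements are plain dictionaries shaped like the JSON above; the names `CellTextCollector` and `html_coverage` are hypothetical helpers, not part of unstructured.

```python
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Collect the text content of every <td>/<th> cell in an HTML table."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def html_coverage(element):
    """Fraction of whitespace-separated tokens in element["text"] that
    also appear in the cells of element["metadata"]["text_as_html"]."""
    collector = CellTextCollector()
    collector.feed(element.get("metadata", {}).get("text_as_html", ""))
    html_tokens = set(" ".join(collector.cells).split())
    text_tokens = element.get("text", "").split()
    if not text_tokens:
        return 1.0  # nothing to cover
    hits = sum(1 for tok in text_tokens if tok in html_tokens)
    return hits / len(text_tokens)


# Hypothetical, shortened element for illustration only.
sample = {
    "type": "Table",
    "text": "IP-Schutzart 20 Schutzklasse II",
    "metadata": {
        "text_as_html": "<table><tr><td>IP-Schutzart</td><td>20</td></tr></table>"
    },
}
print(html_coverage(sample))  # 2 of 4 tokens found -> 0.5
```

A low value for a Table element suggests that cells were dropped or misread during table structure inference. Exact token matching is deliberately strict, so OCR misreadings such as "Direktanschiuf" also count as missing.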

Screenshots This is the corresponding section in the PDF... (image: Screenshot 2024-01-30 at 18 52 51)

And this is what's left in 'text_as_html'... (image: Screenshot 2024-01-30 at 19 01 45)

Environment Info Running the unstructured-api image on an Azure VM


isaacna commented 8 months ago

I'm seeing this with partition_xlsx as well. In a pretty small 100-row sheet, text_as_html only returns the first ~30 rows.

scanny commented 8 months ago

@isaacna can you check your unstructured version and update to the latest? There was a bug fix related to missing items in XLSX recently.

isaacna commented 8 months ago

@scanny That fixed the issue, thanks! We were previously on 0.12.4
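For anyone hitting the same XLSX symptom, a quick way to see which version is installed before upgrading is sketched below. It uses only the standard library; `installed_version` is a hypothetical helper name. The thread does not say which release contains the fix, so the sketch only reports the version and does not compare it against a threshold.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="unstructured"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


print(installed_version())
```

If the reported version is old, `pip install --upgrade unstructured` should pull the latest release.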

christinestraub commented 8 months ago

@isaacna Can you share the PDF document you're trying?

isaacna commented 8 months ago

@christinestraub This was for an Excel spreadsheet (just some filler dummy data), not a PDF. We didn't see this issue for tables nested in PDFs specifically

christinestraub commented 8 months ago

@hschmied Can you share the PDF document you're trying?

hschmied commented 8 months ago

certainly -- here you go... lampe02.pdf

thank you! @christinestraub

christinestraub commented 8 months ago

@hschmied We've made some updates in table extraction recently. Although it's not perfect yet for your pdf, I can confirm that it has a few improvements. Did you try your code recently? You'll need to pass languages=["deu"] to improve text accuracy. We'll consider this case for further improvement.
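As a rough sketch of this suggestion for a local call, the settings could be passed to partition_pdf as shown below. The parameter names follow the partition_pdf call that appears later in this thread. The helpers `build_partition_kwargs` and `partition_german_pdf` are hypothetical, and the sketch assumes the OCR language data for German is installed on the machine. How the same option is forwarded through a self-hosted API depends on the API version and is not shown here.

```python
def build_partition_kwargs(languages=("deu",)):
    """Keyword arguments for a hi_res partition with table inference."""
    return {
        "strategy": "hi_res",
        "infer_table_structure": True,
        # Language codes as used in this thread, e.g. "deu" or "eng".
        "languages": list(languages),
    }


def partition_german_pdf(filename):
    """Partition a German-language PDF (requires unstructured to be installed)."""
    # Imported lazily so the sketch can be loaded without unstructured present.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=filename, **build_partition_kwargs())


print(build_partition_kwargs())
```

Calling `partition_german_pdf("lampe02.pdf")` would then return the element list; Table elements carry the HTML in `element.metadata.text_as_html`.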

hschmied commented 7 months ago

I have not checked recently, but will. thank you!

hschmied commented 7 months ago

quick update -- I looked into it and still got the old result, but I suspect the hosted image of the unstructured-api on Azure isn't running the latest API version and won't be until I update it myself. currently figuring out what needs to happen to get my Azure service up to date.

hschmied commented 7 months ago

just tested it -- it's a great improvement! I tested with the same config as before and looked at the original section --> Screenshot: left-most = original, second = same settings w/ new API version

then I tried it with the setting "languages: ['deu', 'eng']" and finally just with "languages: ['deu']"... (image)

it's not perfect yet, but a lot better. thank you!

LucasOliveira44 commented 6 months ago

@christinestraub Hi, I'm also facing the same issue. I am using yolox, and the model picks up the table, but only the body and not the header. In addition, text_as_html crops the body of the table, leaving out the last row entirely.

This is my partition_pdf call. Unfortunately I cannot share the table or the PDF, but it is a very small table and the PDF is not complex at all. I am using version 0.13.6 of unstructured.

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename=filename,
    strategy="hi_res",
    hi_res_model_name="yolox",
    infer_table_structure=True,
    languages=["eng"],
)

If anyone has any advice I would appreciate it. Thanks

MthwRobinson commented 5 months ago

Closing this one. If you need to process pages fast, our recommendation is to use the unstructured-python-client library with our SaaS API. That will split up the PDF and distribute the workload across multiple workers.