Closed hschmied closed 5 months ago
I'm also seeing this with partition_xlsx. In a pretty small 100-row sheet, text_as_html only returns the first ~30 rows.
@isaacna can you check your unstructured version and update to the latest? There was a bug fix related to missing items in XLSX recently.
@scanny That fixed the issue, thanks! We were previously on 0.12.4
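For anyone else landing here, a minimal sketch of checking the installed version against the one reported as affected above (0.12.4). The thread does not say which release contains the XLSX fix, so this only tells you whether you are past 0.12.4; the helper names are made up for illustration.

```python
from importlib import metadata


def parse_version(version: str) -> tuple:
    """Turn '0.12.4' into (0, 12, 4); a non-numeric suffix ends the parse."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package: str = "unstructured"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_newer_than(installed: str, reference: str) -> bool:
    """True if `installed` is strictly newer than `reference`."""
    return parse_version(installed) > parse_version(reference)


if __name__ == "__main__":
    current = installed_version()
    if current is None:
        print("unstructured is not installed")
    elif is_newer_than(current, "0.12.4"):
        print(f"unstructured {current} is newer than 0.12.4")
    else:
        print(f"unstructured {current} is not newer than 0.12.4, consider upgrading")
```

If it reports an old version, upgrading with pip install -U unstructured is the usual route.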
@isaacna Can you share the PDF document you're trying?
@christinestraub This was for an Excel spreadsheet (just some filler dummy data), not a PDF. We didn't see this issue for tables nested in PDFs specifically
@hschmied Can you share the PDF document you're trying?
certainly -- here you go... lampe02.pdf
thank you! @christinestraub
@hschmied We've made some updates to table extraction recently. Although it's not perfect yet for your PDF, I can confirm that there are a few improvements. Have you tried your code recently? You'll need to pass languages=["deu"] to improve text accuracy. We'll consider this case for further improvement.
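A rough sketch of what passing languages=["deu"] could look like when calling the library directly rather than the hosted API. The two helper functions are hypothetical; the strategy and infer_table_structure values are assumptions for a table-extraction setup, not settings confirmed for this PDF.

```python
def build_partition_kwargs(filename, languages=("deu",)):
    """Collect keyword arguments for partition_pdf (hypothetical helper).

    hi_res plus infer_table_structure is assumed here because the goal is
    table extraction; adjust to whatever configuration you actually run.
    """
    return {
        "filename": filename,
        "strategy": "hi_res",
        "infer_table_structure": True,
        "languages": list(languages),  # OCR language codes, e.g. "deu" for German
    }


def partition_german_pdf(filename):
    """Partition a PDF using German as the OCR language."""
    # Imported lazily so build_partition_kwargs works without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(**build_partition_kwargs(filename))
```

Calling partition_german_pdf("lampe02.pdf") would then return the element list; passing languages=("deu", "eng") to build_partition_kwargs covers mixed-language documents.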
I have not checked recently, but will. thank you!
quick update -- I looked into it and still got the old result, but I suspect the issue is that the hosted image of the unstructured-api on azure isn't running on the latest api-version, unless I do something. currently figuring out what needs to happen to get my azure-service up-to-date.
just tested it -- it's a great improvement! I tested with the same config as before and looked at the original section --> Screenshot: left-most = original, second = same settings w/ new api-version
then I tried it with the setting "languages: ['deu', 'eng']" and finally just with "languages: ['deu']"...
it's not perfect yet, but a lot better. thank you!
@christinestraub Hi, I'm also facing the same issue. I am using yolox, and the model picks up the table, but only the body and not the header. In addition, text_as_html cropped the body of the table, leaving out the last row entirely.
This is my partition_pdf call. Unfortunately I cannot share the table or the PDF, but it is a very small table and the PDF is not complex at all. I am using version 0.13.6 of unstructured.
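from unstructured.partition.pdf import partition_pdf

# filename: path to the PDF being parsed
elements = partition_pdf(
    filename=filename,
    strategy="hi_res",
    hi_res_model_name="yolox",
    infer_table_structure=True,  # populates text_as_html on Table elements
    languages=["eng"],
)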
If anyone has any advice I would appreciate it. Thanks
Closing this one. If you need to process pages fast, our recommendation is to use the unstructured-python-client library with our SaaS API. That will split up the PDF and distribute the workload across multiple workers.
Describe the bug I am parsing a PDF, which contains text and tables. It's in German, has a complex layout of many smaller tables, uses Umlauts (ä, ö, ü), and so on.
I am inferring tables and noticed that in the returned elements (type: Table) "text_as_html" sometimes contains far less information than "text" or the original PDF.
I wonder if this example/case is just too complex to be parsed well or if it would be possible with some prior preprocessing/transcoding, different configuration or use of another model (other than the default-hi_res_model).
Any feedback/pointers what I can do to improve the result, would be appreciated. Thanks!
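To put a number on the gap between "text" and "text_as_html", something like the following could help. It is a minimal sketch using only the standard library: html_coverage is a hypothetical helper that counts table rows in the HTML and lists the words from "text" that never show up in it. The word-level comparison is a crude heuristic, not how unstructured itself evaluates tables.

```python
from html.parser import HTMLParser


class _TableTextParser(HTMLParser):
    """Counts <tr> tags and collects the visible text of an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def html_coverage(text, text_as_html):
    """Compare an element's plain text with its HTML table rendering."""
    parser = _TableTextParser()
    parser.feed(text_as_html or "")
    parser.close()
    html_words = set(" ".join(parser.chunks).split())
    text_words = text.split()
    # Words present in the plain text but absent from the HTML table.
    missing = [word for word in text_words if word not in html_words]
    coverage = 1.0 if not text_words else 1 - len(missing) / len(text_words)
    return {"rows": parser.rows, "coverage": coverage, "missing": missing}
```

With elements returned by unstructured, this would be called as html_coverage(el.text, el.metadata.text_as_html) for each element whose category is "Table"; a low coverage value points at the tables worth reporting.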
To Reproduce The way I call my unstructured-service (hosted on Azure) is, I think, straightforward...
Here's one of the extracted elements, which is faulty...
Expected behavior More of the data from the PDF ending up in text_as_html.
Screenshots This is the corresponding section in the PDF...
And this is what's left in 'text_as_html'...
Environment Info Running the unstructured-api image on an Azure VM