chore: change table extraction defaults

Unstructured-IO / unstructured-api

Apache License 2.0

446 stars 101 forks source link

chore: change table extraction defaults #370

Closed ds-filipknefel closed 3 months ago

ds-filipknefel commented 5 months ago

Change default values for table extraction - relies on this unstructured PR being merged and synced.

We want to move away from pdf_infer_table_structure parameter, in this PR we:

[x] set pdf_infer_table_structure=True and skip_infer_table_types=[] by default
[x] remove it from the examples in documentation
[x] describe it as deprecated in favor of skip_infer_table_types in documentation
[x] update the tests to reflect new usage
[x] bump unstructured dependecy to at least 0.13.0 (REQUIRED)

More detailed description of how we want parameters to interact

if pdf_infer_table_structure is False tables will never extracted from pdf
if pdf_infer_table_structure is True tables will be extracted from pdf unless it's skipped via skip_infer_table_types
on default pdf_infer_table_structure=True and skip_infer_table_types=[]

yuming-long commented 5 months ago

I thought adding table extraction would increase the API response time. Mind if i ask why we are setting this default to true?

ds-filipknefel commented 5 months ago

I thought adding table extraction would increase the API response time. Mind if i ask why we are setting this default to true?

On the request of @pawel-kmiecik, can you chime in for the reasons for it?

pawel-kmiecik commented 5 months ago

I thought adding table extraction would increase the API response time. Mind if i ask why we are setting this default to true?

On the request of @pawel-kmiecik, can you chime in for the reasons for it?

That's what I set with @cragwolfe in private conversation about parameters steering the table extraction:

e.g. with curl, one should just now just use the parameter: -F skip_infer_table_types="[]" to always get tables extracted if they are available for the API

AFAIK both pdf_infer_table_structure="true" and skip_infer_table_types="[]" must be explicitly set by a user to get the table extraction. We'd like to leave only skip_infer_table_types="[]"

yuming-long commented 5 months ago

gotcha, thanks!

yuming-long commented 5 months ago

please remove the readme instruction here: https://github.com/Unstructured-IO/unstructured-api?tab=readme-ov-file#pdf-table-extraction (and maybe some nit on https://github.com/Unstructured-IO/unstructured-api?tab=readme-ov-file#skip-table-extraction) otherwise LGTM!

ds-filipknefel commented 5 months ago

I thought adding table extraction would increase the API response time. Mind if i ask why we are setting this default to true?

On the request of @pawel-kmiecik, can you chime in for the reasons for it?

That's what I set with @cragwolfe in private conversation about parameters steering the table extraction:

e.g. with curl, one should just now just use the parameter: -F skip_infer_table_types="[]" to always get tables extracted if they are available for the API

AFAIK both pdf_infer_table_structure="true" and skip_infer_table_types="[]" must be explicitly set by a user to get the table extraction. We'd like to leave only skip_infer_table_types="[]"

Actually right now it's enough to set pdf_infer_table_structure=true because of how it's handled internally. Logging will throw a warning and tell you to correct yourself but actually ignore skip_infer_table_types contents.

So after this change even if user leaves pdf_infer_table_structure and skip_infer_table_types with default values, extraction will happen.