Unstructured-IO / unstructured-api

Apache License 2.0

Chunking Parameters don't work as expected #337

Closed JonasDupal closed 6 months ago

JonasDupal commented 6 months ago

Describe the bug

When setting chunking_strategy to by_title and increasing the other chunking parameters to values above their defaults, the results don't change. With the following values:

combine_under_n_chars = 1000000
new_after_n_chars = 1000000
max_characters = 1000000
multipage_sections = true

I would have expected to get back a single section containing all the text of the file. Instead I get the same results as with the default values: multiple sections. On the other hand, when I reduce the parameter values, the results do differ from the defaults.

To Reproduce

I am using Postman to test the API (https://api.unstructured.io/general/v0/general) and I am sending a PDF file.

Environment:

lambda-science commented 6 months ago

Same issue. With:

params = {
    'skip_infer_table_types': '[]',
    'chunking_strategy': 'by_title',
    'combine_under_n_chars': '500',
    'new_after_n_chars': '2000',
    'max_characters': '2500',
    'pdf_infer_table_structure': 'True',
    # A dict literal keeps only the last value of a repeated key,
    # so both languages are passed as one list.
    'languages': ['eng', 'fra'],
    'strategy': 'fast',
}

I sometimes get elements that are shorter than 100 characters!

scanny commented 6 months ago

@JonasDupal do you have a PDF file you can share with us that reproduces the behavior you're seeing?

If not, can you try partitioning and chunking locally and see if you get the same results?

Note that Table and TableChunk chunks are never combined with any other chunk so if the PDF contains a lot of tables this could be limiting the amount of combining of small chunks.

scanny commented 6 months ago

@lambda-science the combine_under_n_chars argument can only reduce small chunks, it cannot eliminate them. So you can still see small chunks. The key constraint is that only consecutive chunks can be combined. Also, Table elements are never combined with any other element during chunking.

If you're seeing two consecutive chunks where the first is less than combine_under_n_chars and the second would fit together with the first in under max_characters, then that would indicate a problem.
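The combining rule described above can be sketched in a few lines of Python. This is only an illustration of the stated constraints (consecutive chunks only, tables never merged, merge only when the first chunk is under combine_under_n_chars and the pair fits within max_characters); it is not the library's implementation. The `Chunk` class, the `combine_small_chunks` function, the "\n\n" separator and the inclusive size check are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk; not the library's element type."""
    text: str
    is_table: bool = False  # tables are never merged with a neighbor


def combine_small_chunks(chunks, combine_under_n_chars, max_characters, sep="\n\n"):
    """Greedily merge each chunk into the previous one when the rule allows it."""
    combined = []
    for chunk in chunks:
        prev = combined[-1] if combined else None
        can_merge = (
            prev is not None
            and not prev.is_table
            and not chunk.is_table
            # the earlier chunk must be "small"
            and len(prev.text) < combine_under_n_chars
            # and the merged text must still fit (assumed inclusive limit)
            and len(prev.text) + len(sep) + len(chunk.text) <= max_characters
        )
        if can_merge:
            combined[-1] = Chunk(prev.text + sep + chunk.text)
        else:
            combined.append(Chunk(chunk.text, chunk.is_table))
    return combined


chunks = [Chunk("a" * 40), Chunk("b" * 30), Chunk("c" * 50)]
merged = combine_small_chunks(chunks, combine_under_n_chars=60, max_characters=100)
print([len(c.text) for c in merged])  # [72, 50]
```

In this sketch the first two chunks merge (40 < 60 and 72 <= 100), but the third stays separate because the merged chunk is no longer under the threshold, which is why small chunks can be reduced but not always eliminated.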

Let us know if you're seeing that behavior and we can work to reproduce it. A PDF that reproduces the problem speeds things up considerably. Also a good first step is to see if you can reproduce it when partitioning/chunking locally, which would narrow the possible source of the problem.

lambda-science commented 6 months ago

Here is a sample PDF: LINK

Here is my Unstructured cURL:

curl --location 'http://localhost:8002/general/v0/general' \
--header 'accept: application/json' \
--form 'skip_infer_table_types="[]"' \
--form 'chunking_strategy="by_title"' \
--form 'combine_under_n_chars="1500"' \
--form 'new_after_n_chars="3000"' \
--form 'max_characters="5000"' \
--form 'pdf_infer_table_structure="True"' \
--form 'languages="eng"' \
--form 'languages="fra"' \
--form 'strategy="fast"' \
--form 'files=@"/C:/Users/cmeyer/pdf_example.pdf"'

Here is the answer by the API:

[
    {
        "type": "CompositeElement",
        "element_id": "e21f94d6bac571d84492dd5324171ba2",
        "text": "show lanpower slot chassis/slot status\n\nSyntax Definitions\n\nchassis/slot Display the status for a slot.\n\nDefaults\n\nN/A\n\nPlatforms Supported\n\nThis command is supported on the following OmniSwitch platforms:\n\n6360\n\n6465\n\n6560 6570M 6860 6860N\n\n6865\n\n6900\n\n6900\n\n6900\n\n9900\n\nV72/C32 X48C6/T48C6/\n\nX48C4E/V48C8/ C32E/T24C2/ X24C2\n\nYes\n\nYes\n\nYes\n\nNo\n\nYes\n\nYes\n\nYes\n\nNo\n\nNo\n\nNo\n\nYes\n\nUsage Guidelines\n\nN/A\n\nExamples\n\n> show lanpower slot 1/1 status Chas/Slot Status Init Status 8023BT Priority Capacitor FPoE PPoE F/W Rev Supported Disconnect Detection\n\n--------+--------+--------------+------------+-------------+------------+----------+---------+-----------\n\n1/1 enable initialized enable disable disable disable enable 352",
        "metadata": {
            "languages": [
                "eng",
                "fra"
            ],
            "page_number": 1,
            "filename": "pdf_example.pdf",
            "filetype": "application/pdf"
        }
    },
    {
        "type": "CompositeElement",
        "element_id": "3093c21e805ce419f3c53ae2dc64faa3",
        "text": "Not Available\n\noutput definitions\n\nChas/Slot Chassis/slot. Status The lanpower status.\n\nInit Status The lanpower initialization status.\n\n8023BT Supported 802.3bt support status.\n\nPriority Disconnect Priority disconnect status.\n\nCapacitor Detection Capacitor detection status.\n\nFPoE Fast PoE status.\n\nPPoE Perpetual PoE status.\n\nF/W Rev Firmware revision.\n\nOmniSwitch AOS Release 8 CLI Reference Guide April 2023 page 2-56\n\nPower over Ethernet (PoE) Commands show lanpower status\n\nRelease History\n\nRelease 8.7R2; command was introduced.",
        "metadata": {
            "languages": [
                "eng",
                "fra"
            ],
            "page_number": 1,
            "filename": "pdf_example.pdf",
            "filetype": "application/pdf"
        }
    },
    {
        "type": "CompositeElement",
        "element_id": "774ebb8fc1996a4b3e3c8b63a2e802bf",
        "text": "Related Commands\n\nshow lanpower Displays the PoE status and related statistics for all ports in a specified slot.\n\nMIB Objects\n\nalaPethMainPseAdminStatus alaPethMainPsePriorityDisconnect alaPethMainPseCapacitorDetect alaPethMainPseFastPoE alaPethMainPsePerptualPoE\n\nOmniSwitch AOS Release 8 CLI Reference Guide April 2023 page 2-57",
        "metadata": {
            "languages": [
                "eng",
                "fra"
            ],
            "page_number": 1,
            "filename": "pdf_example.pdf",
            "filetype": "application/pdf"
        }
    }
]

Why are there 3 chunks in this example?

scanny commented 6 months ago

@lambda-science Okay, thanks to your specimen doc I was able to identify the problem. Turns out that combine_under_n_chars is a typo that should really be combine_text_under_n_chars. So that arg is not making it to the partition() call and is defaulting to 500.

When I make that fix and run it locally the sample PDF appears as a single chunk (CompositeElement).
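To illustrate how a mismatched keyword can fall back to a default without raising an error, here is a minimal sketch. It assumes a callee that tolerates unknown keyword arguments; `partition_stub` and `forward` are hypothetical names for this example and do not reflect the actual code in unstructured or unstructured-api.

```python
def partition_stub(combine_text_under_n_chars=500, **other_kwargs):
    """Hypothetical callee: unknown keyword arguments are silently accepted."""
    return combine_text_under_n_chars


def forward(params):
    """Hypothetical caller that passes its parameters straight through."""
    return partition_stub(**params)


# The misnamed key lands in other_kwargs, so the default of 500 applies.
print(forward({"combine_under_n_chars": 1500}))       # 500
# With the name the callee expects, the value is used.
print(forward({"combine_text_under_n_chars": 1500}))  # 1500
```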

@JonasDupal this might explain the behavior you're seeing as well.

I'll get to work on fixing this and keep you updated :)

lambda-science commented 6 months ago

Great to hear, thank you very much! :) I guess the docs should then be updated with the correct parameter here: https://unstructured-io.github.io/unstructured/apis/api_parameters.html and also here: https://github.com/Unstructured-IO/unstructured-api

lambda-science commented 6 months ago

However, even with the default of 500, I don't think it is working properly. I'll provide a second example when I have time, to confirm this.

lambda-science commented 6 months ago

I'm sorry, but your fix (using combine_text_under_n_chars) doesn't seem to make a difference. New cURL:

curl --location 'http://localhost:8002/general/v0/general' \
--header 'accept: application/json' \
--form 'skip_infer_table_types="[]"' \
--form 'chunking_strategy="by_title"' \
--form 'combine_text_under_n_chars="1500"' \
--form 'new_after_n_chars="3000"' \
--form 'max_characters="5000"' \
--form 'pdf_infer_table_structure="True"' \
--form 'languages="eng"' \
--form 'languages="fra"' \
--form 'strategy="fast"' \
--form 'files=@"/C:/Users/cmeyer/OneDrive - ALE International/Bureau/pdf_example.pdf"'

Note the new combine_text_under_n_chars parameter. Here is the response:

[
    {
        "type": "CompositeElement",
        "element_id": "e21f94d6bac571d84492dd5324171ba2",
        "text": "show lanpower slot chassis/slot status\n\nSyntax Definitions\n\nchassis/slot Display the status for a slot.\n\nDefaults\n\nN/A\n\nPlatforms Supported\n\nThis command is supported on the following OmniSwitch platforms:\n\n6360\n\n6465\n\n6560 6570M 6860 6860N\n\n6865\n\n6900\n\n6900\n\n6900\n\n9900\n\nV72/C32 X48C6/T48C6/\n\nX48C4E/V48C8/ C32E/T24C2/ X24C2\n\nYes\n\nYes\n\nYes\n\nNo\n\nYes\n\nYes\n\nYes\n\nNo\n\nNo\n\nNo\n\nYes\n\nUsage Guidelines\n\nN/A\n\nExamples\n\n> show lanpower slot 1/1 status Chas/Slot Status Init Status 8023BT Priority Capacitor FPoE PPoE F/W Rev Supported Disconnect Detection\n\n--------+--------+--------------+------------+-------------+------------+----------+---------+-----------\n\n1/1 enable initialized enable disable disable disable enable 352",
        "metadata": {
            "languages": [
                "eng",
                "fra"
            ],
            "page_number": 1,
            "filename": "pdf_example.pdf",
            "filetype": "application/pdf"
        }
    },
    {
        "type": "CompositeElement",
        "element_id": "3093c21e805ce419f3c53ae2dc64faa3",
        "text": "Not Available\n\noutput definitions\n\nChas/Slot Chassis/slot. Status The lanpower status.\n\nInit Status The lanpower initialization status.\n\n8023BT Supported 802.3bt support status.\n\nPriority Disconnect Priority disconnect status.\n\nCapacitor Detection Capacitor detection status.\n\nFPoE Fast PoE status.\n\nPPoE Perpetual PoE status.\n\nF/W Rev Firmware revision.\n\nOmniSwitch AOS Release 8 CLI Reference Guide April 2023 page 2-56\n\nPower over Ethernet (PoE) Commands show lanpower status\n\nRelease History\n\nRelease 8.7R2; command was introduced.",
        "metadata": {
            "languages": [
                "eng",
                "fra"
            ],
            "page_number": 1,
            "filename": "pdf_example.pdf",
            "filetype": "application/pdf"
        }
    },
    {
        "type": "CompositeElement",
        "element_id": "774ebb8fc1996a4b3e3c8b63a2e802bf",
        "text": "Related Commands\n\nshow lanpower Displays the PoE status and related statistics for all ports in a specified slot.\n\nMIB Objects\n\nalaPethMainPseAdminStatus alaPethMainPsePriorityDisconnect alaPethMainPseCapacitorDetect alaPethMainPseFastPoE alaPethMainPsePerptualPoE\n\nOmniSwitch AOS Release 8 CLI Reference Guide April 2023 page 2-57",
        "metadata": {
            "languages": [
                "eng",
                "fra"
            ],
            "page_number": 1,
            "filename": "pdf_example.pdf",
            "filetype": "application/pdf"
        }
    }
]

The output is exactly the same as in my previous comment with the PDF I provided. There are still multiple chunks despite a very high "combine under N chars" value:

1st chunk: 785 chars
2nd chunk: 561 chars
3rd chunk: 339 chars
Total: 1685 chars

Why are they not all combined?
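The arithmetic behind that question can be checked directly. The sketch below uses the chunk lengths reported above and the thresholds from the cURL request; the two-character "\n\n" separator between merged chunks is an assumption.

```python
chunk_lengths = [785, 561, 339]   # reported chunk sizes
combine_threshold = 1500          # value sent in the request
max_characters = 5000             # value sent in the request
separator_len = 2                 # assumption: chunks are joined with "\n\n"

# Total size if all three chunks were merged into one.
total = sum(chunk_lengths) + separator_len * (len(chunk_lengths) - 1)
all_small = all(n < combine_threshold for n in chunk_lengths)
fits = total <= max_characters

print(total, all_small, fits)  # 1689 True True
```

Every chunk is below the requested threshold and the merged text fits well within max_characters, so with these values a single chunk would be the expected result.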

Unstructured Container:

unstructured_api DEBUG pipeline_api input params: {"filename": "pdf_example.pdf", "response_type": "application/json", "m_coordinates": [], "m_encoding": [], "m_hi_res_model_name": [], "m_include_page_breaks": [], "m_ocr_languages": null, "m_pdf_infer_table_structure": ["True"], "m_skip_infer_table_types": ["[]"], "m_strategy": ["fast"], "m_xml_keep_tags": [], "languages": ["eng", "fra"], "m_chunking_strategy": ["by_title"], "m_multipage_sections": [], "m_combine_under_n_chars": ["1500"], "new_after_n_chars": ["3000"], "m_max_characters": ["5000"], "m_extract_image_block_types": null}
unstructured_api DEBUG filetype: application/pdf
unstructured_api DEBUG partition input data: {"content_type": "application/pdf", "strategy": "fast", "ocr_languages": null, "coordinates": false, "pdf_infer_table_structure": false, "include_page_breaks": false, "encoding": null, "hi_res_model_name": null, "xml_keep_tags": false, "skip_infer_table_types": "[]", "languages": ["eng", "fra"], "chunking_strategy": "by_title", "multipage_sections": true, "combine_under_n_chars": 1500, "new_after_n_chars": 3000, "max_characters": 5000, "extract_image_block_types": null, "extract_image_block_to_payload": false}
172.18.0.1:58426 POST /general/v0/general HTTP/1.1 - 200 OK

scanny commented 6 months ago

@lambda-science It wasn't your typo "combine_under_n_chars", it was ours :)

So the curl line should read: --form 'combine_under_n_chars="1500"' \

I'll be adding the ability to use either keyword in a later PR, but the "misspelled" keyword has been out there for 4 months and appears in the documentation that way, so the first step is to make it work as documented :)
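Accepting either keyword could look roughly like the sketch below. It is a guess at one possible approach, not the change that was made: `resolve_combine_threshold` is a hypothetical helper, and giving the documented name precedence when both are present is an assumption.

```python
def resolve_combine_threshold(form_params, default=500):
    """Return the combine threshold from whichever keyword was supplied.

    Hypothetical helper: the documented name is checked first (assumed
    precedence), then the longer name, then the default.
    """
    for key in ("combine_under_n_chars", "combine_text_under_n_chars"):
        value = form_params.get(key)
        if value not in (None, ""):
            return int(value)
    return default


print(resolve_combine_threshold({"combine_under_n_chars": "1500"}))       # 1500
print(resolve_combine_threshold({"combine_text_under_n_chars": "1500"}))  # 1500
print(resolve_combine_threshold({}))                                      # 500
```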

lambda-science commented 6 months ago

Update: after pulling the latest version from Quay (0.63) https://quay.io/repository/unstructured-io/unstructured-api?tab=tags, using the parameter combine_under_n_chars works now. Thank you very much.

scanny commented 6 months ago

@lambda-science Glad to hear you got it working :) Thanks for taking the time to provide a reproducing document, that really sped things up :)

@JonasDupal I'll close this assuming it fixes the problem you observed as well. Feel free to reopen if not :)