Unstructured-IO / unstructured-api

Apache License 2.0

build(deps): bump unstructured[local-inference] from 0.10.16 to 0.10.18 in /requirements #263

Closed dependabot[bot] closed 1 year ago

dependabot[bot] commented 1 year ago

Bumps unstructured[local-inference] from 0.10.16 to 0.10.18.

Release notes

Sourced from unstructured[local-inference]'s releases.

0.10.18

Enhancements

  • Better detection of natural reading order in images and PDFs. The elements returned by partition better reflect natural reading order in some cases, particularly in complicated multi-column layouts, leading to better chunking and retrieval for downstream applications. This is achieved by preprocessing bboxes before the xy-cut sort: all bounding boxes are shrunk by 90% along the x and y axes (still centered on the same center point), which allows projection lines to be drawn where overlapping layout bboxes previously prevented it.
  • Improves partition_xml to be faster and more memory efficient when partitioning large XML files. The new behavior is to partition iteratively, which avoids loading the entire XML tree into memory at once in most use cases.
  • Adds data source properties to the SharePoint, Outlook, OneDrive, Reddit, Slack, and DeltaTable connectors. These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This lets downstream applications surface information about the source document, e.g. a link to a GDrive doc, Salesforce record, etc.
  • Adds functionality to save embedded images in PDFs separately as images. Users can save the images embedded in a PDF as separate image files, given some directory path. The saved image path is written to the metadata for the Image element. Downstream applications may benefit by providing users with image links from relevant "hits."
  • Azure Cognitive Search destination connector. New Azure Cognitive Search destination connector added to the ingest CLI. Users may now use unstructured-ingest to write partitioned data from over 20 data sources (so far) to an Azure Cognitive Search index.
  • Improves Salesforce partitioning. Partitions Salesforce data as XML instead of text for improved detail and flexibility. Partitions htmlbody instead of textbody for Salesforce emails. Importance: Allows all Salesforce fields to be ingested and gives Salesforce emails more detailed partitioning.
  • Adds document-level language detection functionality. Introduces the "auto" default for the languages param, which then detects the languages present in the document using the langdetect package. Adds the document languages as ISO 639-3 codes to the element metadata. Implemented only for the partition_text function to start.
  • PPTX partitioner refactored in preparation for enhancement. Behavior should be unchanged except that shapes enclosed in a group-shape are now included, as many levels deep as required (a group-shape can itself contain a group-shape).
  • Embeddings support for the SharePoint SourceConnector via the unstructured-ingest CLI. The SharePoint connector can now optionally create embeddings from the elements it pulls out during partition and upload those embeddings to an Azure Cognitive Search index.
  • Improves hierarchy from docx files by leveraging natural hierarchies built into docx documents. Hierarchy can now be detected from the indentation level of list bullets/numbers and from the style name (e.g. Heading 1, List Bullet 2, List Number).
  • Chunking support for the SharePoint SourceConnector via the unstructured-ingest CLI. The SharePoint connector can now optionally chunk the elements pulled out during partition via the chunking unstructured brick. This can be used as a stage before creating embeddings.
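
To illustrate the bbox-shrinking idea from the reading-order enhancement above, here is a minimal sketch, not the library's implementation. It assumes boxes are (x1, y1, x2, y2) tuples and that shrinking means scaling width and height by a factor around the fixed center; the helper names and the factor of 0.9 are hypothetical, since the release note does not say exactly how "by 90%" is applied.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed layout


def shrink_bbox(bbox: BBox, factor: float) -> BBox:
    """Scale a box's width and height by `factor`, keeping its center fixed."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * factor / 2
    half_h = (y2 - y1) * factor / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def has_vertical_gap(left: BBox, right: BBox) -> bool:
    """True if a vertical projection line fits between the two boxes."""
    return left[2] < right[0]


# Two columns whose boxes overlap slightly on the x axis.
col_a: BBox = (0.0, 0.0, 52.0, 100.0)
col_b: BBox = (48.0, 0.0, 100.0, 100.0)

# Before shrinking, the overlap leaves no room for a cut between columns.
print(has_vertical_gap(col_a, col_b))

# After shrinking (hypothetical factor), a gap opens between the columns.
shrunk_a = shrink_bbox(col_a, 0.9)
shrunk_b = shrink_bbox(col_b, 0.9)
print(has_vertical_gap(shrunk_a, shrunk_b))
```

The shrunken boxes would only be used to decide where to cut; the original boxes stay attached to the elements.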

Features

  • Adds links metadata in partition_pdf for the fast strategy. Problem: PDF files contain rich information and hyperlinks that Unstructured did not previously capture. Feature: partition_pdf can now capture links embedded in the file along with their associated text and page number. Importance: Providing depth in extracted elements gives users a better understanding and richer context of documents. This also enables users to map to other elements within the document if the hyperlink is referenced internally.
  • Adds the embedding module to be able to embed Elements. Problem: Many NLP applications require the ability to represent parts of documents in a semantic way. Until now, Unstructured did not have text embedding ability within the core library. Feature: This embedding module is able to track embedding-related data with a class, embed a list of elements, and return an updated list of Elements with the embeddings property. The module is also able to embed query strings. Importance: The ability to embed documents or parts of documents will enable users to make use of these semantic representations in different NLP applications, such as search, retrieval, and retrieval augmented generation.
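
The embedding flow described above (embed a list of elements, return an updated list carrying an embeddings property, and embed query strings) can be sketched roughly as follows. This is a self-contained illustration, not the library's API: the Element class, toy_encoder, embed_elements, and embed_query are all hypothetical names, and the encoder is a placeholder for a real embedding model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned element (not the library class)."""
    text: str
    embeddings: Optional[List[float]] = None


def toy_encoder(text: str, dim: int = 4) -> List[float]:
    """Placeholder encoder: buckets character codes into a fixed-size vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch)
    total = sum(vec) or 1.0  # avoid division by zero for empty text
    return [v / total for v in vec]


def embed_elements(
    elements: List[Element],
    encode: Callable[[str], List[float]] = toy_encoder,
) -> List[Element]:
    """Return new elements with embeddings set; the inputs are left untouched."""
    return [Element(text=e.text, embeddings=encode(e.text)) for e in elements]


def embed_query(
    query: str,
    encode: Callable[[str], List[float]] = toy_encoder,
) -> List[float]:
    """Embed a query string with the same encoder used for the elements."""
    return encode(query)


docs = [Element("hello"), Element("world")]
embedded = embed_elements(docs)
print(len(embedded), len(embedded[0].embeddings))
```

Using one encoder for both elements and queries keeps the two in the same vector space, which is what search and retrieval depend on.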

Fixes

  • Fixes a metadata source serialization bug. Problem: In unstructured elements, when loading an elements json file from disk, the data_source attribute is assumed to be an instance of DataSourceMetadata and the code acts on that assumption. However, the loader did not satisfy the assumption and loaded it as a dict instead, causing an error. Fix: Added the necessary code block to initialize a DataSourceMetadata object, and refactored the DataSourceMetadata.from_dict() method to remove redundant code. Importance: Crucial to be able to load elements (which have data_source fields) from json files.
  • Fixes an issue where unstructured-inference was not getting updated. Problem: unstructured-inference was not upgraded to the version matching the unstructured release when doing a pip install. Solution: pip install unstructured[all-docs] now upgrades both unstructured and unstructured-inference. Importance: This ensures that the inference library is always in sync with the unstructured library; otherwise users would be running outdated libraries, which will likely lead to unintended behavior.
  • Fixes SharePoint connector failures if any document has an unsupported filetype. Problem: Currently the entire connector ingest run fails if a single IngestDoc has an unsupported filetype. This is because a ValueError is raised in the IngestDoc's __post_init__. Fix: Adds a try/catch when the IngestConnector runs get_ingest_docs, so that the error is logged but all processable documents are still instantiated as IngestDocs and returned. Importance: Allows users to ingest SharePoint content even when some files with unsupported filetypes exist there.
  • Fixes SharePoint connector server_path issue. Problem: The server path for the SharePoint Ingest Doc was incorrectly formatted, causing issues while fetching pages from the remote source. Fix: Changes the formatting of the remote file path before instantiating SharepointIngestDocs and appends a '/' while fetching pages from the remote source. Importance: Allows users to fetch pages from SharePoint Sites.
  • Fixes badly initialized Formula. Problem: YoloX introduces new types of elements. When loading a document that contains formulas, a new element of that class should be generated; however, the Formula class inherited from Element instead of Text. Fix: Change the parent class of Formula to Text, so that the element is created with the correct class and the document can be loaded. Importance: Crucial to be able to load documents that contain formulas.
  • Fixes Sphinx errors. Fixes errors when running Sphinx make html and installs a library to suppress warnings.
  • Fixes a metadata backwards compatibility error. Problem: When calling partition_via_api, the hosted api may return an element schema that's newer than the current unstructured. In this case, metadata fields were added which did not exist in the local ElementMetadata dataclass, and __init__() threw an error. Fix: Remove nonexistent fields before instantiating in ElementMetadata.from_json(). Importance: Crucial to avoid breaking changes when adding fields.
  • Fixes an issue with the Discord connector when a channel returns None. Problem: Getting the jump_url from a nonexistent Discord channel fails. Fix: The jump_url property is now retrieved within the same context as the messages from the channel. Importance: Avoids cascading issues when the connector fails to fetch information about a Discord channel.
  • Fixes occasional SIGABRT when writing a table with deltalake on Linux. Problem: Occasionally on Linux, ingest can throw a SIGABRT when writing a deltalake table even though the table was written correctly. Fix: Put the writing function into a Process to ensure it runs to completion before returning to the main process. Importance: Improves stability of connectors using deltalake.
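
The metadata backwards compatibility fix above (dropping fields the local dataclass does not know about before instantiating) can be illustrated with a small sketch. The Metadata class below is a hypothetical, trimmed-down stand-in and not the library's ElementMetadata; only the filtering technique using dataclasses.fields is the point.

```python
import json
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for an element metadata dataclass."""
    filename: Optional[str] = None
    page_number: Optional[int] = None

    @classmethod
    def from_json(cls, payload: str) -> "Metadata":
        raw = json.loads(payload)
        # Keep only keys that match declared dataclass fields, so a newer
        # schema with extra fields does not make __init__() raise TypeError.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


# "new_field" simulates a field added by a newer server-side schema.
newer_payload = '{"filename": "a.pdf", "page_number": 2, "new_field": "x"}'
meta = Metadata.from_json(newer_payload)
print(meta)
```

Silently dropping unknown keys trades completeness for forward compatibility: older clients keep working, but they do not see the newer fields.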
Commits
  • 5b994f3 build(release): actually make the release 0.10.18 (#1576)
  • e0e329c build(release): cut release for 0.10.7 (#1575)
  • 44f5605 build(image): call python3 not python for image compat (#1574)
  • 94fbbed feat: bbox shrinking in xycut algo, better natural reading order (#1560)
  • cd8c6a2 fix: occasional SIGABRT with deltalake writer on Linux (#1567)
  • 4e84e32 fix: Discord connector when a channel is not found. (#1480)
  • 792232d Chore: move scarf to setup.py (#1569)
  • e5d0866 enhancement: memory efficient xml partitioning (#1547)
  • 62b0557 build: ignore failing delta lake test ingest for now (#1557)
  • 2e01c49 feat: adds data source properties to delta table connector. (#1464)
  • Additional commits viewable in compare view



Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot will merge this PR once CI passes on it, as requested by @awalker4.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:

  • `@dependabot rebase` will rebase this PR
  • `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
  • `@dependabot merge` will merge this PR after your CI passes on it
  • `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
  • `@dependabot cancel merge` will cancel a previously requested merge and block automerging
  • `@dependabot reopen` will reopen this PR if it is closed
  • `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
  • `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)