Unstructured-IO / unstructured-api

Apache License 2.0

build(deps): bump unstructured[local-inference] from 0.10.18 to 0.10.19 in /requirements #272

Closed dependabot[bot] closed 1 year ago

dependabot[bot] commented 1 year ago

Bumps unstructured[local-inference] from 0.10.18 to 0.10.19.

Release notes

Sourced from unstructured[local-inference]'s releases.

0.10.19

Enhancements

  • Adds XLSX document level language detection: Building on the language detection functionality in the previous release, we now support language detection within .xlsx files at the Element level.
  • Bump unstructured-inference to 0.6.6: The updated version of unstructured-inference makes table extraction in hi_res mode configurable, so that table extraction performance can be fine-tuned; it also improves element detection by adding a deduplication post-processing step to the hi_res partitioning of PDFs and images.
  • Detect text in HTML heading tags as Titles: This increases the accuracy of hierarchies in HTML documents and provides more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, it is categorized as a title.
  • Update Python-based docs: Refactor the docs to call the actual unstructured code rather than using the subprocess library to run the CLI command.
  • Adds data source properties to SharePoint, Outlook, OneDrive, Reddit, and Slack connectors: These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
  • Adds Table support for the add_chunking_strategy decorator to partition functions: In addition to combining elements under Title elements, users can now specify the max_characters=<n> argument to chunk Table elements into TableChunk elements with text and text_as_html of length <n> characters. This means partitioned Table results are ready for use in downstream applications without any post-processing.
  • Expose endpoint URL for S3 connectors: Allowing the endpoint URL to be explicitly overridden makes it possible to support any non-AWS data provider that implements the S3 protocol (e.g. MinIO).
  • Change default hi_res model for PDF/image partitioning to yolox: Partitioning PDFs and images with the hi_res strategy now uses the yolox_quantized model instead of the detectron2_onnx model. The new default model has better recall for tables and produces more detailed categories for elements.
  • XLSX can now read subtables within one sheet. Problem: Many .xlsx files are not created to be read as one full table per sheet. There are subtables, text, and headers, along with more information to extract from each sheet. Feature: partition_xlsx can now read subtable(s) within one .xlsx sheet, along with extracting other title and narrative text. Importance: This extends .xlsx reading beyond one table per sheet, allowing users to capture more data tables from the file, if they exist.
  • Update Documentation on Element Types and Metadata: We have updated the documentation according to the latest element types and metadata. It includes the common and additional metadata provided by the Partitions and Connectors.
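
To make the max_characters behaviour from the Table chunking note concrete, here is a minimal sketch of splitting table text into fixed-size pieces. It is not the library's implementation and does not import unstructured: the `TableChunk` dataclass and the `chunk_table_text` helper are hypothetical stand-ins, and the real feature also chunks `text_as_html`, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TableChunk:
    """Hypothetical stand-in for the library's TableChunk element."""
    text: str


def chunk_table_text(text: str, max_characters: int) -> List[TableChunk]:
    """Split table text into consecutive pieces of at most max_characters."""
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    return [
        TableChunk(text=text[start:start + max_characters])
        for start in range(0, len(text), max_characters)
    ]


# Illustrative input only: a 24-character flattened table.
table_text = "Name Age Alice 30 Bob 25"
chunks = chunk_table_text(table_text, max_characters=10)
for chunk in chunks:
    print(repr(chunk.text))
```

Every piece except possibly the last has exactly max_characters characters, and concatenating the pieces gives back the original text.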

Fixes

  • Fixes partition_pdf is_alnum reference bug. Problem: When attempting to get the bounding box from an element, partition_pdf raised a reference-before-assignment error if the first object was not text-extractable. Fix: Switched to a flag that is set when the condition is met. Importance: Crucial for being able to partition PDFs.
  • Fix various cases of HTML text missing after partition. Problem: Under certain circumstances, text immediately after some HTML tags is missing from the partition result. Fix: Updated the code to deal with these cases. Importance: This ensures correctness when partitioning HTML and Markdown documents.
  • Fixes chunking when detection_class_prob appears in Element metadata. Problem: When detection_class_prob appears in Element metadata, Elements are only combined by chunk_by_title if they have the same detection_class_prob value (which is rare). This is unlikely to be a case we ever need to support, and it most often results in no chunking. Fix: detection_class_prob is added to the list of metadata keys excluded from the similarity comparison during chunking. Importance: This change allows chunk_by_title to operate as intended for documents which include detection_class_prob metadata in their Elements.
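
The detection_class_prob fix can be illustrated with a small sketch of a metadata comparison that ignores selected keys. This is an assumption-laden illustration, not the code in chunk_by_title: elements are modelled as plain dicts, `EXCLUDED_KEYS`, `comparable`, and `group_by_metadata` are hypothetical names, and the library's real exclusion list may contain other keys. It covers only the metadata comparison step, not the splitting on Title elements.

```python
from typing import Any, Dict, List, Optional

# Assumed exclusion list for illustration; the library's actual list may differ.
EXCLUDED_KEYS = {"detection_class_prob"}


def comparable(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Drop keys that should not influence the similarity comparison."""
    return {k: v for k, v in metadata.items() if k not in EXCLUDED_KEYS}


def group_by_metadata(elements: List[Dict[str, Any]]) -> List[List[str]]:
    """Group consecutive elements whose comparable metadata are equal."""
    groups: List[List[str]] = []
    previous: Optional[Dict[str, Any]] = None
    for element in elements:
        current = comparable(element["metadata"])
        if groups and current == previous:
            groups[-1].append(element["text"])
        else:
            groups.append([element["text"]])
        previous = current
    return groups


# Illustrative elements: the first two differ only in detection_class_prob.
elements = [
    {"text": "Intro", "metadata": {"filename": "a.pdf", "detection_class_prob": 0.91}},
    {"text": "Body", "metadata": {"filename": "a.pdf", "detection_class_prob": 0.47}},
    {"text": "Other", "metadata": {"filename": "b.pdf", "detection_class_prob": 0.47}},
]
groups = group_by_metadata(elements)
print(groups)
```

With the key excluded, "Intro" and "Body" land in the same group despite their different probabilities. Emptying `EXCLUDED_KEYS` in this sketch puts every element in its own group, which mirrors the "no chunking" symptom described in the fix.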

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot will merge this PR once CI passes on it, as requested by @awalker4.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:

  • `@dependabot rebase` will rebase this PR
  • `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
  • `@dependabot merge` will merge this PR after your CI passes on it
  • `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
  • `@dependabot cancel merge` will cancel a previously requested merge and block automerging
  • `@dependabot reopen` will reopen this PR if it is closed
  • `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • `@dependabot show ignore conditions` will show all of the ignore conditions of the specified dependency
  • `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

awalker4 commented 1 year ago

@dependabot merge