Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.5k stars 694 forks

bug/partition_pdf can't convert table to html table format #956

Closed praguepp closed 1 year ago

praguepp commented 1 year ago

Describe the bug

I used `UnstructuredFileLoader` from `langchain.document_loaders` to convert a PDF that contains text, tables, and images into text output in which the tables are rendered as HTML. However, the call below produces no HTML tables. How should I call the loader and pass its arguments so that the tables come out as HTML?

The code is listed below:

from langchain.document_loaders import UnstructuredFileLoader

# mode='single' returns the whole file as one Document
loader = UnstructuredFileLoader('1.pdf', mode='single', infer_table_structure=True, strategy='hi_res')
document = loader.load()

d = str(document[0])

with open('pdfhtml', 'w') as f:
    f.write(d)

To Reproduce

@process_metadata()
@add_metadata_with_filetype(FileType.PDF)
def partition_pdf(
    filename: str = "",
    file: Optional[Union[BinaryIO, SpooledTemporaryFile]] = None,
    include_page_breaks: bool = False,
    strategy: str = "auto",
    infer_table_structure: bool = False,
    ocr_languages: str = "eng",
    max_partition: Optional[int] = 1500,
    include_metadata: bool = True,
    metadata_filename: Optional[str] = None,
    **kwargs,
) -> List[Element]:
    """Parses a pdf document into a list of interpreted elements.
    Parameters

filename
    A string defining the target filename path.
file
    A file-like object as bytes --> open(filename, "rb").
strategy
    The strategy to use for partitioning the PDF. Valid strategies are "hi_res",
    "ocr_only", and "fast". When using the "hi_res" strategy, the function uses
    a layout detection model to identify document elements. When using the
    "ocr_only" strategy, partition_pdf simply extracts the text from the
    document using OCR and processes it. If the "fast" strategy is used, the text
    is extracted directly from the PDF. The default strategy `auto` will determine
    when a page can be extracted using `fast` mode, otherwise it will fall back to `hi_res`.
infer_table_structure
    Only applicable if `strategy=hi_res`.
    If True, any Table elements that are extracted will also have a metadata field
    named "text_as_html" where the table's text content is rendered into an html string.
    I.e., rows and cells are preserved.
    Whether True or False, the "text" field is always present in any Table element
    and is the text content of the table (no structure).
ocr_languages
    The languages to use for the Tesseract agent. To use a language, you'll first need
    to install the appropriate Tesseract language pack.
max_partition
    The maximum number of characters to include in a partition. If None is passed,
    no maximum is applied. Only applies to the "ocr_only" strategy.
"""
exactly_one(filename=filename, file=file)
return partition_pdf_or_image(
    filename=filename,
    file=file,
    include_page_breaks=include_page_breaks,
    # strategy=strategy,
    # infer_table_structure=infer_table_structure,
    # set to True
    infer_table_structure=True,
    # set to hi_res
    strategy='hi_res',
    ocr_languages=ocr_languages,
    max_partition=max_partition,
    **kwargs,
)
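
For comparison, here is a minimal sketch of calling `partition_pdf` directly with the two arguments set, instead of editing the library source, and then collecting the `text_as_html` metadata field the docstring describes. The import path `unstructured.partition.pdf` and the `"Table"` category name are assumptions about the installed version, and `tables_as_html` and `partition_with_tables` are hypothetical helper names introduced only for illustration.

```python
from typing import Any, Iterable, List


def tables_as_html(elements: Iterable[Any]) -> List[str]:
    """Return the HTML rendering of every Table element that has one."""
    html_tables = []
    for element in elements:
        # Assumption: table elements report the category "Table".
        if getattr(element, "category", None) != "Table":
            continue
        html = getattr(element.metadata, "text_as_html", None)
        if html:  # only populated when table structure was inferred
            html_tables.append(html)
    return html_tables


def partition_with_tables(path: str) -> List[str]:
    """Partition a PDF with table-structure inference and return table HTML."""
    # Imported lazily so tables_as_html can be used without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(
        filename=path,
        strategy="hi_res",
        infer_table_structure=True,
    )
    return tables_as_html(elements)
```

Calling `partition_with_tables('1.pdf')` should return one HTML string per detected table. If the list is empty, the layout model probably did not classify any region as a table.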

Expected behavior

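\n\nDisconnect the device from the network and contact technical support to confirm the hardware fault and return it for repair.\n\nAbbreviations\n\n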

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
Abbreviation English full name Chinese full name
SNMP Simple Network Management Protocol 简单网络管理协议
RADIUS Remote Authentication Dial In User Service 远程用户拨号认证服务
AP Access Point 无线
DNS Domain Name System 域名系统(服务)协议
LDAP Lightweight Directory Access Protocol 轻型目录访问协议
DHCP Dynamic Host Configuration Protocol 动态主机配置协议
ARP Address Resolution Protocol 地址解析协议
TCP Transmission Control Protocol 传输控制协议
VLAN Virtual Local Area Network 虚拟局域网
NAT Network Address Translation 网络地址转换
BBC Branch Bussiness Center BBC
IM Instant Messaging 通讯软件
BA Behavior Awareness 日志分析平台
EDR Endpoint Detection and Response 终端检查响应平台
AD Active Directory 活动目录
HA High Availability 高可用性
MTU Maximum Transmission Unit 最大传输单元
MSS Maximum Segment Size 最大报文段长度
SAAS Software as service 软件级服务
MAB MAC Authentication Bypass 基于mac地址的IEEE 802.1x免认证
\n\nUser Manual Classification: Public' metadata={'source': 'test3.docx'}root@9070a3bd3b7d:/#

Screenshots

Abbreviations

Abbreviations English Full Form Chinese Full Form
SNMP Simple Network Management Protocol Simple Network Management Protocol
RADIUS Remote Authentication Dial In User Service Remote Dial-Up User Authentication Service
AP Access Point Wireless
DNS Domain Name System Domain Name System(Service) Protocol
LDAP Lightweight Directory Access Protocol Lightweight Directory Access Protocol
DHCP Dynamic Host Configuration Protocol Dynamic Host Configuration Protocol
ARP Address Resolution Protocol Address Resolution Protocol
TCP Transmission Control Protocol Transmission Control Protocol
VLAN Virtual Local Area Network Virtual Local Area Network
NAT Network Address Translation Network Address Translation
BBC Branch Bussiness Center BBC
IM Instant Messaging Communication Software
BA Behavior Awareness Log Analysis Platform
EDR Endpoint Detection and Response Endpoint Inspection Response Platform
AD Active Directory Active Directory
HA High Availability High Availability
MTU Maximum Transmission Unit Maximum Transmission Unit
MSS Maximum Segment Size Maximum Segment Length
SAAS Software as service Software Level Service
MAB MAC Authentication Bypass MAC Address-Based IEEE 802.1x Exemption Authentication


praguepp commented 1 year ago
strategy
    The strategy to use for partitioning the PDF. Valid strategies are "hi_res",
    "ocr_only", and "fast". When using the "hi_res" strategy, the function uses
    a layout detection model to identify document elements. When using the
    "ocr_only" strategy, partition_pdf simply extracts the text from the
    document using OCR and processes it. If the "fast" strategy is used, the text
    is extracted directly from the PDF. The default strategy `auto` will determine
    when a page can be extracted using `fast` mode, otherwise it will fall back to `hi_res`.
infer_table_structure
    Only applicable if `strategy=hi_res`.
    If True, any Table elements that are extracted will also have a metadata field
    named "text_as_html" where the table's text content is rendered into an html string.
    I.e., rows and cells are preserved.
    Whether True or False, the "text" field is always present in any Table element
    and is the text content of the table (no structure).
MthwRobinson commented 1 year ago

Hi @praguepp - it looks like you're using the loader in "single" mode. You'll need to use the loader in "elements" mode to get the HTML representation of the table. It will be available in the document metadata.
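
A minimal sketch of that suggestion follows. It assumes the LangChain version in use copies each element's metadata, including `text_as_html`, into `Document.metadata` when `mode="elements"`. `documents_to_text` and `load_pdf_with_html_tables` are hypothetical helper names, not part of either library.

```python
from typing import Any, Iterable


def documents_to_text(documents: Iterable[Any]) -> str:
    """Join documents, preferring a table's HTML over its plain text."""
    parts = []
    for doc in documents:
        # Assumption: in "elements" mode the element metadata is a plain dict.
        html = doc.metadata.get("text_as_html")
        parts.append(html if html else doc.page_content)
    return "\n\n".join(parts)


def load_pdf_with_html_tables(path: str) -> str:
    """Load a PDF element by element and render tables as HTML."""
    # Imported lazily so documents_to_text can be used without LangChain installed.
    from langchain.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader(
        path,
        mode="elements",  # one Document per element, metadata preserved
        strategy="hi_res",
        infer_table_structure=True,
    )
    return documents_to_text(loader.load())
```

Writing the result of `load_pdf_with_html_tables('1.pdf')` to a file would take the place of the `str(document[0])` step in the original snippet.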