bug/partition_pdf can't convert table to html table format

praguepp commented 1 year ago

Describe the bug

I used UnstructuredFileLoader (from langchain.document_loaders) to convert a PDF that contains text, tables, and images into text output that keeps only the text and HTML tables. However, no tables are converted to HTML with the code below. How should I call the loader and pass arguments to fix this?

Code listing:

from langchain.document_loaders import Docx2txtLoader
from langchain.document_loaders import UnstructuredFileLoader

# infer_table_structure is passed explicitly as True (the original snippet
# referenced an undefined variable of the same name).
loader = UnstructuredFileLoader('1.pdf', mode='single', infer_table_structure=True, strategy='hi_res')
document = loader.load()

d = str(document[0])

with open('pdfhtml', 'w') as f:
    f.write(d)

To Reproduce

@process_metadata()
@add_metadata_with_filetype(FileType.PDF)
def partition_pdf(
    filename: str = "",
    file: Optional[Union[BinaryIO, SpooledTemporaryFile]] = None,
    include_page_breaks: bool = False,
    strategy: str = "auto",
    infer_table_structure: bool = False,
    ocr_languages: str = "eng",
    max_partition: Optional[int] = 1500,
    include_metadata: bool = True,
    metadata_filename: Optional[str] = None,
    **kwargs,
) -> List[Element]:
    """Parses a pdf document into a list of interpreted elements.

    Parameters
    ----------
    filename
        A string defining the target filename path.
    file
        A file-like object as bytes --> open(filename, "rb").
    strategy
        The strategy to use for partitioning the PDF. Valid strategies are "hi_res",
        "ocr_only", and "fast". When using the "hi_res" strategy, the function uses
        a layout detection model to identify document elements. When using the
        "ocr_only" strategy, partition_pdf simply extracts the text from the
        document using OCR and processes it. If the "fast" strategy is used, the text
        is extracted directly from the PDF. The default strategy `auto` will determine
        when a page can be extracted using `fast` mode, otherwise it will fall back to `hi_res`.
    infer_table_structure
        Only applicable if `strategy=hi_res`.
        If True, any Table elements that are extracted will also have a metadata field
        named "text_as_html" where the table's text content is rendered into an html string.
        I.e., rows and cells are preserved.
        Whether True or False, the "text" field is always present in any Table element
        and is the text content of the table (no structure).
    ocr_languages
        The languages to use for the Tesseract agent. To use a language, you'll first need
        to install the appropriate Tesseract language pack.
    max_partition
        The maximum number of characters to include in a partition. If None is passed,
        no maximum is applied. Only applies to the "ocr_only" strategy.
    """
    exactly_one(filename=filename, file=file)
    return partition_pdf_or_image(
        filename=filename,
        file=file,
        include_page_breaks=include_page_breaks,
        # strategy=strategy,
        # infer_table_structure=infer_table_structure,
        # Local change: always infer table structure.
        infer_table_structure=True,
        # Local change: always use the hi_res strategy.
        strategy='hi_res',
        ocr_languages=ocr_languages,
        max_partition=max_partition,
        **kwargs,
    )
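The docstring above says the HTML rendering lives in a metadata field named "text_as_html" on Table elements, not in the element text. A minimal sketch of reading it when calling partition_pdf directly, without patching the library, might look like the following. The helper names collect_table_html and tables_from_pdf are hypothetical, and the check on a "category" attribute equal to "Table" is an assumption about how elements identify themselves in the installed unstructured version.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def collect_table_html(elements: Iterable[Any]) -> List[str]:
    """Return the text_as_html string of every Table element that has one."""
    html_tables = []
    for el in elements:
        # Assumption: Table elements expose category == "Table".
        if getattr(el, "category", None) != "Table":
            continue
        html = getattr(getattr(el, "metadata", None), "text_as_html", None)
        if html:
            html_tables.append(html)
    return html_tables


def tables_from_pdf(path: str) -> List[str]:
    """Partition a PDF with hi_res and table inference (needs unstructured installed)."""
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(
        filename=path,
        strategy="hi_res",
        infer_table_structure=True,
    )
    return collect_table_html(elements)


# Stand-in objects so the helper can be tried without a PDF or the library.
example = [
    SimpleNamespace(category="Title", metadata=SimpleNamespace(text_as_html=None)),
    SimpleNamespace(
        category="Table",
        metadata=SimpleNamespace(text_as_html="<table><tr><td>SNMP</td></tr></table>"),
    ),
]
print(collect_table_html(example))
```

Passing strategy and infer_table_structure as arguments this way avoids hard-coding them inside partition_pdf.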

Expected behavior

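\n\nDisconnect the device from the network, then contact technical support to confirm the hardware fault and return it for repair.\n\nAbbreviations\n\n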

\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n

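Abbreviation	English Full Form	Chinese Full Form
SNMP	Simple Network Management Protocol	Simple Network Management Protocol
RADIUS	Remote Authentication Dial In User Service	Remote Dial-Up User Authentication Service
AP	Access Point	Wireless
DNS	Domain Name System	Domain Name System (Service) Protocol
LDAP	Lightweight Directory Access Protocol	Lightweight Directory Access Protocol
DHCP	Dynamic Host Configuration Protocol	Dynamic Host Configuration Protocol
ARP	Address Resolution Protocol	Address Resolution Protocol
TCP	Transmission Control Protocol	Transmission Control Protocol
VLAN	Virtual Local Area Network	Virtual Local Area Network
NAT	Network Address Translation	Network Address Translation
BBC	Branch Bussiness Center	BBC
IM	Instant Messaging	Communication Software
BA	Behavior Awareness	Log Analysis Platform
EDR	Endpoint Detection and Response	Endpoint Inspection Response Platform
AD	Active Directory	Active Directory
HA	High Availability	High Availability
MTU	Maximum Transmission Unit	Maximum Transmission Unit
MSS	Maximum Segment Size	Maximum Segment Length
SAAS	Software as service	Software Level Service
MAB	MAC Authentication Bypass	MAC Address-Based IEEE 802.1x Exemption Authentication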

\n\nUser manual classification: public' metadata={'source': 'test3.docx'}root@9070a3bd3b7d:/#

Screenshots

Abbreviations

Abbreviations	English Full Form	Chinese Full Form
SNMP	Simple Network Management Protocol	Simple Network Management Protocol
RADIUS	Remote Authentication Dial In User Service	Remote Dial-Up User Authentication Service
AP	Access Point	Wireless
DNS	Domain Name System	Domain Name System(Service) Protocol
LDAP	Lightweight Directory Access Protocol	Lightweight Directory Access Protocol
DHCP	Dynamic Host Configuration Protocol	Dynamic Host Configuration Protocol
ARP	Address Resolution Protocol	Address Resolution Protocol
TCP	Transmission Control Protocol	Transmission Control Protocol
VLAN	Virtual Local Area Network	Virtual Local Area Network
NAT	Network Address Translation	Network Address Translation
BBC	Branch Bussiness Center	BBC
IM	Instant Messaging	Communication Software
BA	Behavior Awareness	Log Analysis Platform
EDR	Endpoint Detection and Response	Endpoint Inspection Response Platform
AD	Active Directory	Active Directory
HA	High Availability	High Availability
MTU	Maximum Transmission Unit	Maximum Transmission Unit
MSS	Maximum Segment Size	Maximum Segment Length
SAAS	Software as service	Software Level Service
MAB	MAC Authentication Bypass	MAC Address-Based IEEE 802.1x Exemption Authentication

praguepp commented 1 year ago

strategy
    The strategy to use for partitioning the PDF. Valid strategies are "hi_res",
    "ocr_only", and "fast". When using the "hi_res" strategy, the function uses
    a layout detection model to identify document elements. When using the
    "ocr_only" strategy, partition_pdf simply extracts the text from the
    document using OCR and processes it. If the "fast" strategy is used, the text
    is extracted directly from the PDF. The default strategy `auto` will determine
    when a page can be extracted using `fast` mode, otherwise it will fall back to `hi_res`.
infer_table_structure
    Only applicable if `strategy=hi_res`.
    If True, any Table elements that are extracted will also have a metadata field
    named "text_as_html" where the table's text content is rendered into an html string.
    I.e., rows and cells are preserved.
    Whether True or False, the "text" field is always present in any Table element
    and is the text content of the table (no structure).

MthwRobinson commented 1 year ago

Hi @praguepp - it looks like you're using the loader in "single" mode. You'll need to use the loader in "elements" mode to get the HTML representation of the table. It will be available in the document metadata.
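A minimal sketch of that suggestion, assuming the loader forwards strategy and infer_table_structure to unstructured and that each returned document carries the element metadata (including "text_as_html" for tables) in its metadata dict. Doc is a stand-in for langchain's Document, and render_with_html_tables and load_pdf_as_text_and_html are hypothetical helper names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable


@dataclass
class Doc:
    """Stand-in for langchain's Document (page_content plus metadata)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def render_with_html_tables(docs: Iterable[Any]) -> str:
    """Join element texts, substituting metadata["text_as_html"] where present."""
    parts = []
    for doc in docs:
        html = doc.metadata.get("text_as_html")
        parts.append(html if html else doc.page_content)
    return "\n\n".join(parts)


def load_pdf_as_text_and_html(path: str) -> str:
    """Load a PDF in "elements" mode (needs langchain and unstructured installed)."""
    from langchain.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader(
        path,
        mode="elements",  # one document per element, each with its own metadata
        strategy="hi_res",
        infer_table_structure=True,
    )
    return render_with_html_tables(loader.load())
```

With mode="single" the elements are concatenated into one document, so per-table metadata such as the HTML rendering has nowhere to go; "elements" mode keeps one document per element, which is why the table HTML shows up there.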

Unstructured-IO / unstructured

bug/partition_pdf can't convert table to html table format #956