Closed (praguepp closed this issue 1 year ago)
strategy
The strategy to use for partitioning the PDF. Valid strategies are "hi_res",
"ocr_only", and "fast". When using the "hi_res" strategy, the function uses
a layout detection model to identify document elements. When using the
"ocr_only" strategy, partition_pdf simply extracts the text from the
document using OCR and processes it. If the "fast" strategy is used, the text
is extracted directly from the PDF. The default strategy `auto` will determine
when a page can be extracted using `fast` mode, otherwise it will fall back to `hi_res`.
infer_table_structure
Only applicable if `strategy=hi_res`.
If True, any Table elements that are extracted will also have a metadata field
named "text_as_html" where the table's text content is rendered into an html string.
I.e., rows and cells are preserved.
Whether True or False, the "text" field is always present in any Table element
and is the text content of the table (no structure).
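A minimal sketch of how the two fields described above could be read side by side when calling `partition_pdf` directly. The helper names `table_text_and_html` and `partition_with_tables` are hypothetical, and the demo uses stand-in objects rather than real elements; it assumes Table elements expose `category`, `text`, and `metadata.text_as_html` as the docstring describes.

```python
from types import SimpleNamespace  # used only to build stand-in elements for the demo


def table_text_and_html(elements):
    """Return (text, html) pairs for every element whose category is "Table".

    html is None when no "text_as_html" metadata field is present,
    e.g. when infer_table_structure was left at False.
    """
    pairs = []
    for el in elements:
        if getattr(el, "category", None) != "Table":
            continue
        html = getattr(el.metadata, "text_as_html", None)
        pairs.append((el.text, html))
    return pairs


def partition_with_tables(path):
    """Hypothetical wrapper: partition a PDF with table structure inference on."""
    # Imported lazily so the helper above can be tried without unstructured installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=path, strategy="hi_res", infer_table_structure=True)


# Stand-in elements (not real partition output) to show the shape of the result.
demo_elements = [
    SimpleNamespace(category="Title", text="Abbreviations", metadata=SimpleNamespace()),
    SimpleNamespace(
        category="Table",
        text="AP Access Point",
        metadata=SimpleNamespace(text_as_html="<table><tr><td>AP</td><td>Access Point</td></tr></table>"),
    ),
    SimpleNamespace(category="Table", text="HA High Availability", metadata=SimpleNamespace()),
]
demo_pairs = table_text_and_html(demo_elements)
for text, html in demo_pairs:
    print(text, "->", html)
```

With a real file, `table_text_and_html(partition_with_tables("1.pdf"))` would be the equivalent call.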
Hi @praguepp - it looks like you're using the loader in "single"
mode. You'll need to use the loader in "elements"
mode to get the HTML representation of the table. It will be available in the document metadata.
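The suggestion above could look roughly like the following sketch. It assumes that in "elements" mode each returned document carries the element's metadata as a dict and that the HTML lives under the `text_as_html` key named in the `partition_pdf` docstring. The helper names `load_pdf_as_elements` and `render_with_html_tables` are hypothetical, and the demo runs on stand-in documents rather than real loader output.

```python
from types import SimpleNamespace  # used only to build stand-in documents for the demo


def load_pdf_as_elements(path):
    """Hypothetical wrapper: load a PDF with one document per element."""
    # Imported lazily so the rendering helper can be tried without langchain installed.
    from langchain.document_loaders import UnstructuredFileLoader

    loader = UnstructuredFileLoader(
        path,
        mode="elements",             # "single" merges everything into one text blob
        strategy="hi_res",
        infer_table_structure=True,  # forwarded to the partition function
    )
    return loader.load()


def render_with_html_tables(documents):
    """Join document texts, using the HTML rendering wherever one is available."""
    parts = []
    for doc in documents:
        html = doc.metadata.get("text_as_html")
        parts.append(html if html else doc.page_content)
    return "\n\n".join(parts)


# Stand-in documents (not real loader output) to show the rendering step.
demo_docs = [
    SimpleNamespace(page_content="Abbreviations", metadata={"category": "Title"}),
    SimpleNamespace(
        page_content="DNS Domain Name System",
        metadata={
            "category": "Table",
            "text_as_html": "<table><tr><td>DNS</td><td>Domain Name System</td></tr></table>",
        },
    ),
]
rendered = render_with_html_tables(demo_docs)
print(rendered)
```

With a real file, `render_with_html_tables(load_pdf_as_elements("1.pdf"))` would produce text in which tables appear as HTML and everything else as plain text.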
Describe the bug
"I used from langchain.document_loaders import UnstructuredFileLoader to convert a PDF that contains text, tables, and images into a text output that only contains text and HTML tables. However, I discovered that no HTML tables are being converted using the method below. How can I correctly call and pass arguments to solve this issue?"
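Code listing below:
from langchain.document_loaders import Docx2txtLoader
from langchain.document_loaders import UnstructuredFileLoader

# The original passed infer_table_structure=infer_table_structure, an undefined name; True is assumed here.
loader = UnstructuredFileLoader('1.pdf', mode='single', infer_table_structure=True, strategy='hi_res')
document = loader.load()
# In 'single' mode the loader returns one document, so only document[0] exists.
d = str(document[0])
with open('pdfhtml', 'w') as f:
    f.write(d)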
To Reproduce
@process_metadata()
@add_metadata_with_filetype(FileType.PDF)
def partition_pdf(
    filename: str = "",
    file: Optional[Union[BinaryIO, SpooledTemporaryFile]] = None,
    include_page_breaks: bool = False,
    strategy: str = "auto",
    infer_table_structure: bool = False,
    ocr_languages: str = "eng",
    max_partition: Optional[int] = 1500,
    include_metadata: bool = True,
    metadata_filename: Optional[str] = None,
    **kwargs,
) -> List[Element]:
    """Parses a pdf document into a list of interpreted elements.

    Parameters
Expected behavior
\n\nDisconnect the device from the network, then contact technical support to confirm the hardware fault and return the device for repair.\n\nAbbreviations\n\n
Screenshots
Abbreviations
\n\n Abbreviations English Full Form Chinese Full Form SNMP Simple Network Management Protocol Simple Network Management Protocol RADIUS Remote Authentication Dial In User Service Remote Dial-Up User Authentication Service AP Access Point Wireless DNS Domain Name System Domain Name System(Service) Protocol LDAP Lightweight Directory Access Protocol Lightweight Directory Access Protocol DHCP Dynamic Host Configuration Protocol Dynamic Host Configuration Protocol ARP Address Resolution Protocol Address Resolution Protocol TCP Transmission Control Protocol Transmission Control Protocol VLAN Virtual Local Area Network Virtual Local Area Network NAT Network Address Translation Network Address Translation BBC Branch Bussiness Center BBC IM Instant Messaging Communication Software BA Behavior Awareness Log Analysis Platform EDR Endpoint Detection and Response Endpoint Inspection Response Platform AD Active Directory Active Directory HA High Availability High Availability MTU Maximum Transmission Unit Maximum Transmission Unit MSS Maximum Segment Size Maximum Segment Length SAAS Software as service Software Level Service MAB MAC Authentication Bypass MAC Address-Based IEEE 802.1x Exemption Authentication