DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License

cannot import name 'TextPipelineOptions' from 'docling.datamodel.pipeline_options' #360

Closed adrianzhang closed 4 days ago

adrianzhang commented 4 days ago

Bug

When I run a Python script that depends on docling, it always fails with an error saying that {doc type}PipelineOptions cannot be imported. The affected types are HTML, Doc, Text, ...

Steps to reproduce

My code:

import os
import sys
from pathlib import Path

from docling.document_converter import DocumentConverter
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TextPipelineOptions,
    RtfPipelineOptions,
    EasyOcrOptions,
)
from docling.datamodel.base_models import InputFormat

def process_document(input_file):
    # Create a DocumentConverter instance
    converter = DocumentConverter()

    # Determine the input file type
    input_file_type = input_file.suffix.lower()

    # Set up pipeline options based on input file type
    if input_file_type == '.pdf':
        pipeline_options = PdfPipelineOptions()
        if is_image_pdf(input_file):
            pipeline_options.do_ocr = True
            pipeline_options.ocr_options = EasyOcrOptions()
        else:
            pipeline_options.do_ocr = False
    elif input_file_type == '.txt':
        pipeline_options = TextPipelineOptions()
    elif input_file_type == '.rtf':
        pipeline_options = RtfPipelineOptions()
    elif input_file_type == '.html':
        pipeline_options = PdfPipelineOptions()  # use PdfPipelineOptions instead
    elif input_file_type == '.md':
        pipeline_options = PdfPipelineOptions()  # use PdfPipelineOptions instead
    else:
        print(f"Unsupported file type: {input_file_type}")
        return

    # Convert the document
    result = converter.convert(input_file, pipeline_options=pipeline_options)

    # Check if the conversion was successful
    if result.status == ConversionStatus.SUCCESS:
        # Get the converted markdown text
        markdown_text = result.document.export_to_markdown()

        # Save the markdown text to a file in the converted directory
        output_dir = Path("converted")
        output_dir.mkdir(parents=True, exist_ok=True)
        output_file = output_dir / f"{input_file.stem}.md"
        with output_file.open("w") as f:
            f.write(markdown_text)

        print(f"Document converted successfully: {input_file.name}")
    else:
        print(f"Document conversion failed: {input_file.name}")

def is_image_pdf(input_file):
    # Check if the input PDF is an image-based PDF
    with input_file.open('rb') as f:
        first_page = f.read(1024)
        return first_page.startswith(b'%PDF-')

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python process_docs.py ")
        sys.exit(1)

    input_file = Path(sys.argv[1])
    process_document(input_file)

...

Docling version

docling 2.5.2
docling-core 2.3.2
docling-ibm-models 2.0.3
docling-parse 2.0.4
...
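For anyone reproducing this, the installed versions can also be collected from Python rather than copied by hand. The following is a minimal sketch using the standard library's `importlib.metadata`; the `PACKAGES` list and the `installed_versions` helper are illustrative names, not part of Docling.

```python
from importlib.metadata import PackageNotFoundError, version

# Package names taken from the version list in this report.
PACKAGES = ["docling", "docling-core", "docling-ibm-models", "docling-parse"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name} {ver or 'not installed'}")
```

Running this in the same virtual environment as the failing script confirms which Docling release the imports are being resolved against.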

Python version

python --version
Python 3.11.1
...

dolfim-ibm commented 4 days ago

Where are you getting this code from? The following imports are not part of Docling:

TextPipelineOptions,
RtfPipelineOptions,
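A quick way to check which option classes an installed module really exports is to list its public names. This is a small sketch using only the standard library; `exported_names` is a hypothetical helper written for this thread, not a Docling function.

```python
import importlib


def exported_names(module_name: str, suffix: str) -> list[str]:
    """Return the public names in a module that end with the given suffix."""
    module = importlib.import_module(module_name)
    return sorted(
        name
        for name in dir(module)
        if name.endswith(suffix) and not name.startswith("_")
    )
```

With docling installed, `exported_names("docling.datamodel.pipeline_options", "PipelineOptions")` shows the classes that can be imported from that module. Any name missing from the result, such as the two above, will raise the ImportError from the title.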

I would suggest looking at the examples we have in the docs: https://ds4sd.github.io/docling/examples/.
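As a rough sketch of the pattern those examples use in the 2.x line, as far as I understand it: pipeline options are attached per input format when the converter is constructed, rather than passed to `convert()`. The helper names `build_converter`, `markdown_path`, and `convert_to_markdown` are made up for illustration, and the Docling calls should be checked against the installed version.

```python
from pathlib import Path


def build_converter(do_ocr: bool = False):
    """Return a DocumentConverter whose PDF pipeline uses the given OCR setting."""
    # Imported lazily so this module still loads when docling is not installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = do_ocr
    # Options are registered per input format; other formats keep their defaults.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


def markdown_path(input_file: Path, output_dir: Path = Path("converted")) -> Path:
    """Return the output path for the Markdown export of input_file."""
    return output_dir / f"{input_file.stem}.md"


def convert_to_markdown(input_file: Path, do_ocr: bool = False) -> Path:
    """Convert one document and write its Markdown export to disk."""
    converter = build_converter(do_ocr)
    result = converter.convert(input_file)
    out = markdown_path(input_file)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return out
```

This sketch configures only the PDF pipeline, so it needs no `TextPipelineOptions` or `RtfPipelineOptions` classes. Whether a given suffix such as `.txt` or `.rtf` is accepted at all depends on the formats the installed Docling version supports.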