awslabs / project-lakechain

:zap: Cloud-native, AI-powered, document processing pipelines on AWS.
https://awslabs.github.io/project-lakechain/
Apache License 2.0

Docs: Create own middlewares #43

Open mrtj opened 1 month ago

mrtj commented 1 month ago

What were you searching in the docs?

Currently, the FAQ section of the documentation states that "we are currently working on a developer handbook to help you write your own middlewares. Stay tuned!". However, this phrase has been unchanged for at least a couple of months. I would like more guidance on how to create my own middlewares, and I suggest tracking this documentation enhancement in this issue.

Is this related to an existing documentation section?

https://awslabs.github.io/project-lakechain/general/faq/

How can we improve?

Release the handbook explaining how to create your own middlewares.


HQarroum commented 1 month ago

Hi @mrtj! Thanks for your feedback. Yes, indeed, we've been working on this but have not prioritized it immediately. Our target is to deliver the first beta release candidate of Project Lakechain (it is currently in Alpha) in September, which will contain information on how developers can create their own middlewares using a stable API.

Out of curiosity, are there any middleware ideas that you are able to share with us?

Thanks!

mrtj commented 1 month ago

Hello,

I have a particular pipeline in mind for parsing PDF files with complex layouts. Example documents might include product brochures, maintenance manuals, and technical guides. These documents tend to contain very mixed content with complex layouts: texts with multiple columns, different kinds of lists, intricate tables with in-table sections and headers, product photos, and technical or wiring diagrams.

I tried various PDF parser Python libraries (pdfminer, pdfplumber, pypdf, pymupdf, etc.), but they often mess up the natural reading order and the table layouts. I also tried passing the page as an image to multimodal LLMs like Claude v3.5; it works quite well but still struggles with complex tables, likely due to insufficient topographic capabilities. Finally, I found that Amazon Textract with the layout feature, combined with the textractor library, works best for converting a PDF page into HTML or Markdown format. However, it does not return useful results from the figures on the page; parsing the text in a technical diagram does not yield anything useful.

So, I came up with the idea of cropping the figure from the image version of the page, passing it to a multimodal LLM with a prompt asking it to describe the image in as much detail as possible, and then injecting the description back into the page contents as returned by textractor. This text description should contain as much information as possible from the original page and would be most useful in downstream applications like a RAG Q&A agent.
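For illustration, the describe step could look roughly like the sketch below using the Bedrock runtime API, assuming the figure has already been cropped to a PIL image; the model ID, prompt, and the describe_figure name are placeholders, not part of any existing pipeline:

import base64
import io
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

def describe_figure(figure_image):
    # Encode the cropped PIL image as base64 PNG, as expected by the
    # Anthropic messages format on Bedrock.
    buffer = io.BytesIO()
    figure_image.save(buffer, format="PNG")
    payload = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(buffer.getvalue()).decode(),
                }},
                {"type": "text",
                 "text": "Describe this figure in as much detail as possible."},
            ],
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps(payload),
    )
    # The model's textual description of the figure.
    return json.loads(response["body"].read())["content"][0]["text"]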

I was wondering if this pipeline could be implemented in Lakechain. I saw that several components are already present, but I did not find anything related to calling Textract yet, and some other steps seem to be missing as well.

NB: I plan to write an article about this pipeline once I find a robust way of implementing it (hopefully in Lakechain). Please keep this idea within this issue until then.

HQarroum commented 1 month ago

I can echo everything you mentioned about PDF parsing across all of our internal experiments; it is quite complicated. For a customer, we ended up doing exactly what you mentioned: cropping a portion of a PDF page and combining Textract with a vision model, and it worked quite well in this specific case.

To answer your question, there aren't any Textract middlewares because it is quite difficult to abstract away the results provided by Textract into something that other middlewares can consume, but I haven't given up on the idea.

Regarding your pipeline idea, the difficulty would be to identify the figure reliably within the PDF page, especially if the figure layout tends to change across PDFs (it does not reside within fixed bounding box coordinates), which was our case (hand-written or scanned pages). We ended up using a table detection model, extracting the bounding box as an image, and passing it to a multimodal model for data extraction (you can do that reliably using a Tool now with the Bedrock API).
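For what it's worth, a tool-forced extraction along those lines could be sketched like this with the Bedrock Converse API; the model ID, tool name, schema, and input image path are all made up here for illustration:

import boto3

bedrock = boto3.client("bedrock-runtime")

with open("table-crop.png", "rb") as f:  # hypothetical cropped table image
    table_png_bytes = f.read()

# A tool schema that forces the model to answer with structured rows
# instead of free-form text.
tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "record_table",
            "description": "Records the rows extracted from a table image.",
            "inputSchema": {"json": {
                "type": "object",
                "properties": {
                    "rows": {
                        "type": "array",
                        "items": {"type": "array", "items": {"type": "string"}},
                    },
                },
                "required": ["rows"],
            }},
        }
    }],
    # Force the model to call this specific tool.
    "toolChoice": {"tool": {"name": "record_table"}},
}

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[{"role": "user", "content": [
        {"image": {"format": "png", "source": {"bytes": table_png_bytes}}},
        {"text": "Extract every row of this table."},
    ]}],
    toolConfig=tool_config,
)

# The forced tool call comes back as a toolUse content block.
tool_use = next(
    block["toolUse"]
    for block in response["output"]["message"]["content"]
    if "toolUse" in block
)
rows = tool_use["input"]["rows"]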

I don't think you can do all of that natively using the existing middlewares as you found out. I'd recommend you use a more custom approach to implement this logic (sorry for that!).

mrtj commented 1 month ago

I would like to add some more information about my experiments. First, I cropped the figures based on the bounding box coordinates returned by Textract and post-processed by the textractor library. It was really nothing complicated; some pseudo-code (in Python) looked like this:

from textractor import Textractor
from textractor.data.constants import TextractFeatures

def crop_figure(page, figure):
    # Textract returns normalized (0..1) bounding box coordinates,
    # so scale them by the rendered page image size before cropping.
    bbox = figure.bbox
    width, height = page.image.size
    return page.image.crop((
        bbox.x * width,
        bbox.y * height,
        (bbox.x + bbox.width) * width,
        (bbox.y + bbox.height) * height
    ))

extractor = Textractor()
# save_image=True keeps a rasterized image of each page, which is
# what crop_figure() uses to cut out the figures.
document = extractor.start_document_analysis(
    file_source="./complex-layout.pdf",
    features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
    save_image=True
)

for page_idx, page in enumerate(document.pages):
    for fig_idx, fig in enumerate(page.page_layout.figures):
        img = crop_figure(page, fig)
        img.save(f"page{page_idx:04d}-fig{fig_idx:04d}.png")

Regarding saving the Textract output: I would simply save the raw JSON that the Textract service returns in an S3 bucket. Later on, users could parse it, maybe with funclets, or it would be even more convenient to be able to use the textractor library. However, to do so, there should be a way to inject custom, user-written Python code into the Lambda workers, and I have no idea how to do that.
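As a sketch of that parse-later step, textractor can rebuild a Document from the raw JSON via its response parser; the bucket and key below are hypothetical:

import json

import boto3
from textractor.parsers import response_parser

s3 = boto3.client("s3")

# Load the raw Textract JSON previously saved to S3
# (bucket and key are placeholders).
raw = s3.get_object(Bucket="my-pipeline-bucket", Key="textract/output.json")
document = response_parser.parse(json.loads(raw["Body"].read()))

# From here on, the full textractor API is available again.
print(document.text)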

As an alternative solution, I think it would already be very useful to have a middleware that calls Textract, parses the result with textractor, and saves the output in Markdown/HTML format. It would definitely work better than the Python PDF parsing libraries.

from textractor.data.html_linearization_config import HTMLLinearizationConfig
config = HTMLLinearizationConfig()
# maybe allow users to customize the linearization config using middleware params
html_text = document.get_text(config)
# save the html_text into s3
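The final save step hinted at above could be as simple as the following; bucket and key are placeholders:

import boto3

s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-pipeline-bucket",       # placeholder bucket
    Key="parsed/complex-layout.html",  # placeholder key
    Body=html_text.encode("utf-8"),
    ContentType="text/html",
)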

HQarroum commented 1 month ago

I tested textractor this evening and it is pretty cool; it works very well. I think that creating a Textract middleware based on textractor makes a lot of sense.

I came up with the following tentative API design for this middleware. I think it covers most of the capabilities offered by textractor. What do you think?

Table data extraction. Input(s): PDF, images. Output(s): 'markdown', 'text', 'excel', 'csv', and/or 'html'.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new TableExtractionTask.Builder()
    .withOutputType('markdown' | 'text' | 'excel' | 'csv' | 'html')
    // Defines whether a document will be created for each table,
    // or whether to group them all in one document.
    .withGroupOutput(false)
    .build())
  .build();

Key-value pair extraction. Input(s): PDF, images. Output(s): 'json' or 'csv'.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new KvExtractionTask.Builder()
    .withOutputType('json' | 'csv')
    .build())
  .build();

Visualization task. Input(s): PDF, images. Output(s): one or multiple images.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new ImageVisualizationTask.Builder()
    .withCheckboxes(true)
    .withKeyValues(true)
    .withTables(true)
    .withSearch('rent', { top_k: 10 })
    .build())
  .build();

Expense analysis. Input(s): PDF, images. Output(s): CSV.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new ExpenseAnalysisTask.Builder()
    .withOutputType('csv')
    .build())
  .build();

ID analysis. Input(s): PDF, images. Output(s): JSON, CSV.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new IdAnalysisTask.Builder()
    .withOutputType('json' | 'csv')
    .build())
  .build();

Layout analysis. Input(s): PDF, images. Output(s): PDF, images + metadata. Exports layout information in a structured way in the document metadata.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new LayoutAnalysisTask.Builder()
    .build())
  .build();

mrtj commented 1 month ago

This would be a really great feature! May I also suggest adding the text linearization feature of textractor, as described here?
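For example, something along these lines; this is only a sketch, and while the two config fields shown do exist on textractor's TextLinearizationConfig, the exact set of options a middleware would expose is an open question:

from textractor.data.text_linearization_config import TextLinearizationConfig

config = TextLinearizationConfig(
    table_linearization_format="markdown",  # render tables as Markdown
    hide_figure_layout=True,                # drop figure placeholders
)
markdown_text = document.get_text(config)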

Also, this conversation seems to deviate from the original "create own middlewares" topic, maybe we should continue it in a new issue?

HQarroum commented 1 month ago

Follow up discussion here - https://github.com/awslabs/project-lakechain/issues/46.