awslabs / project-lakechain

:zap: Cloud-native, AI-powered, document processing pipelines on AWS.
https://awslabs.github.io/project-lakechain/
Apache License 2.0
140 stars, 22 forks

Using as a package/construct in external cdk apps #32

Open OperationalFallacy opened 7 months ago

OperationalFallacy commented 7 months ago

What were you searching in the docs?

Hi,

I wanted to add the resources from Lakechain into my own cdk app and realized the construct relies on custom-compiled middleware.

Is it possible to use Lakechain constructs in such a way? Or is this monorepo not designed for such use cases?

Thank you

Is this related to an existing documentation section?

No response

How can we improve?

Not sure. If the repo structure and published packages allow this, it could be documented.

HQarroum commented 7 months ago

Hi @OperationalFallacy! Yes, it is possible to use the constructs externally by simply referencing them as NPM modules. All packages are available on the NPM registry.

We deliberately do not document this, however, as the NPM packages might break or change until we reach 1.0.0. We do not recommend using Lakechain in production for now, and if you do use the constructs in an external CDK project, we advise using fixed versions.
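
The advice about fixed versions can be checked mechanically. Below is a minimal sketch, assuming dependencies are held as a name-to-version map like the one in `package.json`. The helper `findUnpinned` is hypothetical and not part of Lakechain or npm. Note that for `0.x` releases a caret range such as `^0.7.0` still allows any `0.7.x` release, so it is not a fixed version.

```typescript
// Hypothetical helper for illustration; not part of Lakechain or npm.
type DependencyMap = Record<string, string>;

// Matches an exact version such as "0.7.0" or "1.2.3-beta.1" (no range operator).
const EXACT_VERSION = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$/;

// Returns the names of packages under `scope` whose version is not pinned exactly.
function findUnpinned(deps: DependencyMap, scope: string): string[] {
  const unpinned: string[] = [];
  for (const name of Object.keys(deps)) {
    if (name.startsWith(scope + '/') && !EXACT_VERSION.test(deps[name])) {
      unpinned.push(name);
    }
  }
  return unpinned;
}

// Example input; the version numbers are illustrative only.
const dependencies: DependencyMap = {
  '@project-lakechain/core': '0.7.0',                 // pinned
  '@project-lakechain/pdf-text-converter': '^0.7.0',  // range: allows 0.7.x
  'aws-cdk-lib': '^2.0.0'                             // outside the scope, ignored
};

const offenders = findUnpinned(dependencies, '@project-lakechain');
console.log(offenders);
```

A check like this could run in CI so that a Lakechain upgrade is always an explicit change rather than a side effect of reinstalling.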

OperationalFallacy commented 7 months ago

Do you have a roadmap by chance? I'm interested in something more robust for managing workflows like amazon-textract-idp-cdk does for example with step functions.

Using it as a standalone package, I think it still wants a build step (npm run build-pkg)?

These are the dependencies I've defined in the external project:

    "@project-lakechain/core@^0.7.0",
    "@project-lakechain/pdf-text-converter@^0.7.0",
    "@project-lakechain/pandoc-text-converter@^0.7.0",
    "@project-lakechain/recursive-character-text-splitter@^0.7.0",
    "@project-lakechain/bedrock-embedding-processors@^0.7.0",
    "@project-lakechain/bedrock-text-processors@^0.7.0",
    "@project-lakechain/pinecone-storage-connector@^0.7.0"

OperationalFallacy commented 5 months ago

Hey @HQarroum I want to follow up on my question. Are you considering adding an option to run async workflows in a step function or using another integration? The use cases are OCR and other types of text recognition with Textract service.

HQarroum commented 5 months ago

Hi Roman,

I'm so sorry I didn't get back to you on this! To answer your first question on the issue, I think that the issue you were encountering with external dependencies is because ESM needs to be enabled. I use the following tsconfig.json which would most likely fix your issue.

{
  "compilerOptions": {
    "target": "es2022",
    "module": "NodeNext",
    "lib": [
      "es2020",
      "dom"
    ],
    "moduleResolution": "nodenext",
    "outDir": "dist/",
    "removeComments": false,
    "declaration": true,
    "strict": true,
    "noImplicitAny": true,
    "strictNullChecks": true,
    "noImplicitThis": true,
    "alwaysStrict": true,
    "noUnusedLocals": false,
    "noUnusedParameters": false,
    "noImplicitReturns": true,
    "noFallthroughCasesInSwitch": false,
    "inlineSourceMap": true,
    "inlineSources": true,
    "experimentalDecorators": true,
    "strictPropertyInitialization": false,
    "typeRoots": ["./node_modules/@types"]
  },
  "exclude": [
    "node_modules",
    "cdk.out",
    "dist/"
  ]
}

Next, regarding the roadmap, it is here, but I have failed to keep it up to date over the past months because of the rate at which I was investing in new middlewares.

Regarding Step Functions integration, I thought about it for many months, and it is something I wanted to support, using an asStep() method on a middleware to use it natively with the Step Functions API. The biggest problem I hit with this approach, and which I'm trying to solve, is that Step Functions and Lakechain work inherently differently. While Step Functions can execute steps sequentially (or in parallel using the Parallel state), each step produces exactly one output. Lakechain is a bit different, as the framework is designed to support a massively parallel architecture, so a middleware execution with one document as an input can in fact yield 10, 100, or thousands of documents as an output. Think about the case of the PDF processor, which can be configured to take one PDF document as an input and output the different pages of that PDF in parallel. Because the two models differ, I need to find a way to reconcile them while keeping a satisfactory level of performance (avoiding Step Functions polling or anything like that) and the right level of developer experience (being able to have nice boxes associated with each middleware in the Step Functions workflow).
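
The mismatch between the two models can be sketched with a few types. This is only an illustration under stated assumptions: `Doc`, `Step`, `Middleware`, `splitPages`, `lift`, and `runPipeline` are hypothetical names, not Lakechain's or Step Functions' API, and the loop below runs sequentially whereas the framework described above processes documents in parallel.

```typescript
// Hypothetical types for illustration only.
interface Doc { id: string; pages: number; }

// Step-Functions-style step: one input, exactly one output.
type Step = (input: Doc) => Doc;

// Lakechain-style middleware: one input, any number of outputs.
type Middleware = (input: Doc) => Doc[];

const tagDocument: Step = (doc) => ({ id: doc.id + ':tagged', pages: doc.pages });

// Splits a multi-page document into one single-page document per page.
const splitPages: Middleware = (doc) => {
  const out: Doc[] = [];
  for (let page = 1; page <= doc.pages; page++) {
    out.push({ id: doc.id + '#page-' + page, pages: 1 });
  }
  return out;
};

// A Step can always be lifted into a Middleware; the reverse is not possible
// in general, because a Middleware may return zero or many documents.
const lift = (step: Step): Middleware => (doc) => [step(doc)];

// Feeds every document through each stage in order. The number of documents
// in flight can grow at each stage (sequential here for simplicity).
function runPipeline(inputs: Doc[], stages: Middleware[]): Doc[] {
  let current = inputs;
  for (const stage of stages) {
    const next: Doc[] = [];
    for (const doc of current) {
      next.push(...stage(doc));
    }
    current = next;
  }
  return current;
}

const result = runPipeline(
  [{ id: 'report.pdf', pages: 3 }],
  [splitPages, lift(tagDocument)]
);
console.log(result.map((d) => d.id));
```

One input document comes out as three, which is the behaviour a one-output-per-step workflow cannot express directly.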

Regarding your OCR use case, couldn't you try the AnthropicTextProcessor with an image-capable model like Sonnet 3.5 to output the content of the pages?

That was a bit long, sorry for that :).

OperationalFallacy commented 5 months ago

Oh, yeah - the ESM, thanks for pointing that out. I've started moving my projects from CommonJS, and that would probably have solved the problem.

Regarding Step Functions, it's not necessary. Textract async processing publishes results to an SNS topic, so it probably integrates with your framework. An async call can process multiple documents in parallel.

https://docs.aws.amazon.com/textract/latest/dg/api-async.html
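
For reference, here is a minimal sketch of how a consumer might handle that SNS message. As I understand the documentation linked above, the message announces that a job has completed, and the extracted text is then fetched by job ID with an operation such as GetDocumentTextDetection or GetDocumentAnalysis. The field names below follow my reading of those docs and should be verified against them. `parseNotification` is a hypothetical helper and the sample values are made up.

```typescript
// Assumed shape of the Textract completion message; verify against the docs.
interface TextractNotification {
  JobId: string;
  Status: string;      // e.g. "SUCCEEDED" or "FAILED"
  API: string;         // the Start* operation that created the job
  JobTag?: string;
  Timestamp: number;
  DocumentLocation: { S3ObjectName: string; S3Bucket: string };
}

// Hypothetical helper: parses the SNS message body and checks the key fields.
function parseNotification(snsMessage: string): TextractNotification {
  const parsed = JSON.parse(snsMessage);
  if (
    parsed === null ||
    typeof parsed !== 'object' ||
    typeof parsed.JobId !== 'string' ||
    typeof parsed.Status !== 'string'
  ) {
    throw new Error('Not a Textract completion notification');
  }
  return parsed as TextractNotification;
}

// Made-up sample message for illustration.
const sample = JSON.stringify({
  JobId: 'job-123',
  Status: 'SUCCEEDED',
  API: 'StartDocumentTextDetection',
  Timestamp: 1700000000000,
  DocumentLocation: { S3ObjectName: 'docs/report.pdf', S3Bucket: 'example-bucket' }
});

const notification = parseNotification(sample);

// Only a successful job has results worth fetching by JobId.
const readyToFetch = notification.Status === 'SUCCEEDED';
console.log(notification.JobId, readyToFetch);
```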

They used step functions specifically in amazon-textract-idp-cdk repo for routing and pre-processing logic, I think. Textract has a lot of options and output formats for processing.

The models can't do OCR well. Textract and similar services produce structured output, including form data. I'm even using it to extract text from PDF documents that are not images, i.e. they are text documents already. Why not open-source packages to parse PDFs? Because of the structured output and how robust the parsing is.

HQarroum commented 5 months ago

> Regarding step functions, it's not necessary [...] They used step functions specifically in amazon-textract-idp-cdk repo for routing and pre-processing logic, I think.

Got it!

> Textract async processing publishes results to an SNS topic, so it probably integrates with your framework. An async call can process multiple documents in parallel.

Definitely, it's possible to create a Textract middleware, or an additional engine in the existing PDF Text Converter middleware.

> I'm even using it to extract text from PDF documents that are not images, i.e. they are text documents already. Why not open-source packages to parse PDFs?

Great input, I'm adding it to my roadmap. Thanks Roman.

OperationalFallacy commented 5 months ago

You're right. The existing one could be easily refactored to use a different underlying service. I might get to this when I need more file types processed. I'll probably convert the existing OCR sfn into middleware.

Thank you for a great project!