[TASK]: Cache Document extraction

YohannParis commented 1 month ago

Describe the task

Right now, every time we send a PDF/Markdown file to become a Document asset, we run the knowledge extraction pipeline.
We should cache those results (S3/Redis/PostgreSQL ???) to avoid re-running the pipeline on a file we have already processed.
For now we could just compare the file size plus a SHA-256/MD5 checksum and treat an exact match as the same document.
This will matter even more once we are running costly ML models on GPUs.
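
A minimal sketch of the checksum idea, assuming the upload is available as a binary stream. The `fingerprint` helper and its `<sha256>:<size>` format are hypothetical, not existing Terarium code; it only shows that two uploads with identical bytes produce the same cache key.

```python
import hashlib
import io
from typing import BinaryIO


def fingerprint(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return '<sha256 hex>:<size in bytes>' for the stream's content.

    Hypothetical helper: reads in chunks so large PDFs are never
    loaded into memory all at once.
    """
    digest = hashlib.sha256()
    size = 0
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
        size += len(chunk)
    return f"{digest.hexdigest()}:{size}"


# Identical bytes give identical keys; any change gives a different key.
same_a = fingerprint(io.BytesIO(b"# Title\nsame bytes"))
same_b = fingerprint(io.BytesIO(b"# Title\nsame bytes"))
other = fingerprint(io.BytesIO(b"# Title\nother bytes"))
```

Note that this only detects byte-for-byte duplicates. Two PDFs with the same text but different metadata would still get different keys.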

kbirk commented 3 weeks ago

This is a bit tricky given how extraction is currently done. It isn't an operation that returns a response that can be cached. It's a very long method that makes several incremental changes, updating an object in place so that values are available for subsequent processing steps.

The proper way to do this would require splitting the logic into one stateless method that returns a final response containing all information required, and then another method that takes that response and applies it to a document.

This will let us cache the responses and then apply them to any document with the same input.
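
A rough sketch of that split, under the assumption that the extraction output can be represented as an immutable value. All names here (`ExtractionResult`, `extract`, `apply_result`, `ExtractionCache`) are hypothetical, and the keyword logic is a stand-in for the real pipeline.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class ExtractionResult:
    """Everything the pipeline produces, as one immutable value."""
    text: str
    keywords: tuple[str, ...]


@dataclass
class Document:
    name: str
    text: str = ""
    keywords: list[str] = field(default_factory=list)


def extract(content: bytes) -> ExtractionResult:
    """Stateless step: depends only on the input bytes (placeholder logic)."""
    text = content.decode("utf-8", errors="replace")
    words = sorted({w.lower() for w in text.split() if len(w) > 6})
    return ExtractionResult(text=text, keywords=tuple(words))


def apply_result(doc: Document, result: ExtractionResult) -> None:
    """Second step: copy a finished result onto a document."""
    doc.text = result.text
    doc.keywords = list(result.keywords)


class ExtractionCache:
    """In-memory cache keyed by the SHA-256 of the input bytes."""

    def __init__(self, extractor: Callable[[bytes], ExtractionResult]):
        self._extractor = extractor
        self._results: dict[str, ExtractionResult] = {}
        self.misses = 0  # how many times the real pipeline ran

    def get(self, content: bytes) -> ExtractionResult:
        key = hashlib.sha256(content).hexdigest()
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._extractor(content)
        return self._results[key]


cache = ExtractionCache(extract)
first, second = Document("a.md"), Document("copy-of-a.md")
apply_result(first, cache.get(b"hello extraction pipeline"))
apply_result(second, cache.get(b"hello extraction pipeline"))
```

Because `extract` never touches a `Document`, its result can be stored anywhere (a dict here, Redis or a table in practice) and replayed onto later uploads.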

YohannParis commented 1 week ago

@dvince2 @dgauldie how would you like to proceed on this? I think it's important to do the caching for that purpose.

YohannParis commented 1 week ago

Per discussion with @dgauldie:

Add a new column to the DocumentAsset table containing a hash of the document used to create the asset. This could be a simple MD5 plus the file size.
On upload of a document we compute its hash and compare it against existing rows; if a match exists, we clone the asset that was created from that hash instead of re-running extraction.
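
The flow above could look roughly like this. It is a sketch only, using an in-memory SQLite table as a stand-in for the real DocumentAsset table; the table layout, column names, and the `upload` and `run_pipeline` functions are all assumptions for illustration.

```python
import hashlib
import sqlite3
from typing import Callable


def content_hash(content: bytes) -> str:
    """MD5 plus size, as proposed; used for deduplication only, not security."""
    md5 = hashlib.md5(content, usedforsecurity=False).hexdigest()
    return f"{md5}-{len(content)}"


def open_db() -> sqlite3.Connection:
    """Create a hypothetical document_asset table with the new hash column."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE document_asset ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " content_hash TEXT NOT NULL,"
        " extraction TEXT)"
    )
    db.execute("CREATE INDEX idx_asset_hash ON document_asset (content_hash)")
    return db


def upload(
    db: sqlite3.Connection,
    name: str,
    content: bytes,
    run_pipeline: Callable[[bytes], str],
) -> int:
    """Insert a new asset, cloning the extraction from a matching hash if any."""
    digest = content_hash(content)
    row = db.execute(
        "SELECT extraction FROM document_asset"
        " WHERE content_hash = ? AND extraction IS NOT NULL LIMIT 1",
        (digest,),
    ).fetchone()
    # Only run the expensive pipeline when no earlier asset has this hash.
    extraction = row[0] if row else run_pipeline(content)
    cursor = db.execute(
        "INSERT INTO document_asset (name, content_hash, extraction)"
        " VALUES (?, ?, ?)",
        (name, digest, extraction),
    )
    db.commit()
    return cursor.lastrowid
```

MD5 is not collision resistant, so if uploads can come from untrusted users, swapping in SHA-256 in `content_hash` would be the safer choice at little extra cost.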

DARPA-ASKEM / terarium

[TASK]: Cache Document extraction #4418