epam / badgerdoc

Apache License 2.0

Add an S3 Signed URL as an Argument to Pipelines #840

Closed khyurri closed 3 months ago

khyurri commented 4 months ago

Currently, BadgerDoc executes pipelines with the files_data parameter, assuming that the pipeline engine can access the S3 file without any extra configuration (that is, the engine stores credentials on its own side). However, pipeline engines need to be able to download S3 files without storing authentication credentials on their side. This task therefore adds the capability to generate S3 signed URLs and pass them as arguments to the pipeline manager.

The PR https://github.com/epam/badgerdoc/pull/839 adds functionality to execute pipelines with additional arguments using the dataclass:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PipelineFile:
    bucket: str
    input: PipelineFileInput
    input_path: str
    pages: List[int]
    output_path: Optional[str] = None
    s3_signed_url: Optional[str] = None
    annotation_id: Optional[str] = None

The s3_signed_url field must be filled with the generated signed URL when BadgerDoc is configured with JOBS_RUN_PIPELINES_WITH_SIGNED_URL=True. This value can only be set to True if S3_PROVIDER is set to aws_iam. By default, JOBS_RUN_PIPELINES_WITH_SIGNED_URL is False.
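A minimal sketch of the behaviour described above, not the actual BadgerDoc implementation. It assumes that input_path is the S3 object key, that the S3 client exposes a boto3-style generate_presigned_url method (boto3.client("s3") does), and that a one-hour expiry is acceptable. The helper names validate_settings and fill_signed_urls are hypothetical, and the dataclass is a reduced copy with the input field omitted.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class PipelineFile:
    # Reduced copy of the dataclass from the PR; `input` is omitted for brevity.
    bucket: str
    input_path: str
    pages: List[int]
    output_path: Optional[str] = None
    s3_signed_url: Optional[str] = None
    annotation_id: Optional[str] = None


def validate_settings(s3_provider: str, use_signed_url: bool) -> None:
    """Reject signed URLs for any provider other than aws_iam (hypothetical helper)."""
    if use_signed_url and s3_provider != "aws_iam":
        raise ValueError(
            "JOBS_RUN_PIPELINES_WITH_SIGNED_URL=True requires S3_PROVIDER=aws_iam"
        )


def fill_signed_urls(
    files: List[PipelineFile],
    s3_client: Any,
    use_signed_url: bool,
    expires_in: int = 3600,  # assumed lifetime in seconds, not taken from the issue
) -> List[PipelineFile]:
    """Populate s3_signed_url on every file when the feature flag is enabled."""
    if not use_signed_url:
        # Default behaviour: leave s3_signed_url as None.
        return files
    for file in files:
        # Assumption: input_path is the object key inside `bucket`.
        file.s3_signed_url = s3_client.generate_presigned_url(
            "get_object",
            Params={"Bucket": file.bucket, "Key": file.input_path},
            ExpiresIn=expires_in,
        )
    return files
```

With a real client this would be called as fill_signed_urls(files, boto3.client("s3"), use_signed_url=True); the pipeline engine can then fetch each file with a plain HTTP GET on s3_signed_url without holding credentials.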

What needs to be changed

  1. Add JOBS_RUN_PIPELINES_WITH_SIGNED_URL= to .env.example
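One way the jobs service might read this variable, as a sketch only. The function name read_signed_url_flag and the set of accepted truthy spellings are assumptions; the issue only states that the default is False, which is what an empty or missing value yields here.

```python
import os
from typing import Mapping

# Assumed truthy spellings; the issue itself only uses "True".
_TRUE_VALUES = {"1", "true", "yes", "on"}


def read_signed_url_flag(env: Mapping[str, str] = os.environ) -> bool:
    """Parse JOBS_RUN_PIPELINES_WITH_SIGNED_URL, defaulting to False."""
    raw = env.get("JOBS_RUN_PIPELINES_WITH_SIGNED_URL", "")
    return raw.strip().lower() in _TRUE_VALUES
```

Leaving the value empty in .env.example, as proposed, therefore keeps the feature disabled until it is set explicitly.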

What needs to be additionally checked

Because a very large (practically unbounded) number of documents is expected to be passed as S3 signed URLs, integrating the aioboto3 library into the jobs microservice could be considered to speed up signed URL generation.
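A rough sketch of what concurrent generation could look like, under the assumption that the client's generate_presigned_url is awaitable, as it is on clients created through aioboto3. The function name generate_signed_urls, the concurrency limit, and the expiry are illustrative choices, not measured or recommended values.

```python
import asyncio
from typing import Any, Iterable, List, Tuple


async def generate_signed_urls(
    s3_client: Any,
    objects: Iterable[Tuple[str, str]],  # (bucket, key) pairs
    expires_in: int = 3600,              # assumed lifetime in seconds
    max_concurrency: int = 100,          # assumed cap on in-flight signing tasks
) -> List[str]:
    """Sign many objects concurrently; results keep the input order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def sign(bucket: str, key: str) -> str:
        async with semaphore:
            return await s3_client.generate_presigned_url(
                "get_object",
                Params={"Bucket": bucket, "Key": key},
                ExpiresIn=expires_in,
            )

    # gather() returns results in the order the awaitables were passed in.
    return list(await asyncio.gather(*(sign(b, k) for b, k in objects)))
```

With aioboto3 the client would typically be obtained via `async with aioboto3.Session().client("s3") as s3_client:`. Whether this is faster than synchronous boto3 in practice is exactly the open question above and would need to be benchmarked.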