epam / badgerdoc

Apache License 2.0

BadgerDoc can execute pipelines by retrieving tasks and revisions from `finished` jobs #848

Closed khyurri closed 3 months ago

khyurri commented 3 months ago

BadgerDoc should allow users to send file revisions back to the pipeline engine so that ML models can be improved after manual checks or annotations. After an annotation is committed, each file has revisions, and we should be able to send the latest revision to the pipeline together with the file. Users do not need to select individual files or revisions: selecting finished jobs automatically retrieves their tasks, files, and revisions.

Back-end

  1. We need to add or verify existing functionality to get a list of the latest revisions by task or by file.
  2. Add user validation: users can create new jobs from finished jobs only.
  3. When a new job is created, the back-end must check if a list of files, datasets or jobs is being passed.
  4. Add a new field previous_jobs to the job table. This field must be JSONB and contain the IDs of the passed jobs. All other fields (tasks, files) should be filled as they are now.
  5. BadgerDoc must send an event to Pipelines similar to the example below:
{
    "files_data": [
        {
            "revision": "00afbbcd-9628-479b-89cb-25aa893f46f4",
            "bucket": "local",
            "input": {
                "job_id": 47
            },
            "input_path": "files/344/344.pdf",
            "output_path": null,
            "pages": [
                8,
                9
            ],
            "file_id": 344,
            "s3_signed_url": "http://badgerdoc-minio:9000/local/files/344/344.pdf?AWSAccessKeyId=minioadmin&Signature=TfJOWzctdD8UcPkg3EsQBvpU8go%3D&Expires=1715783449"
        },
        {
            "revision": "cd13076f-9c10-4afc-bc8d-dbeca34ee857",
            "bucket": "local",
            "input": {
                "job_id": 46
            },
            "input_path": "files/344/344.pdf",
            "output_path": null,
            "pages": [
                8,
                9
            ],
            "file_id": 344,
            "s3_signed_url": "http://badgerdoc-minio:9000/local/files/344/344.pdf?AWSAccessKeyId=minioadmin&Signature=TfJOWzctdD8UcPkg3EsQBvpU8go%3D&Expires=1715783449"
        }
    ],
    "job_id": 48,
    "tenant": "local"
}

The revision field is filled with the latest revision of the task.
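A minimal sketch of how such an event might be assembled, assuming revisions are available as simple records. The `Revision` class, its field names, and the `latest_revisions` and `build_event` helpers are hypothetical and are not BadgerDoc's actual schema or API. The sketch uses the `(job_id, file_id)` pair as a stand-in for a task, and leaves out `s3_signed_url`, which would have to come from the storage layer.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class Revision:
    # Hypothetical record; field names are assumptions for illustration only.
    revision_id: str
    job_id: int
    file_id: int
    input_path: str
    pages: tuple[int, ...]
    created_at: datetime


def latest_revisions(revisions: list[Revision]) -> list[Revision]:
    """Keep only the newest revision for each (job_id, file_id) pair."""
    newest: dict[tuple[int, int], Revision] = {}
    for rev in revisions:
        key = (rev.job_id, rev.file_id)
        if key not in newest or rev.created_at > newest[key].created_at:
            newest[key] = rev
    # Sort so that the resulting event is deterministic.
    return sorted(newest.values(), key=lambda r: (r.job_id, r.file_id))


def build_event(
    new_job_id: int, tenant: str, bucket: str, revisions: list[Revision]
) -> dict[str, Any]:
    """Build a payload shaped like the example event in this issue."""
    files_data = [
        {
            "revision": rev.revision_id,
            "bucket": bucket,
            "input": {"job_id": rev.job_id},  # the finished job the revision came from
            "input_path": rev.input_path,
            "output_path": None,
            "pages": list(rev.pages),
            "file_id": rev.file_id,
            # "s3_signed_url" is omitted: signing depends on the storage client.
        }
        for rev in latest_revisions(revisions)
    ]
    return {"files_data": files_data, "job_id": new_job_id, "tenant": tenant}
```

With two finished jobs that both annotated file 344, this would yield one `files_data` entry per job, each carrying that job's newest revision ID.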

Users can start both Extraction jobs and Extraction and Annotation jobs. However, for now, we won't implement different behavior for the Annotation part. In the future, we will use the passed revision as the base for the annotation.

Front-end

In the dataset selection screen, we need to add a tab for selecting jobs instead of files. Users can select either files or jobs; we shouldn't allow both to be selected for one job.
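The "files or jobs, not both" rule could also be enforced on the server, together with the finished-jobs check from the back-end list. The sketch below shows one possible shape for that validation. `JobRequestError`, `resolve_source`, and the `job_statuses` mapping are hypothetical names, and grouping datasets with files is an assumption; only the `finished` status value comes from this issue.

```python
from typing import Mapping, Sequence


class JobRequestError(ValueError):
    """Hypothetical error raised when a job creation request is invalid."""


def resolve_source(
    files: Sequence[int],
    datasets: Sequence[str],
    previous_jobs: Sequence[int],
    job_statuses: Mapping[int, str],
) -> str:
    """Return "files" or "jobs" depending on what the request passed."""
    # Assumption: datasets are treated as part of the file-based selection.
    has_files = bool(files) or bool(datasets)
    has_jobs = bool(previous_jobs)

    if has_files and has_jobs:
        raise JobRequestError("select either files/datasets or jobs, not both")
    if not has_files and not has_jobs:
        raise JobRequestError("no files, datasets or jobs were passed")

    if has_jobs:
        # Only finished jobs may be used as the source of a new job.
        not_finished = [
            job_id
            for job_id in previous_jobs
            if job_statuses.get(job_id) != "finished"
        ]
        if not_finished:
            raise JobRequestError(f"jobs are not finished: {not_finished}")
        return "jobs"
    return "files"
```

When the result is `"jobs"`, the passed IDs are what would be stored in `previous_jobs`, while tasks and files are filled from those jobs as usual.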

All other scenarios remain the same. However, when creating a job, the form must send revisions instead of files.

khyurri commented 3 months ago

Tested and accepted