epam / badgerdoc

Add categories to job on annotation commit #841

Closed khyurri closed 3 months ago

khyurri commented 4 months ago

BadgerDoc requires categories to be stored in the jobs table so that they can be displayed in the UI. Originally, BadgerDoc pipelines contained the available categories and assigned them to the job upon creation. However, the current BadgerDoc implementation no longer includes its own pipelines, so a different method for committing categories is needed.

There are several possible approaches to commit categories to a job:

  1. Explicitly during job creation - This approach is used for Annotation or Extraction + Annotation jobs, and we could employ it for Extraction jobs as well.
  2. Explicitly through a BadgerDoc API call from an external pipeline.
  3. Implicitly when the pipeline commits annotation results.

Options 1 and 2 are already available. However, if an annotation commit contains categories that have not been assigned to the job, we hit a bug where those categories are displayed as undefined. In addition, explicit binding of categories to the job remains a requirement for Annotation and Extraction + Annotation jobs, because BadgerDoc needs to know which categories must be displayed during annotation.

This task is to implement the implicit assignment of categories to a job when annotation results are committed.

General Algorithm

When the annotation microservice commits annotation results, it must call the jobs microservice to assign the categories passed in the commit to the job. If any of those categories do not exist under the current tenant, the annotation microservice should not store the annotation results and should return an error instead. The jobs microservice should compute the difference between the categories already assigned to the job and the incoming ones, and then store the merged result. Since the categories in the jobs table are stored as JSONB, the difference must be calculated in Python code. To avoid losing updates from concurrent commits, the job row must be locked for reading and modification until the new value is stored.
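
The steps above could be sketched roughly as follows. This is only an illustration, not the actual BadgerDoc code: the names `validate_categories`, `merge_categories`, `assign_categories_on_commit`, and `UnknownCategoriesError` are hypothetical, categories are assumed to be stored as a list of string IDs, and an in-memory dict guarded by a `threading.Lock` stands in for the jobs table and its row lock. In a real service the lock would more likely be a database row lock (for example `SELECT ... FOR UPDATE`, which SQLAlchemy exposes as `with_for_update()`).

```python
import threading
from typing import Dict, Iterable, List, Set


class UnknownCategoriesError(Exception):
    """Raised when a commit references categories the tenant does not have."""


def validate_categories(incoming: Iterable[str], tenant_categories: Set[str]) -> None:
    # Reject the whole commit if any category is unknown for this tenant.
    missing = sorted(set(incoming) - tenant_categories)
    if missing:
        raise UnknownCategoriesError(f"Unknown categories: {missing}")


def merge_categories(existing: List[str], incoming: Iterable[str]) -> List[str]:
    # Keep the existing order and append only categories not yet assigned.
    seen = set(existing)
    merged = list(existing)
    for category in incoming:
        if category not in seen:
            seen.add(category)
            merged.append(category)
    return merged


# Hypothetical in-memory stand-in for the jobs table: job id -> categories.
JOBS: Dict[int, List[str]] = {1: ["table", "header"]}
# Stand-in for the row lock; one lock for all jobs keeps the sketch short.
JOB_LOCK = threading.Lock()


def assign_categories_on_commit(
    job_id: int, incoming: Iterable[str], tenant_categories: Set[str]
) -> List[str]:
    incoming = list(incoming)
    # Validate before touching the job, so a bad commit changes nothing.
    validate_categories(incoming, tenant_categories)
    # Hold the lock across read, merge, and write to avoid lost updates.
    with JOB_LOCK:
        merged = merge_categories(JOBS[job_id], incoming)
        JOBS[job_id] = merged
    return merged
```

Validation happens before the lock is taken so that an invalid commit fails fast and leaves the job untouched. The read, merge, and write all happen while the lock is held, which is the part that corresponds to locking the row until the new value is stored.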