ai-cfia / ailab-datastore

This is a repo representing the data layer of multiple ailab projects
MIT License
2 stars 0 forks source link

As a dev, I want to enforce non-null constraints on entity fields to prevent records with all null fields #204

Open k-allagbe opened 1 week ago

k-allagbe commented 1 week ago

Description

Context Currently, in our database, most of our entities require at least one field (other than the ID) to be non-null during insertion. We have implemented this behavior through a mix of checks in functions for some entities and database triggers for others. This scattered approach leads to inconsistency and unnecessary complexity across the codebase.

Problem Statement Relying on functions and triggers to enforce non-null checks increases maintenance overhead and creates potential gaps in validation, which could allow unwanted records with all null fields. Implementing a database constraint for this behavior ensures uniform validation directly at the schema level, reducing the chance of errors and simplifying code management. A constraint also provides better performance as it avoids the overhead of trigger execution.

Acceptance Criteria

Francois-Werbrouck commented 5 days ago

Object in the Fertiscan transaction insertion workflow:

I will address each objects. However, please note there used to be ALOT of Not Null constraint in the DB and all of them (except a special few) had to be removed, otherwise there could be edge case creating exceptions.

All timestamp are updated and set via triggers to assure an automatic and reliable value.


Inspection:

erDiagram
  inspection {
    uuid id PK
    boolean verified "default=False"
    timestamp upload_date "default = now()"
    timestamp updated_at "default = now()"
    uuid inspector_id FK "NOT NULL"
    uuid label_info_id FK
    uuid fertilizer_id FK
    uuid sample_id FK
    uuid picture_set_id FK
  }

We could make the picture_set_id and label_info_id NOT NULL. This would help the referencing but would be restraining us in some edge cases


Label Information

erDiagram
  label_information {
    uuid id PK
    string lot_number
    string npk
    string registration_number
    float n
    float p
    float k
    string guaranteed_title_en
    string guaranteed_title_fr
    boolean title_is_minimal
    uuid company_info_id FK
    uuid manufacturer_info_id FK
  }

Everything can be Null and could be at once. The pipeline could pick every data for the Guaranteed_analysis, sub_labels, but all the data in the label_information table could be absent. There needs to be a label_information entity in the DB since it is the parent table of all the other information that needed specialization. This is a real edge case. Making something like even the name not null, would not take account of all the possibilities of our pipeline.


Organizations

erDiagram
  organization_information{
    uuid id PK
    string name 
    string website
    string phone_number
    uuid location_id FK
    boolean edited
  }

  location {
    uuid id PK
    string name
    string address "NOT NULL"
    uuid region_id FK
    uuid owner_id FK
  }

Everything can be null but we are making sure not at once. There are no needs to have an empty organization stored in the DB.

If the given location address is null, we do not create an instance of it in the DB.

Type of checking: If statement in the workflow and insert function and checked individually for the location address later on. No triggers involved


Metrics

erDiagram
    metric{
    uuid id PK
    float value
    boolean edited "default = false"
    ENUM metric_type 
    uuid unit_id FK
    uuid label_id FK
  }

Only either the value or the unit can be null, not both at once. If that is the case, it means the metric is empty of actual data from the Pipeline

Type of checking : If statement in the insertion/ update workflow it skips the value if all is null, it is also checks in the individual metric function if all values are null and raise an exception.

Checking in the workflows and the individual function assures no empty metrics will be created

We could be rigorous and make the metric_type NOT NULL, but it is attributed by the Datastore manually on insertion so the need is very low.


Guaranteed Analysis

erDiagram
  guaranteed{
    uuid id PK
    string read_name
    float value
    string unit
    boolean edited
    int element_id FK
    uuid label_id FK
  }

The GA is received in a list. We iterate over the list, if no GA have been reported by the pipeline, none are inserted. However, if there is an entry in the list, the name, value and unit can be empty, but not all at once. There needs to be at least one of the parameter with a value to be inserted or else we would have an empty record in the DB.

Type of checking: If statement, in the insertion/ update workflow it skips the value if all is null, there is also a check in the individual metric function if all values are null which then raise an exception.

Checking in the workflows and the individual function assures no empty GA rows will be created


Sub labels

Sub label is the only tricky part of the objects since it represent multiple array of text (english & french) all classified under a sub_label_type.

erDiagram
  sub_label {
    uuid id PK
    text text_content_fr "NOT NULL DEFAULT ='' "
    text text_content_en "NOT NULL DEFAULT='' "
    boolean edited "default = false"
    uuid label_id FK "NOT NULL"
    uuid sub_type_id FK "NOT NULL"
  }

For each type in the form, we iterate over both of the array and insert the french and english text value if there is one. If the array length is different for the 2 languages, some records SHOULD only have one of the text_content with data since they are inserting a NULL value.

A Null value in the text_content was/would creating/create the FE problems displaying the arrays and for the Datastore maintaining the logic tying the English and French content.

An issue where @k-allagbe worked on implemented:

-- DROP FUNCTION "fertiscan_0.0.15".check_null_or_empty_sub_label();

CREATE OR REPLACE FUNCTION "fertiscan_0.0.15".check_null_or_empty_sub_label()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
BEGIN
    -- Check if both text_content_fr and text_content_en are NULL or empty
    IF (NEW.text_content_fr IS NULL OR NEW.text_content_fr = '') AND 
       (NEW.text_content_en IS NULL OR NEW.text_content_en = '') THEN
        -- Raise an exception and skip the insertion
        RAISE EXCEPTION 'Skipping insertion because both text_content_fr and text_content_en are NULL or empty';
        RETURN NULL;
    END IF;

    -- Replace NULL with empty strings
    IF NEW.text_content_fr IS NULL THEN
        NEW.text_content_fr := '';
    END IF;

    IF NEW.text_content_en IS NULL THEN
        NEW.text_content_en := '';
    END IF;

    -- Allow the insert if at least one is not NULL or empty
    RETURN NEW;
END;
$function$
;

Is the only instance where we check with a trigger for data. Since it seemed like a special case (there should always be a french and english version on an entry) I did not see problem with it. It also allow us to maintain the logic of not inserting nulls in our loops for the insertion/update workflow. If we want to migrate the usage of this trigger directly into insert_sub_label() it would facilitate the insertion. We'd have to tweak the update a little bit on the other hand (maybe the insert too for unicity).


discussion

Now let me state there are no triggers checking for the data before insertion. I've always been against using such triggers like stated, it adds complexity in flow of the data and could make maintenance more difficult.

I do not know where the impression of checking with triggers comes from since we only have one special case of such usage.


Stated Acceptance Criteria

Implement a database constraint ensuring at least one non-null field (besides ID) for relevant entities.

It is possible to put a constraint on a table to limit a combination of row to all be nulls. However, we'd still check in most cases if all the values are null anyway since the Pipeline can send all Null values in some cases. Adding the constraint would only add a layer of protection. I strongly advise not to remove the if statement since it would create Exceptions left and right.

I see the issue of the if statement not covering the case where a user perform a manual query to insert on the tables, but such case is not happening in all of our work. We use functions to not write redundant code and to make sure the data inserted is correctly formatted. However, adding a table constraint would prevent such case so I'm not against adding constraint and seeing if our tests are still passing.

  • Remove existing checks and triggers that handle this behavior.
  • Ensure uniform enforcement across all entities that require this validation.

Both of these are already the case, if we ignore the Sub_label trigger. If needed, it will be adressed.