ai-cfia / ailab-datastore

This is a repo representing the data layer of multiple ailab projects
MIT License

Implement structure to save Inspection digitalization efficiency #147

Open Francois-Werbrouck opened 2 months ago

Francois-Werbrouck commented 2 months ago

Context

With Fertiscan being an AI-powered solution, we need to quantify the efficiency of the models. To do so, it is necessary to numerically compare the original_dataset with the user-verified data. We will evaluate each inspection with multiple Levenshtein distance scores.
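As a reference for the comparison discussed above, a plain Levenshtein (edit) distance can be sketched in a few lines of Python; this is a generic illustration of the metric, not code from the repo:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein (edit) distance:
    the minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

A score of 0 means the user verified the field without changing it; higher scores mean heavier edits to the model's output.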

TODO

Doc

```mermaid
erDiagram
    inspection_factual {
        uuid inspection_id PK
        uuid inspector_id
        uuid label_info_id
        uuid time_id FK
        uuid sample_id
        uuid company_id
        uuid manufacturer_id
        uuid picture_set_id
        timestamp inspection_date
        json original_dataset
        uuid verification_id
    }

    verification_dimension {
        uuid id PK
        int score
        int label_info_lev_total
        int label_name_lev
        int label_reg_num_lev
        int label_lot_num_lev
        int metrics_lists_modif
        int metrics_lev
        int manufacturer_field_edited
        int manufacturer_lev_total
        int company_field_edited
        int company_lev_total
        int instructions_en_lists_modif
        int instructions_fr_lists_modif
        int instructions_en_lev
        int instructions_fr_lev
        int cautions_en_lists_modif
        int cautions_fr_lists_modif
        int cautions_en_lev
        int cautions_fr_lev
        int guaranteeds_en_lists_modif
        int guaranteeds_fr_lists_modif
        int guaranteeds_en_lev
        int guaranteeds_fr_lev
    }

    inspection_factual ||--|| verification_dimension : "evaluate"
```
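One possible reading of the `*_lists_modif` columns is the number of entries added or removed between the original and verified lists. A minimal sketch of that interpretation (my assumption; the issue does not define these columns precisely):

```python
from collections import Counter

def lists_modif(original: list[str], verified: list[str]) -> int:
    """Count entries added or removed between the two lists,
    i.e. the size of the multiset symmetric difference.
    NOTE: this interpretation of the *_lists_modif columns is an
    assumption, not confirmed in the issue."""
    diff = Counter(original)
    diff.subtract(Counter(verified))
    return sum(abs(n) for n in diff.values())
```

For example, an inspector adding one caution and deleting another would yield a `cautions_*_lists_modif` of 2 under this reading.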
Francois-Werbrouck commented 1 month ago

Current known 'issues' that still need to be resolved:

Francois-Werbrouck commented 1 month ago

Text over 255 characters can't be compared and throws errors.

An avenue was found here if we don't want to partition our data. I'm also facing role/permission issues; I've opened a ticket with the Database Server Admins.

We should find a way to make the `edited` boolean usable.

I've started experimenting with pg_trgm. I still need to find a relevant threshold and decide how to deal with new additions to the arrays.
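For experimenting with thresholds offline, pg_trgm's `similarity()` is roughly the Jaccard overlap of trigram sets, where strings are lowercased and padded with spaces before extracting 3-grams. A rough Python approximation (this mimics, but does not exactly reproduce, Postgres's word-splitting and padding rules):

```python
def trigrams(s: str) -> set[str]:
    """Simplified pg_trgm-style trigram extraction: lowercase the
    string and pad it with two leading and one trailing space.
    (Real pg_trgm also splits on word boundaries.)"""
    s = "  " + s.lower() + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a: str, b: str) -> float:
    """Approximate pg_trgm similarity: shared trigrams divided by
    the total number of distinct trigrams in both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Sweeping this score over real original/verified pairs could help pick a sensible threshold before committing to one in the database.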

ChromaticPanic commented 4 days ago

As a data analyst, I'm not sure of the utility of storing this data. I think this might be an unnecessary increase in database complexity. There are multiple ways to look at data; if we encode this in the database, it would be too cumbersome to try out different evaluation metrics. A lot of algorithms are already implemented in Pandas or R. It's much easier to pull the data and run the needed analytics in Jupyter Notebooks. It would be much faster to iterate on algorithm changes, and much easier to make dashboards to look at trends too.

So this issue specifically is just for Levenshtein distances. I think this is something easy enough to calculate when we want to see this information: run a Jupyter notebook once a week if we need to look at trends. We also do not want to unnecessarily increase our storage footprint. If recent metrics are more important (current model performance), then storing all the extra old data is just wasted space.
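The on-demand approach described above could look like the following (column names and data are hypothetical; `difflib.SequenceMatcher` is used as a stdlib stand-in for a proper Levenshtein library):

```python
import difflib
import pandas as pd

# Hypothetical frame, as if pulled from inspection_factual.
df = pd.DataFrame({
    "original_name": ["Acme Fertilizer 10-10-10", "GrowFast"],
    "verified_name": ["ACME Fertilizer 10-10-10", "GrowFast Plus"],
})

# Compute a similarity score per row on demand — nothing
# precalculated or stored back in the database.
df["name_similarity"] = [
    difflib.SequenceMatcher(None, a, b).ratio()
    for a, b in zip(df["original_name"], df["verified_name"])
]
print(df[["name_similarity"]])
```

Changing the metric is then a one-line edit in the notebook rather than a schema migration plus a backfill.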

There are other metrics we could evaluate on. For example we could have a metric that detects if fields are being swapped.

So one trade-off here is storage vs. runtime compute. This calculation is cheap enough, even on bulk data, that it doesn't make sense to precalculate it and store it in the database.

The other trade-off is flexibility, in terms of having multiple metrics and making changes to them. What happens when we change the schema? Then all the previously precalculated metrics become incomparable to the new ones.

The trade-off in scaling is another consideration. Compute is easy to scale both vertically and horizontally. Our DB instances can scale vertically, but they're not set up for horizontal scaling, and that is much harder to do.