Open Francois-Werbrouck opened 2 months ago
Current known 'issues' that still need to be resolved:
Text over 255 characters can't be compared and throws errors
An avenue was found here if we don't want to partition our data. I'm also facing role/permission issues; I've opened a ticket with the Database Server Admins
We should find a way to make the edited boolean usable
I've started experimenting with pg_trgm. I still need to find a relevant threshold and decide how to deal with new additions to the arrays
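To get a feel for what threshold might be relevant before touching the database, here is a minimal Python sketch that approximates how pg_trgm scores two strings (shared trigrams divided by the union of trigrams, with each word padded by two leading spaces and one trailing space). It is only an approximation for experimenting offline; the function names and the sample strings in the usage note are made up for illustration, and the real extension may differ on non-ASCII input.

```python
def trigrams(text: str) -> set:
    """Approximate pg_trgm's trigram extraction for a string."""
    # Lowercase, and treat every non-alphanumeric character as a word separator.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    grams = set()
    for word in cleaned.split():
        # Each word is padded with two spaces in front and one behind.
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams over the union of trigrams, in [0, 1]."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def passes(a: str, b: str, threshold: float = 0.3) -> bool:
    """True when the pair would be considered similar at this threshold."""
    # 0.3 is the documented default of pg_trgm.similarity_threshold.
    return similarity(a, b) >= threshold
```

Running `similarity("Nitrogen 10%", "Nitrogn 10 %")` on a sample of original/verified pairs and sweeping `threshold` is one way to see where reasonable corrections stop matching.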
As a data analyst, I'm not sure of the utility of storing this data. I think this might be an unnecessary increase in database complexity. There are multiple ways to look at data. If we encode this in the database, it would be too cumbersome to try out different evaluation metrics. A lot of algorithms are already implemented in Pandas or R. It's much easier to pull data and run the needed analytics in Jupyter Notebooks. It would be much faster to iterate on algorithm changes, and much easier to make dashboards to look at trends too.
So this issue specifically is just for Levenshtein distances. I think this is something easy enough to calculate when we want to see this information. Run a Jupyter notebook once a week if we need to look at trends. We also do not want to unnecessarily increase our storage footprint. If recent metrics are more important (current model performance), then storing all the extra old data is just wasted space.
There are other metrics we could evaluate on. For example, we could have a metric that detects whether fields are being swapped.
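As a rough idea of what a swap metric could look like in a notebook, here is a sketch that flags pairs of fields whose original values show up in each other's place after verification. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; the field names, the dict-per-inspection shape, and the 0.9 threshold are assumptions for illustration, not the actual schema.

```python
from difflib import SequenceMatcher
from itertools import combinations


def close(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when two strings are nearly identical."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def find_swapped_fields(original: dict, verified: dict, threshold: float = 0.9) -> list:
    """Return pairs of field names whose values appear to have been swapped."""
    swaps = []
    shared = sorted(set(original) & set(verified))
    for f1, f2 in combinations(shared, 2):
        o1, o2 = original[f1], original[f2]
        v1, v2 = verified[f1], verified[f2]
        if close(o1, o2, threshold):
            # Both fields held nearly the same text, so a swap is undetectable.
            continue
        if close(o1, v2, threshold) and close(o2, v1, threshold):
            swaps.append((f1, f2))
    return swaps


# Hypothetical inspection: the model put the two names in the wrong fields.
original = {"company_name": "Acme Fertilizers", "manufacturer_name": "GreenGrow Inc", "npk": "10-10-10"}
verified = {"company_name": "GreenGrow Inc", "manufacturer_name": "Acme Fertilizers", "npk": "10-10-10"}
swapped = find_swapped_fields(original, verified)
```

A plain per-field Levenshtein score would report two badly wrong fields here, while this check reports a single swap, which is a different kind of model error.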
So the trade-off here is storage vs. runtime compute. This calculation is cheap enough, even on bulk data, that it doesn't make sense to precalculate it and store it in the database.
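To illustrate computing this on demand instead of storing it, here is a minimal pure-Python sketch of the Levenshtein distance plus a normalized per-field score. The dict-per-inspection shape and the function names are assumptions for illustration. Being plain Python, it has no 255-character limit on its inputs.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using the two-row dynamic programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def score_inspection(original: dict, verified: dict) -> dict:
    """Per-field normalized distance for the fields both records share."""
    return {
        field: normalized_distance(original[field], verified[field])
        for field in original.keys() & verified.keys()
    }
```

In a notebook this could be applied row by row to whatever is pulled from the database, and swapping `normalized_distance` for another metric only changes one line.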
The other trade-off is flexibility, in terms of having multiple metrics and making changes to metrics. What happens when we change the schema? Then all the previously precalculated metrics become incomparable to new metrics.
Scaling is another trade-off. Compute is easy to scale vertically and horizontally. While our DB instances can scale vertically, they're not set up to scale horizontally, and doing so is much harder.
Context
With Fertiscan being an AI-powered solution, we need to quantify the efficiency of the models. To do so, it is necessary to numerically compare the original_dataset with the user-verified data. We will evaluate each inspection with multiple Levenshtein distance scores.
TODO
Doc