developmentseed / tecnico-energy-app

https://dashboard-ds-peach.vercel.app/

How to handle metrics classification inconsistencies at time of ingestion #27

Open alukach opened 3 months ago

alukach commented 3 months ago

Our system works by normalizing metrics into unique combinations of category, usage, and source. We look to the metrics_metadata table to determine which fields in the metrics table can be used to represent a given category, usage, and source datapoint for a given scenario.

For example, if we are interested in the total cost for a study, that would be represented by Category=Cost, Usage=, Source= (where a blank value implies all). To determine which field we would render for this data on the baseline scenario, we can look to our metrics_metadata table and find the row that contains a blank scenario column, a category column with a value of Cost, and a blank usage and source column. To find out how any given scenario would affect that value, we would look for a row with a scenario column matching the scenario of interest, a category column with a value of Cost, and a blank usage and source column.
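As a rough sketch of that lookup (only the table and column names `metrics_metadata`, `scenario`, `category`, `usage`, and `source` come from this issue; the data structures and function names are assumptions for illustration), resolving the metrics-table field for a given combination might look like:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetadataRow:
    scenario: str   # "" (blank) for the baseline scenario
    category: str
    usage: str      # "" (blank) implies all usages
    source: str     # "" (blank) implies all sources
    field: str      # column in the metrics table holding this datapoint


def resolve_field(
    rows: list[MetadataRow],
    category: str,
    usage: str = "",
    source: str = "",
    scenario: str = "",
) -> Optional[str]:
    """Return the metrics-table field for a (scenario, category, usage, source)
    combination, or None if no matching metadata row exists."""
    matches = [
        r.field
        for r in rows
        if (r.scenario, r.category, r.usage, r.source)
        == (scenario, category, usage, source)
    ]
    # More than one match is exactly the ambiguity raised in this issue.
    if len(matches) > 1:
        raise ValueError(
            f"Competing metadata rows for {scenario=}, {category=}, {usage=}, {source=}"
        )
    return matches[0] if matches else None


# Total cost for the baseline scenario: Category=Cost, blank usage/source/scenario.
# field = resolve_field(metadata_rows, category="Cost")
```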

However, we are noticing some data inconsistencies when reviewing Municipal Data v4:

  1. There is no row for the baseline scenario where category=Cost, usage=, source=.
  2. There are multiple rows for each scenario where category=Cost, usage=, source=.

This brings up the following questions:

  1. How to handle situations where a scenario provides a category, usage, source combination that doesn't exist in the baseline scenario? Possible solutions: ignore, throw an error
  2. How to handle situations where there are multiple rows within a given scenario with competing combinations of category, usage, and source? Possible solutions: accept first, throw an error
yellowcap commented 3 months ago

If it is not difficult to throw an error, I think that would be better. In these cases we could disable the study until the ingestion is complete and successful.

I would vote for always assuming the data is complete and consistent, and failing if it is not. The user will then have to edit their data until the ingestion works. This is essentially validation at ingestion time. Otherwise there is a high risk of ingesting unexpected data that might be inconsistent without the user noticing.
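A minimal sketch of that fail-fast validation at ingestion time, assuming the same `MetadataRow` shape as above (the two checks mirror the two questions raised in this issue; `IngestionError` and the function name are hypothetical):

```python
def validate_metadata(rows: list[MetadataRow]) -> list[str]:
    """Collect validation errors so the whole ingestion can be rejected at once."""
    errors: list[str] = []
    seen: set[tuple[str, str, str, str]] = set()
    baseline_combos: set[tuple[str, str, str]] = set()

    for r in rows:
        key = (r.scenario, r.category, r.usage, r.source)
        if key in seen:
            errors.append(f"Duplicate metadata row for {key}")
        seen.add(key)
        if r.scenario == "":
            baseline_combos.add((r.category, r.usage, r.source))

    # Every scenario-specific combination should also exist in the baseline.
    for r in rows:
        if r.scenario and (r.category, r.usage, r.source) not in baseline_combos:
            errors.append(
                f"Scenario '{r.scenario}' defines ({r.category!r}, {r.usage!r}, "
                f"{r.source!r}) but the baseline does not"
            )
    return errors


# errors = validate_metadata(metadata_rows)
# if errors:
#     raise IngestionError(errors)  # reject/disable the study until the data is fixed
```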

I will ask Ricardo to review these cases in the municipal data.