alan-turing-institute / CROP

CROP is a Research Observation Platform
MIT License

Test ideas for data quality control #145

Open myyong opened 2 years ago

myyong commented 2 years ago

This issue records the quality control issues with the data that we have identified while running models trained on tee_208.RDS on new data.

These should be converted into test cases for checking data quality before running predictions.

  1. Duplicate values
  2. Missing/late values
  3. Reads columns in different order
  4. Returning only date when number of days requested > 270
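
As a rough illustration of how items 2 and 3 might become test cases, here is a minimal Python sketch. It is not taken from the CROP codebase. The function names, the expected column list, and the 30-minute reading interval are all assumptions made for the example and would need to be checked against the real schema.

```python
from datetime import datetime, timedelta

# Hypothetical expected column order; the real schema may differ.
EXPECTED_COLUMNS = ["timestamp", "electricity_consumption", "sensor_id"]


def columns_in_expected_order(columns, expected=EXPECTED_COLUMNS):
    """Return True only if the columns match the expected names and order."""
    return list(columns) == list(expected)


def find_missing_timestamps(timestamps, step=timedelta(minutes=30)):
    """Return timestamps absent from a regular grid between the first and last reading.

    The 30-minute default step is an assumption, not a confirmed sensor setting.
    """
    if not timestamps:
        return []
    present = set(timestamps)
    missing = []
    current = min(present)
    last = max(present)
    while current <= last:
        if current not in present:
            missing.append(current)
        current += step
    return missing
```

A data-quality test could then assert that `find_missing_timestamps` returns an empty list and that `columns_in_expected_order` is true before any prediction is run.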

myyong commented 2 years ago

Duplicated values

The duplicated rows are visible in utc_energy_data:

SELECT timestamp, electricity_consumption, sensor_id, COUNT(*)
FROM utc_energy_data
WHERE sensor_id = 16
GROUP BY timestamp, electricity_consumption, sensor_id
HAVING COUNT(*) > 1
ORDER BY timestamp DESC;

The duplicated values run from 2021-05-31 00:30:00 to 2021-06-02 00:00:00.
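
The same check could be expressed as a pre-prediction test on rows already loaded into Python. The sketch below mirrors the GROUP BY / HAVING logic of the query above. The function name and the tuple layout (timestamp, electricity_consumption, sensor_id) are assumptions for illustration, not part of the existing code.

```python
from collections import Counter


def find_duplicate_rows(rows):
    """Return {row: count} for every row that appears more than once.

    Each row is assumed to be a hashable tuple of
    (timestamp, electricity_consumption, sensor_id).
    """
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}
```

A test would pass when `find_duplicate_rows(rows)` returns an empty dict. A non-empty result lists the offending rows together with how often each occurs.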