antonroman / smart_meter_data_analysis

This repository contains all the code developed to analyze the smart meter data with HTM and LSTM

Search for N/A or missing values in S02 and S05 time-series and find the optimal approach to fill them #3

Closed antonroman closed 3 years ago

antonroman commented 3 years ago

We should check the integrity of the provided data. It is useful for two reasons:

To complete the N/A values there are different strategies:

First, please check whether there are many N/A or missing values, and then we'll decide what to do.

We could even compare different approaches to filling the missing data if there is a relevant number of corrupt rows. The best approach would be the one that yields the best forecast performance from the model.
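One way to compare fill strategies before involving the forecasting model is to hide some known readings, fill them back in, and measure the error. This is only a proxy, not the forecast-performance comparison described above, and the sketch is purely illustrative: the function names, the synthetic hourly series, and the choice of the two strategies are all assumptions.

```python
import numpy as np

def fill_previous_day(values, period=24):
    """Copy the value from `period` samples earlier into each NaN slot."""
    out = values.copy()
    for i in range(period, len(out)):
        if np.isnan(out[i]):
            out[i] = out[i - period]
    return out

def fill_linear(values):
    """Linearly interpolate NaN slots between their known neighbours."""
    out = values.copy()
    idx = np.arange(len(out))
    known = ~np.isnan(out)
    out[~known] = np.interp(idx[~known], idx[known], out[known])
    return out

def masked_mae(series, fill_fn, n_hidden=50, seed=0):
    """Hide n_hidden known points, fill them, and return the mean absolute error."""
    rng = np.random.default_rng(seed)
    # Skip the first day so the previous-day strategy always has a source value.
    hidden = rng.choice(np.arange(24, len(series)), size=n_hidden, replace=False)
    damaged = series.copy()
    damaged[hidden] = np.nan
    filled = fill_fn(damaged)
    return float(np.nanmean(np.abs(filled[hidden] - series[hidden])))

# Synthetic hourly load with a daily cycle, standing in for real meter data.
hours = np.arange(24 * 14)
load = 1.0 + 0.5 * np.sin(2 * np.pi * hours / 24)

scores = {
    "previous_day": masked_mae(load, fill_previous_day),
    "linear": masked_mae(load, fill_linear),
}
```

On a perfectly periodic series like this one, the previous-day copy wins trivially. On real meter data the ranking may differ, and the final choice should still be checked against forecast performance.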

On the other hand, it is worth checking this paper (https://www.sciencedirect.com/science/article/pii/S2352467720303003), as it seems to explain how to deal with this problem. I'll try to read it as well before our meeting.

Thanks, great job!!

antonroman commented 3 years ago

Another interesting read on the topic: https://ieeexplore.ieee.org/document/7858189?reload=true

gbarreiro commented 3 years ago

I've created this script to search for NA values and, apparently, I haven't found any. That's good news, of course, but I will show you the script at our next meeting anyway so you can check that it's actually right. In the meantime, I won't close the issue, at least until we have verified together that the script does its job correctly.
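For reference, a check of this kind could look like the sketch below. It is not the script mentioned above: the column names `timestamp` and `value`, the hourly step, and the function name are assumptions made for illustration. It counts NaN readings as well as timestamps that are absent altogether, since a missing row would not show up as an NA value.

```python
import pandas as pd

def integrity_report(df, time_col="timestamp", value_col="value",
                     step=pd.Timedelta(hours=1)):
    """Count NaN readings and timestamps absent from the expected regular grid."""
    times = pd.to_datetime(df[time_col])
    # Every timestamp we would expect between the first and last reading.
    expected = pd.date_range(times.min(), times.max(), freq=step)
    missing_times = expected.difference(pd.DatetimeIndex(times))
    return {
        "nan_values": int(df[value_col].isna().sum()),
        "missing_timestamps": len(missing_times),
        "duplicated_timestamps": int(times.duplicated().sum()),
    }

# Toy frame: one NaN reading and one hour (03:00) missing entirely.
sample = pd.DataFrame({
    "timestamp": ["2020-01-01 00:00", "2020-01-01 01:00",
                  "2020-01-01 02:00", "2020-01-01 04:00"],
    "value": [0.4, None, 0.6, 0.5],
})
report = integrity_report(sample)
```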

antonroman commented 3 years ago

The script looks fine. The result makes sense: the data is obtained from the Deicom APIs, so it may already have been processed at some earlier point. In any case, it is good news. If we do get any NaN among the load values, we could use a function like this to fill each one with the previous day's load value for the same time (this would be for S02):

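from numpy import isnan

# Fill missing values in place with the value at the same time one day earlier.
# Assumes hourly rows (24 per day) with no missing rows.
def fill_missing(values):
    one_day = 24
    # Start at row 24: earlier rows have no previous day, and a negative
    # index would silently wrap around to the end of the array.
    for row in range(one_day, values.shape[0]):
        for col in range(values.shape[1]):
            if isnan(values[row, col]):
                values[row, col] = values[row - one_day, col]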
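
If the data is held in a pandas Series with a DatetimeIndex, a similar fill can be sketched by matching on the timestamp instead of a fixed row offset, so it stays correct even if some rows are missing. The function name and the toy data are assumptions for illustration. Unlike the loop above, it does not cascade: a gap whose previous-day reading is also missing stays NaN.

```python
import numpy as np
import pandas as pd

def fill_from_previous_day(series):
    """Fill NaNs with the reading taken exactly 24 hours earlier, matched by timestamp."""
    # Shift the index forward by one day so each value lines up with the next day's slot.
    previous_day = series.shift(freq=pd.Timedelta(days=1))
    return series.fillna(previous_day)

# Two days of hourly toy data with one gap on the second day.
index = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
load = pd.Series(np.arange(48, dtype=float), index=index)
load.iloc[30] = np.nan

filled = fill_from_previous_day(load)
```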

antonroman commented 3 years ago

Once you have checked this for both the S02 and S05 files, feel free to close the issue. Good job! :-)

gbarreiro commented 3 years ago

I did, so I'm closing the issue.