antonroman / smart_meter_data_analysis

This repository contains all the code developed to analyze smart meter data with HTM and LSTM.

generate CSV files at different aggregation levels #5

Open antonroman opened 3 years ago

antonroman commented 3 years ago

We would need to create 10 CSV files aggregating the values from 10%, 20%... 100% of the meters, for both the S02 and S05 datasets, so we will have 20 CSV files in total. We have to aggregate the values which correspond to the same sensing time.

We will use them later with forecasting models to check how the MAPE improves with the aggregation level (at least in theory).
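
For reference, MAPE could be computed with a small helper like this (a hypothetical function, not something from the repo):

```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))
```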

gbarreiro commented 3 years ago

Should I add the meter ID to the aggregated CSV files?

antonroman commented 3 years ago

Not needed: for the forecasting experiment the meter ID is not relevant. Good question anyway :-)

gbarreiro commented 3 years ago

We have to aggregate the values which correspond to the same sensing time.

If I have understood properly, this means that if we have a file A with data from 1/1/2019 to 1/5/2020, a file B with data from 1/3/2019 to 1/8/2020 and a file C with data from 1/1/2019 to 1/2/2020, we should filter out the records before 1/3/2019 and after 1/2/2020, so that for each day we have data from the same number of meters, right? In that case, we have a problem: the minimum date of some meters is later than the maximum date of others, meaning that the intersection between the timestamp windows of a 10%, 20% or whatever aggregation may be empty. How can we handle this situation? In case I misunderstood what you meant, please correct me.
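
For illustration, the intersection described here is just the latest start date and the earliest end date across the selected files; a minimal sketch using the dates from the example above (read as day/month/year):

```python
import pandas as pd

# The three hypothetical files from the example above.
ranges = {
    "A": ("2019-01-01", "2020-05-01"),
    "B": ("2019-03-01", "2020-08-01"),
    "C": ("2019-01-01", "2020-02-01"),
}

# The common window is the latest start date and the earliest end date.
start = max(pd.Timestamp(lo) for lo, _ in ranges.values())
end = min(pd.Timestamp(hi) for _, hi in ranges.values())

if start >= end:
    print("Empty intersection: no timestamp is covered by every meter.")
else:
    print(f"Common window: {start:%Y-%m-%d} to {end:%Y-%m-%d}")
```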

antonroman commented 3 years ago

Ok, I understand. I wrongly assumed that all the meter files would cover the same period.

It would be great if we could identify the set of files with the biggest overlapping period. I shared a Colab notebook with you that could help. You need to generate a CSV file with the start and end dates of the time series included in each meter file.

We should try to get the highest number of samples for as long a period as possible. We can start with 365 days and then check longer periods. When we get the list of meters which match this long period, we can generate the aggregation CSV files. Thanks!
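
A sketch of that scan, assuming the generated CSV has hypothetical columns meter_id, start and end:

```python
import pandas as pd

# Hypothetical CSV with one row per meter: meter_id,start,end
spans = pd.read_csv("meter_spans.csv", parse_dates=["start", "end"])
spans["days"] = (spans["end"] - spans["start"]).dt.days

# Count how many meters cover at least each candidate period length.
for period in range(360, 721, 30):
    count = int((spans["days"] >= period).sum())
    print(f"Period of {period} days: {count} meters")
```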

gbarreiro commented 3 years ago

I have tried your notebook with the generated CSV input file, and I have calculated the number of meters which have a minimum period of x days, for x ranging from 360 to 720. Here are the results with a step of 30:

Period of 360 days: 15946 meters
Period of 390 days: 15814 meters
Period of 420 days: 15801 meters
Period of 450 days: 15772 meters
Period of 480 days: 15750 meters
Period of 510 days: 15746 meters
Period of 540 days: 15689 meters
Period of 570 days: 15621 meters
Period of 600 days: 15215 meters
Period of 630 days: 14932 meters
Period of 660 days: 0 meters
Period of 690 days: 0 meters
Period of 720 days: 0 meters

And the results with a step of 5 between 630 and 660:

Period of 630 days: 14932 meters
Period of 635 days: 14829 meters
Period of 640 days: 14805 meters
Period of 645 days: 14554 meters
Period of 650 days: 14139 meters
Period of 655 days: 62 meters
Period of 660 days: 0 meters

From these results, I think the best option would be to opt for 650 days, since we would keep 14139 meters, that is, 87.6% of the total. What do you think?

antonroman commented 3 years ago

Yes, sounds good to me, good job :-) Glad to see the algorithm works as well.

gbarreiro commented 3 years ago

I have created the S02 and S05 aggregations with all the meters that have data from 1-1-2020 to 31-12-2020. I was not able to upload the CSV files since they are far too large, but I have uploaded the Python script to generate them: https://github.com/antonroman/smart_meter_data_analysis/blob/master/data_processing/aggregate_reactive_values.py

Later this week I will try to see if it's possible to widen the time window while not losing a significant amount of meters.

antonroman commented 3 years ago

Thanks Guillermo,

I extended the algorithm a bit (I have already pushed the update to the repo). These are the number of meters and the start date for each period length:

Max duration: 656 -> Longest duration in days found in the CSV files.
Period of 360 days: 15701 meters starting at 2020-03-04 23:00:00
Period of 365 days: 15699 meters starting at 2020-02-28 23:00:00
Period of 370 days: 15698 meters starting at 2020-03-01 23:00:00
Period of 375 days: 15695 meters starting at 2020-02-28 23:00:00
Period of 380 days: 15691 meters starting at 2020-02-23 23:00:00
Period of 385 days: 15688 meters starting at 2020-02-18 23:00:00
Period of 390 days: 15680 meters starting at 2020-02-15 23:00:00
Period of 395 days: 15677 meters starting at 2020-02-08 23:00:00
Period of 400 days: 15675 meters starting at 2020-02-05 23:00:00
Period of 405 days: 15669 meters starting at 2020-01-28 23:00:00
Period of 410 days: 15666 meters starting at 2020-01-25 01:00:00
Period of 415 days: 15663 meters starting at 2020-01-18 23:00:00
Period of 420 days: 15661 meters starting at 2020-01-14 14:00:00
Period of 425 days: 15657 meters starting at 2019-11-28 01:00:00
Period of 430 days: 15656 meters starting at 2019-11-28 01:00:00
Period of 435 days: 15656 meters starting at 2019-11-28 01:00:00
Period of 440 days: 15655 meters starting at 2019-11-28 01:00:00
Period of 445 days: 15655 meters starting at 2019-11-28 01:00:00
Period of 450 days: 15654 meters starting at 2019-11-28 01:00:00
Period of 455 days: 15653 meters starting at 2019-11-24 23:00:00
Period of 460 days: 15653 meters starting at 2019-11-28 01:00:00
Period of 465 days: 15653 meters starting at 2019-11-28 01:00:00
Period of 470 days: 15652 meters starting at 2019-11-24 23:00:00
Period of 475 days: 15620 meters starting at 2019-11-22 23:00:00
Period of 480 days: 15616 meters starting at 2019-11-17 23:00:00
Period of 485 days: 15610 meters starting at 2019-11-08 09:00:00
Period of 490 days: 15608 meters starting at 2019-11-01 23:00:00
Period of 495 days: 15606 meters starting at 2019-10-30 01:00:00
Period of 500 days: 15598 meters starting at 2019-10-25 11:00:00
Period of 505 days: 15588 meters starting at 2019-10-19 22:00:00
Period of 510 days: 15582 meters starting at 2019-10-14 22:00:00
Period of 515 days: 15577 meters starting at 2019-10-12 22:00:00
Period of 520 days: 15572 meters starting at 2019-10-04 22:00:00
Period of 525 days: 15570 meters starting at 2019-09-30 00:00:00
Period of 530 days: 15567 meters starting at 2019-09-27 00:00:00
Period of 535 days: 15551 meters starting at 2019-09-23 00:00:00
Period of 540 days: 15538 meters starting at 2019-09-19 00:00:00
Period of 545 days: 15524 meters starting at 2019-09-08 22:00:00
Period of 550 days: 15520 meters starting at 2019-09-05 00:00:00
Period of 555 days: 15499 meters starting at 2019-09-04 00:00:00
Period of 560 days: 15134 meters starting at 2019-08-11 22:00:00
Period of 565 days: 15133 meters starting at 2019-08-16 00:00:00
Period of 570 days: 15133 meters starting at 2019-08-16 00:00:00
Period of 575 days: 15131 meters starting at 2019-08-11 22:00:00
Period of 580 days: 15127 meters starting at 2019-08-04 00:00:00
Period of 585 days: 15126 meters starting at 2019-08-01 22:00:00
Period of 590 days: 15116 meters starting at 2019-07-31 00:00:00
Period of 595 days: 14855 meters starting at 2019-07-17 22:00:00
Period of 600 days: 14854 meters starting at 2019-07-12 22:00:00
Period of 605 days: 14851 meters starting at 2019-07-04 22:00:00
Period of 610 days: 14848 meters starting at 2019-07-04 22:00:00
Period of 615 days: 14842 meters starting at 2019-07-04 22:00:00
Period of 620 days: 14756 meters starting at 2019-06-28 14:00:00
Period of 625 days: 14727 meters starting at 2019-06-25 22:00:00
Period of 630 days: 14578 meters starting at 2019-06-20 22:00:00
Period of 635 days: 14473 meters starting at 2019-06-15 22:00:00
Period of 640 days: 14154 meters starting at 2019-06-10 22:00:00
Period of 645 days: 14033 meters starting at 2019-06-03 22:00:00
Period of 650 days: 12832 meters starting at 2019-06-01 00:00:00
Period of 655 days: 35 meters starting at 2019-05-26 23:00:00

590 days could be an optimal duration, since the number of meters is still high and the algorithm "can learn" from the values of the previous year, at least in the test dataset.

Regarding the aggregation: I'm sorry, there was a misunderstanding. You can keep the files and the script, since they can be useful for some future analysis, but what I really meant was to sum, at each sensing time, the values from the time series of all the meters. My apologies for the confusion. This way the file should be the same size as the original ones and we can run the simulation using it.
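
A minimal sketch of the intended aggregation, combined with the 10%... 100% subsets from the start of the thread (the file layout and column names are assumptions; the actual implementation lives in data_processing/aggregate_reactive_values.py):

```python
import glob

import pandas as pd

# Hypothetical layout: one CSV per meter with 'timestamp' and 'value' columns.
meter_files = sorted(glob.glob("s02/*.csv"))

for pct in range(10, 101, 10):
    subset = meter_files[: len(meter_files) * pct // 100]
    merged = pd.concat(
        pd.read_csv(f, parse_dates=["timestamp"]) for f in subset
    )
    # One row per sensing time: sum across meters instead of concatenating,
    # so each output file has the same length as a single meter's series.
    total = merged.groupby("timestamp")["value"].sum()
    total.to_csv(f"s02_aggregated_{pct}pct.csv")
```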

Please, let me know if you need a short call today to clarify this. Thanks again for your work, I really appreciate it.

gbarreiro commented 3 years ago

Oh, what a misunderstanding! Of course, it seems more logical to sum the data from all the meters, since the generated CSV file will be much smaller and the plot rendering will be much faster. I will change the scripts to fix that behavior, and then I will extend the code to extract the data for the 15116 meters that have a common time window of 590 days.

gbarreiro commented 3 years ago

Another question: to avoid having invalid plots in issue #4, should I remove outliers during the aggregation process?

antonroman commented 3 years ago

You can try first without removing them, and if they have no noticeable impact you can leave them.

In any case, if it is a quick change, I would appreciate it if you could remove the outliers.

gbarreiro commented 3 years ago

I have updated the code to aggregate the meters correctly, and the speed of the script has improved drastically. Moreover, the generated CSV files are much smaller, so I was able to upload them to the repository.

Unfortunately, and as expected, the charts still show those ugly peaks due to outliers, so my next step will be to upgrade the script to detect and remove them. My question here is: if I find an outlier for a date X in a meter Y, should I exclude from the aggregation just the data for date X from meter Y, or should I exclude all the data from meter Y?

antonroman commented 3 years ago

Hi Guillermo,

If the number of meters with outliers is small, please keep them in a separate list and remove them from the aggregation. If the number is higher (>100), then just remove the outliers and fill each removed value with the reading at the same time on the previous day.
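
A sketch of the fallback fill for the second case, assuming an hourly series with a DatetimeIndex and a precomputed boolean outlier mask (both hypothetical):

```python
import pandas as pd

def fill_outliers_with_previous_day(series: pd.Series, is_outlier: pd.Series) -> pd.Series:
    """Replace flagged readings with the value at the same time the day before."""
    # Shift the index forward one day, so previous_day[t] holds series[t - 1 day].
    previous_day = series.shift(freq=pd.Timedelta(days=1))
    filled = series.copy()
    filled[is_outlier] = previous_day.reindex(series.index)[is_outlier]
    return filled
```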

Thanks!

antonroman commented 3 years ago

Still pending: remove the meters which have more than 0.5% outlier values (>40 for a one-year period).
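
For reference, a one-year hourly series has 8760 samples, so 0.5% is about 44 values, consistent with the >40 figure; a sketch of the filter with hypothetical outlier counts:

```python
n_samples = 8760  # one year of hourly readings
max_outliers = int(0.005 * n_samples)  # 0.5% of a year -> 43 values

# Hypothetical mapping from meter ID to its number of outlier samples.
outlier_counts = {"meter_001": 12, "meter_002": 57, "meter_003": 0}

kept = [m for m, c in outlier_counts.items() if c <= max_outliers]
dropped = [m for m, c in outlier_counts.items() if c > max_outliers]
print(f"Keeping {len(kept)} meters, dropping {dropped}")
```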