OdysseasKr / neural-disaggregator

Code for NILM experiments using Neural Networks. Uses Keras/Tensorflow and the NILMTK.
MIT License

Metrics (precision, recall, F1-Score) #10

Closed anasvaf closed 5 years ago

anasvaf commented 5 years ago

Hello Odyssea,

When I run the GRUWithWindow code (python3 experiment.py kettle) from the https://github.com/OdysseasKr/online-nilm repository, I get the following numbers for the metrics after training the kettle appliance on 4 houses and testing on the last house.

Epoch 00001: saving model to experiments/kettle/CHECKPOINT-kettle-1epochs.hdf5
Epoch 2/3
800000/800000 [==============================] - 722s 902us/step - loss: 0.0039

Epoch 00002: saving model to experiments/kettle/CHECKPOINT-kettle-2epochs.hdf5
Epoch 3/3
800000/800000 [==============================] - 698s 873us/step - loss: 0.0039

Epoch 00003: saving model to experiments/kettle/CHECKPOINT-kettle-3epochs.hdf5
/home/anasvaf/Desktop/MultiLabel_Energy/GRUWithWindow/metrics.py:40: RuntimeWarning: invalid value encountered in true_divide
  return tp/float(tp+fp)
============ Recall: 0.0
============ Precision: nan
============ Accuracy: 0.9962941517081645
============ F1 Score: nan
============ Relative error in total energy: 0.012862401778851902
============ Mean absolute error(in Watts): 21.46095589801764

I can confirm also the same exact numbers when I run the RNN code using the https://github.com/OdysseasKr/neural-disaggregator repository.

I am using Python 3.5 with the following library versions: numpy 1.15.2, matplotlib 3.0.0, pandas 0.23.4, pytables 3.4.4, nilmtk 0.2, keras 2.2.4.

Could you help me with that?

Thanks in advance!

-Tasos

OdysseasKr commented 5 years ago

Hello Tasos! The first 4 metrics count the "activations" (i.e. the number of times that the appliance was turned on) in order to compute the result. Activations are detected using an appliance-specific threshold: if the output signal is higher than the threshold, the device is considered "active" or "on". For example, if the activation threshold for the fridge is 60 Watts, the fridge is considered active whenever the output signal goes higher than 60 Watts.

For the activation detection, the on_power_threshold() method from NILMTK is used.

In my experiments, whenever this happened, it meant that there were no activations, which hints that your output signal never exceeded the threshold. Please inspect the output of your network; you may have to train your model longer before you manage to get results.
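The behaviour above, including the NaN precision in the traceback, can be sketched roughly as follows. This is an illustrative reimplementation, not the repository's actual metrics.py; the function name and the 60 W threshold are assumptions for the example (in the real code the threshold comes from NILMTK's on_power_threshold()):

```python
import numpy as np

def activation_metrics(y_true, y_pred, threshold=60.0):
    """Binarize power signals at an appliance-specific threshold and
    compute activation-based precision, recall and F1.
    `threshold` is illustrative, e.g. roughly 60 W for a fridge."""
    on_true = y_true > threshold
    on_pred = y_pred > threshold

    tp = np.sum(on_true & on_pred)   # correctly detected "on" samples
    fp = np.sum(~on_true & on_pred)  # predicted "on" while actually "off"
    fn = np.sum(on_true & ~on_pred)  # missed "on" samples

    # If the prediction never crosses the threshold, tp + fp == 0 and
    # precision is 0/0 -- this is exactly the "invalid value encountered
    # in true_divide" warning and the NaN precision seen in the thread.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    if np.isnan(precision) or np.isnan(recall) or (precision + recall) == 0:
        f1 = float("nan")
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With an all-zero prediction this yields recall 0.0 and NaN precision/F1, matching the output posted above.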

anasvaf commented 5 years ago

Thanks for the prompt response, Odyssea!

My concern is that the parts that I have added to your code to make it able to train across buildings are probably not correct. Is it ok with you to double check it if I paste it here?

OdysseasKr commented 5 years ago

I cannot really check your code right now. However, if you post a plot of the output along with the ground truth data, we can probably see whether it's a programming error or poor training.
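A minimal sketch of such a diagnostic plot, assuming the predicted and ground-truth power series are already aligned 1-D arrays (the function name, arguments, and output path are all illustrative):

```python
import matplotlib.pyplot as plt

def plot_prediction(y_true, y_pred, threshold=None, path="output.png"):
    """Plot ground truth vs network output on the same axes, optionally
    with the 'on' threshold, to help distinguish a programming error
    from undertraining."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(y_true, label="ground truth")
    ax.plot(y_pred, label="prediction")
    if threshold is not None:
        # If the prediction never crosses this line, the activation
        # metrics will report 0 recall and NaN precision.
        ax.axhline(threshold, linestyle="--", label="on threshold")
    ax.set_xlabel("sample")
    ax.set_ylabel("power (W)")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

If the prediction curve sits entirely below the dashed threshold line, the problem is undertraining rather than the metrics code.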

anasvaf commented 5 years ago

I will do that. Thanks a lot for all the help!

maechler commented 5 years ago

I am getting a similar result with DAEDisaggregator when I train the dish washer on buildings 1-5 of the REDD dataset and then evaluate on building 6.

============ Recall: 0.0
============ Precision: nan
============ Accuracy: 0.2995691297390533
============ F1 Score: nan
============ Relative error in total energy: 0.837433614267287
============ Mean absolute error(in Watts): 0.45369448176799543

[image: plot of the disaggregation output vs ground truth]

When I only train on one building and evaluate on it, I do not get the NaN errors, but the model still does not predict anything, as my recall stays around 0.

@anasvaf Were you able to solve this? @OdysseasKr Do you think this is a programming error? How much data would be needed to get decent results?

OdysseasKr commented 5 years ago

Hi @maechler. Looking at your graph, I am guessing that your prediction is the blue line and that its value is always below the "on" threshold. This means that your model predicts that the device is never turned on, which is what causes the NaN values and the 0 recall.

maechler commented 5 years ago

@OdysseasKr Thank you very much for your answer! Taking a closer look at the diagram, I realised that it is also very strange that the orange graph (ground truth) has its maximum at 4 W. This is not the case when I only train on one building:

[image: ground truth plot when training on a single building]

Have you been able to successfully use the train_across_buildings method?

I tried some of the ukdale-test.py and redd-test.py scripts in this repository, but I never got decent results. I think an additional problem could be that the training data is too sparse. The loss is already very low from the beginning, probably because the neural net learned to predict 0 all the time. I think Jack Kelly in his experiments used a form of artificially generated training data that he computed after extracting all the activations of an appliance. Such training data contains many more occurrences of the trained appliance.
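A rough sketch of that augmentation idea: paste real, extracted appliance activations into background mains windows so that the appliance appears far more often than in the raw data. All names and parameters below are illustrative assumptions for the sketch, not Kelly's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_synthetic_windows(activations, background, window_len=600,
                           p_appliance=0.5, n_windows=1000):
    """Build (mains, target) training windows by superimposing real
    appliance activations onto background mains.
    `activations`: list of 1-D power arrays, one per extracted activation.
    `background`: long 1-D mains array without the target appliance.
    """
    X, y = [], []
    for _ in range(n_windows):
        start = rng.integers(0, len(background) - window_len)
        mains = background[start:start + window_len].astype(float)
        target = np.zeros(window_len)
        # With probability p_appliance, insert one activation at a
        # random offset; otherwise the window stays appliance-free.
        if rng.random() < p_appliance:
            act = activations[rng.integers(len(activations))][:window_len]
            offset = rng.integers(0, window_len - len(act) + 1)
            target[offset:offset + len(act)] = act
            mains[offset:offset + len(act)] += act
        X.append(mains)
        y.append(target)
    return np.stack(X), np.stack(y)
```

This tackles the sparsity problem directly: instead of a loss that is already near zero because "always predict 0" is almost correct, roughly half the windows now contain a genuine activation.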

It would be really nice if you could share how you managed to get good results.

Any help would really be appreciated!

OdysseasKr commented 5 years ago

Personally, I found it harder to train across buildings because there are often differences in the way the same device operates in different buildings. Indeed, the data is very sparse, and this is a problem for NILM in general.

It would be really nice if you could share how you managed to get good results

Are you referring to some results in particular?

In general, I would suggest taking a look at the following papers, if you haven't done so already:

maechler commented 5 years ago

Thanks very much for your response! I understand, but for me training on only one building does not work well either.

Are you referring to some results in particular?

I am looking for the optimal parameters to let the example scripts redd-test.py or ukdale-test.py run and produce good results. When I let them run as they are in this repository, I always get results as mentioned above.

Thanks, I will have a closer look at https://github.com/OdysseasKr/online-nilm, as it seems to be fully configured except for the number of epochs.

maechler commented 5 years ago

For anyone interested in this, I managed to get some non-constant estimations with GRUWithWindow (https://github.com/OdysseasKr/online-nilm) after only 10 epochs of training on the dish_washer appliance.

============ Recall: 0.3191919191919192
============ Precision: 0.6955245781364637
============ Accuracy: 0.9717776491024899
============ F1 Score: 0.43757212093237946
============ Relative error in total energy: 0.09651187438233116
============ Mean absolute error(in Watts): 25.317596060405524

[image: GRUWithWindow dish washer estimation plot]