paulf81 commented 1 week ago

Implement HOGER homogenization

This PR implements the HOGER algorithm for homogenization of yaw/wind direction signals. It detects and removes changes in the calibration of these channels.

HOGER was developed by Paul Poncet (@engie-paul-poncet) and Thomas Duc (@engie-thomas-duc) of Engie, and Rubén González-Lope (@rglope) and Alvaro Gonzalez Salcedo (@alvarogonzalezsalcedo) of CENER within the TWAIN project.

This PR contains a new module northing_offset_change_hoger.py. Also a new example is included 03_northing_calibration_hoger.ipynb. This new example will be included in the online documentation as it shows the combined usage of different yaw/wd calibration modules.

Some notes on connections

This functionality implements something akin to what is proposed in #148 (@ejsimley )

It also could be seen to address Issue #106

It also mirrors some features included in wind-up (@aclerc)

@Bartdoekemeijer , you'll see in the new example how the new module can fit within the current calibration workflow, will be curious for your opinions there.

@engie-paul-poncet, @engie-thomas-duc, @rglope, @alvarogonzalezsalcedo @misi9170 @ejsimley @Bartdoekemeijer @aclerc please use this PR to provide any feedback comments.

Open Questions:

Are we happy with the defaults?
Do we want to continue with #148?

Todo

[x] Agree on defaults
[x] Documentation and credit sufficient?
[x] Make sure all new functions tested
[x] Confirm example works as expected
[x] Confirm documentation builds locally

Related issue, if one exists

148

106

paulf81 commented 1 week ago

@rglope , I have a question in writing the tests, I realized I wasn't clear on the units in threshold. In the discretize function, it defines the minimum distance between nodes to differentiate classes. Here the units is number of indices right? It is also used as the min_samples_split, which would mean the number of points for a leaf right? I think I understand then, the two are consistent, if a leaf must have N points, then it stands to reason, to different knots should be N points apart. Do I have that right? Would you mind to see if you agree with the tests I've added?

rglope commented 1 week ago

@rglope , I have a question in writing the tests, I realized I wasn't clear on the units in threshold. In the discretize function, it defines the minimum distance between nodes to differentiate classes. Here the units is number of indices right? It is also used as the min_samples_split, which would mean the number of points for a leaf right?

That's right. The threshold is the number of indices or registries in the SCADA. In the case of the discretize function, as you say, it sets the distance between the nodes. For the Decision Tree Regressor is considered the same way to create a node (or knot, which is what I call them) and a leaf. The only thing is that to create a leaf, it is needed only half the number of indices.

regr = DecisionTreeRegressor(max_depth=max_depth, min_samples_split=threshold, min_samples_leaf=threshold // 2, ccp_alpha=ccp_alpha)

More information can be found here: https://scikit-learn.org/dev/modules/generated/sklearn.tree.DecisionTreeRegressor.html. The original R function only had a minimun length argument, so during the translation to Python it seemed a good proportion.

I think I understand then, the two are consistent, if a leaf must have N points, then it stands to reason, to different knots should be N points apart. Do I have that right?

As I said, the minimum number of indices to create a leaf is 'threshold // 2', so maybe the argument for the discretize function should also be 'threshold // 2'. That way it would follow the reasoning you write (and the one I was actually going for).

Would you mind to see if you agree with the tests I've added?

I'll try them as soon as I can!

Finally, just an observation about the ccp_alpha argument if it helps. It is a pruning parameter to reduce the complexity of the regression. In the example it was set to a specific value since it made the regression cleaner, but I don't think there is necessarily a default value for every case. More information can be found here: https://scikit-learn.org/1.5/auto_examples/tree/plot_cost_complexity_pruning.html

paulf81 commented 6 days ago

Thanks for these responses @rglope ! I implemented the following change:

As I said, the minimum number of indices to create a leaf is 'threshold // 2', so maybe the argument for the discretize function should also be 'threshold // 2'. That way it would follow the reasoning you write (and the one I was actually going for).

paulf81 commented 6 days ago

ok tests pass and I confirmed docs all build locally. Good to merge for me once a few reviews are in, thanks all!

misi9170 commented 3 days ago

I think we should also add documentation describing the HOGER algorithm a bit (even just a paragraph would help a lot for new users).

paulf81 commented 3 days ago

apparant

Fixed!

paulf81 commented 3 days ago

I think we should also add documentation describing the HOGER algorithm a bit (even just a paragraph would help a lot for new users).

You mean beyond what is in docstring? This will appear in autodocs...

NREL / flasc