paulf81 opened 1 week ago
@rglope, I have a question: in writing the tests, I realized I wasn't clear on the units of `threshold`. In the discretize function, it defines the minimum distance between nodes needed to differentiate classes; here the unit is number of indices, right? It is also used as `min_samples_split`, which would mean the number of points for a leaf, right? I think I understand then that the two are consistent: if a leaf must have N points, then it stands to reason that two different knots should be N points apart. Do I have that right? Would you mind checking whether you agree with the tests I've added?
> @rglope, I have a question: in writing the tests, I realized I wasn't clear on the units of `threshold`. In the discretize function, it defines the minimum distance between nodes needed to differentiate classes; here the unit is number of indices, right? It is also used as `min_samples_split`, which would mean the number of points for a leaf, right?
That's right. The threshold is the number of indices or registries in the SCADA. In the case of the discretize function, as you say, it sets the distance between the nodes. For the DecisionTreeRegressor, it is treated the same way for creating a node (or knot, which is what I call them) and a leaf. The only thing is that creating a leaf requires only half that number of indices:
```python
regr = DecisionTreeRegressor(max_depth=max_depth, min_samples_split=threshold, min_samples_leaf=threshold // 2, ccp_alpha=ccp_alpha)
```
More information can be found here: https://scikit-learn.org/dev/modules/generated/sklearn.tree.DecisionTreeRegressor.html. The original R function only had a minimum-length argument, so during the translation to Python this seemed like a good proportion.
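To make the mapping concrete, here is a minimal, self-contained sketch of how a single threshold (in SCADA indices) drives both knot and leaf creation; the synthetic signal and the values `threshold = 24`, `max_depth = 3`, and `ccp_alpha = 0.0` are invented for illustration and are not taken from the module.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

threshold = 24   # minimum number of SCADA indices to create a knot (illustrative)
max_depth = 3    # illustrative value
ccp_alpha = 0.0  # no pruning in this toy example

# Synthetic signal with one step change, indexed by SCADA registry number
idx = np.arange(200).reshape(-1, 1)
signal = np.where(idx.ravel() < 120, 0.0, 5.0) + np.random.default_rng(0).normal(0, 0.3, 200)

regr = DecisionTreeRegressor(
    max_depth=max_depth,
    min_samples_split=threshold,      # a knot requires `threshold` indices
    min_samples_leaf=threshold // 2,  # a leaf requires only half of them
    ccp_alpha=ccp_alpha,
)
regr.fit(idx, signal)

# Split positions of the fitted tree approximate where the signal changes
knots = regr.tree_.threshold[regr.tree_.feature >= 0]
print(sorted(knots))
```

A coarser `threshold` therefore forces both fewer splits and wider leaves, which is the consistency the question above is getting at.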
> I think I understand then that the two are consistent: if a leaf must have N points, then it stands to reason that two different knots should be N points apart. Do I have that right?
As I said, the minimum number of indices to create a leaf is `threshold // 2`, so maybe the argument for the discretize function should also be `threshold // 2`. That way it would follow the reasoning you describe (and the one I was actually going for).
> Would you mind checking whether you agree with the tests I've added?
I'll try them as soon as I can!
Finally, just an observation about the `ccp_alpha` argument, in case it helps. It is a pruning parameter that reduces the complexity of the regression. In the example it was set to a specific value since it made the regression cleaner, but I don't think there is necessarily a default value that suits every case. More information can be found here: https://scikit-learn.org/1.5/auto_examples/tree/plot_cost_complexity_pruning.html
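For anyone following the link, here is a hedged sketch of how one might inspect candidate `ccp_alpha` values for a given signal (the data and the `min_samples_leaf` value below are invented, not taken from the PR):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Same kind of toy step signal as above
idx = np.arange(200).reshape(-1, 1)
signal = np.where(idx.ravel() < 120, 0.0, 5.0) + np.random.default_rng(0).normal(0, 0.3, 200)

# Effective alphas at which leaves get pruned away; larger alpha -> simpler tree
path = DecisionTreeRegressor(min_samples_leaf=12).cost_complexity_pruning_path(idx, signal)
print(path.ccp_alphas)
print(path.impurities)
```

Picking a value near the knee of that alpha/impurity curve is one way to choose a case-specific `ccp_alpha`, which fits the point that no single default works everywhere.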
Thanks for these responses, @rglope! I implemented the following change:
> As I said, the minimum number of indices to create a leaf is `threshold // 2`, so maybe the argument for the discretize function should also be `threshold // 2`. That way it would follow the reasoning you describe (and the one I was actually going for).
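(For readers skimming the thread, the spirit of the change is sketched below; this is a hypothetical fragment, not the actual diff, and the real discretize call in `northing_offset_change_hoger.py` may look different.)

```python
# Hypothetical illustration only, not code from this PR
threshold = 24                        # minimum indices to create a knot
min_samples_leaf = threshold // 2     # minimum indices to create a leaf
min_knot_separation = threshold // 2  # spacing now used by the discretize step
assert min_samples_leaf == min_knot_separation
```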
OK, tests pass and I confirmed the docs all build locally. Good to merge from my side once a few reviews are in, thanks all!
I think we should also add documentation describing the HOGER algorithm a bit (even just a paragraph would help a lot for new users).
apparant
Fixed!
> I think we should also add documentation describing the HOGER algorithm a bit (even just a paragraph would help a lot for new users).
You mean beyond what is in the docstring? That will appear in the autodocs...
Implement HOGER homogenization
This PR implements the HOGER algorithm for homogenization of yaw/wind direction signals. It detects and removes changes in the calibration of these channels.
HOGER was developed by Paul Poncet (@engie-paul-poncet) and Thomas Duc (@engie-thomas-duc) of Engie, and Rubén González-Lope (@rglope) and Alvaro Gonzalez Salcedo (@alvarogonzalezsalcedo) of CENER within the TWAIN project.
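As a rough conceptual sketch of what "detecting and removing changes in the calibration" can look like (this is not the algorithm as implemented in the new module; the signals, names, and parameter values are invented), picture fitting a piecewise-constant model to the difference between a turbine's wind direction and a reference, then subtracting the per-segment offsets:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 500
reference = rng.uniform(0, 360, n)                # reference wind direction signal
offset = np.where(np.arange(n) < 300, 0.0, 12.0)  # a 12-degree calibration change
turbine_wd = (reference + offset + rng.normal(0, 2, n)) % 360

# Piecewise-constant fit of the wrapped difference over time
diff = (turbine_wd - reference + 180.0) % 360.0 - 180.0
t = np.arange(n).reshape(-1, 1)
tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=25).fit(t, diff)

# Subtract the estimated per-segment offset to homogenize the signal
homogenized = (turbine_wd - tree.predict(t)) % 360.0
```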
This PR contains a new module, `northing_offset_change_hoger.py`. A new example, `03_northing_calibration_hoger.ipynb`, is also included; it will appear in the online documentation as it shows the combined usage of different yaw/wd calibration modules.

Some notes on connections:
- This functionality implements something akin to what is proposed in #148 (@ejsimley)
- It could also be seen to address issue #106
- It also mirrors some features included in wind-up (@aclerc)
@Bartdoekemeijer, you'll see in the new example how the new module can fit within the current calibration workflow; I'll be curious to hear your opinions there.
@engie-paul-poncet, @engie-thomas-duc, @rglope, @alvarogonzalezsalcedo, @misi9170, @ejsimley, @Bartdoekemeijer, @aclerc: please use this PR to provide any feedback or comments.
Open Questions:
Todo
Related issue, if one exists: #148, #106