Closed HealthyPear closed 3 years ago
May be a good reason to look at aict-tools for that part. It already has the input features fully configurable.
https://github.com/fact-project/aict-tools/blob/master/examples/config_energy.yaml
For the output of features generated by `write_dl2`: that will be replaced by the ctapipe DL2Writer (or whatever we call it), and the philosophy will be similar to the DL1 files: always compute and store all parameters, so no configuration should be needed.
> May be a good reason to look at aict-tools for that part. It already has the input features fully configurable.
> https://github.com/fact-project/aict-tools/blob/master/examples/config_energy.yaml
Yes, this issue is of course related to the current implementation provided by `protopipe.mva`; the goal is to make protopipe easier to use from 0.5.0 onwards.
My initial intention is to allow the pipeline to host a number of ML libraries. The only requirements for this would be a common configuration system and at least one common data format (like the pickled files from scikit-learn that we use now).
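As a minimal sketch of the common-data-format idea, assuming plain pickle as the interchange format (the dict below is only a stand-in for a trained scikit-learn estimator):

```python
import io
import pickle

# Stand-in for a trained model; in protopipe this is currently a
# scikit-learn estimator pickled to disk by the training step
model = {"features": ["width", "length"], "weights": [0.3, 0.7]}

buf = io.BytesIO()
pickle.dump(model, buf)      # the training step writes the model
buf.seek(0)
restored = pickle.load(buf)  # the DL2 step reads it back unchanged
```

Any ML backend that can round-trip its models through such a format could then plug into the same pipeline.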
> For the output of features generated by `write_dl2`, that will be replaced by the ctapipe DL2Writer or whatever we call it, and the philosophy will be similar to the DL1 files: compute and store all parameters always, so no configuration should be needed.
Yes, those are no problem; here I am referring to the model features, i.e. the parameters used to train the model(s).
I saw the aict-tools config, but there they use simple, unique DL1/DL2a variables (I have no idea about more complex choices and would need time to play with it; that's why I first want to provide an easy solution with what we currently have).
What I am talking about is handling features "anonymously", so that I do not have to worry about parsing more complex analytical expressions like e.g. `atan2(cog_y - dir_y, cog_x - dir_x)` or `log10(Width*Length/Size)`.
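One way to handle such derived expressions generically is a registry of named callables applied to the dataframe. This is only a sketch: `FEATURE_FUNCS`, `add_features`, and the column names are hypothetical, not part of protopipe.

```python
import numpy as np
import pandas as pd

# Hypothetical registry: feature name -> function of the dataframe.
# Column names (cog_x, width, ...) are assumed for illustration.
FEATURE_FUNCS = {
    "disp_angle": lambda df: np.arctan2(df["cog_y"] - df["dir_y"],
                                        df["cog_x"] - df["dir_x"]),
    "log_wls": lambda df: np.log10(df["width"] * df["length"] / df["size"]),
}

def add_features(df, names):
    """Add the requested derived features as new columns."""
    for name in names:
        df[name] = FEATURE_FUNCS[name](df)
    return df

df = pd.DataFrame({"cog_x": [0.1], "cog_y": [0.2],
                   "dir_x": [0.0], "dir_y": [0.0],
                   "width": [0.05], "length": [0.12], "size": [150.0]})
df = add_features(df, ["disp_angle", "log_wls"])
```

The pipeline code then only loops over names; the analytical combinations live in one place instead of being hardcoded per script.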
Currently the modeling features:

- are defined through the configuration files (either `regressor.yaml` or `classifier.yaml`),
- when the appropriate classes in `protopipe.mva` read them, they pass through `protopipe.mva.utils.prepare_data`, which adds modified versions of the basic DL1/DL2a variables to the dataframes (e.g. log10 of variables or more complex analytical combinations),
- most importantly, they are hardcoded into `write_dl2.py`.
As they stand, these 3 steps make it difficult, if not annoying and error-prone, to experiment with different features.
My current idea is to make a dictionary, exposed to the user through the documentation, in which all existing features and any new ones (so the dictionary would be open-ended) are mapped to integers.
So something like,
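A minimal sketch of what such a mapping could look like (all feature names, integer codes, and the `resolve_features` helper below are made up for illustration):

```python
# Hypothetical open-ended mapping of integer codes to features;
# entries 4 and 5 show that derived expressions can be registered too
FEATURE_CODES = {
    1: "hillas_width",
    2: "hillas_length",
    3: "hillas_intensity",
    4: "log10(hillas_width * hillas_length / hillas_intensity)",
    5: "atan2(cog_y - dir_y, cog_x - dir_x)",
}

def resolve_features(codes):
    """Translate the list of integers from the config into features."""
    return [FEATURE_CODES[c] for c in codes]

# e.g. a configuration file listing features: [1, 2, 4]
features = resolve_features([1, 2, 4])
```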
In this way the user would specify the features in the configuration files as a list of integers, which the DL2 script would then read, mapping each feature unambiguously to the estimation step.