EXCITED-CO2 / excited-workflow

A Machine Learning workflow to produce a dataset of global net ecosystem CO2 exchange fluxes.
https://excited-workflow.readthedocs.io/
Apache License 2.0

Use land cover class as categorical data #37

Open BSchilperoort opened 9 months ago

BSchilperoort commented 9 months ago

In #36 the land cover class is used as a normal (continuous value) variable. However, it is actually categorical data.

For this we can tell pycaret to treat it as such, which will make it use one-hot encoding. However, the resulting pycaret workflow does not convert to ONNX, so we will have to either find a way to make it convert or write our own encoder pipeline (and stop using pycaret in the production code).
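For reference, a minimal sketch of what one-hot encoding does to a land cover column, written with plain NumPy. The class codes are made up for illustration and are not taken from the actual dataset; a real encoder pipeline would also need to handle classes that were not seen during training.

```python
import numpy as np

# Hypothetical land cover class codes (illustrative values only).
land_cover = np.array([10, 40, 10, 130])

# Map each code to its position in the sorted list of unique classes.
classes, indices = np.unique(land_cover, return_inverse=True)

# One row per sample, one column per class, with a single 1 marking the class.
one_hot = np.eye(len(classes))[indices]

print(classes)
print(one_hot)
```

Treated as a continuous variable, class 130 would look "further" from class 10 than class 40 does; in the encoded form every class is simply its own column.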

BSchilperoort commented 9 months ago

It turns out that LightGBM supports categorical data. However, this is not compliant with ONNX. See the discussion here: https://github.com/onnx/onnxmltools/issues/309

The LightGBM documentation claims "LightGBM can use categorical features as input directly. It doesn’t need to convert to one-hot encoding, and is much faster than one-hot encoding (about 8x speed-up)." However, we cannot make use of this if we want to support ONNX.

geek-yang commented 9 months ago

Massaging the categorical data with one-hot encoding is usually not an optimal solution, especially for tree-based models. It is good to know that LightGBM actually supports categorical data.

Also, I never realized that ONNX does not support categorical data. Good to know. ONNX is quite clumsy in this respect: every ML model layer/structure must be specifically engineered in ONNX, so that the ONNX runner knows how to operate on the given input. So there is no (easy) way to bypass it 😂. Very annoying.

BSchilperoort commented 9 months ago

> Massaging the categorical data with one-hot encoding is usually not an optimal solution, especially for tree-based models. It is good to know that LightGBM actually supports categorical data.

Yeah the LightGBM documentation does mention that.

ONNX itself does support categorical data. However, the way LightGBM handles categorical data is incompatible with ONNX (at least for now).

In this notebook I managed to put LightGBM in a scikit-learn pipeline with one-hot encoding, and export it to ONNX: pipeline_onnx_model.ipynb.zip