MAINT Introduce use of set_output to output dataframes

INRIA / scikit-learn-mooc

Machine learning in Python with scikit-learn MOOC

https://inria.github.io/scikit-learn-mooc

Creative Commons Attribution 4.0 International


MAINT Introduce use of set_output to output dataframes #683

Closed ArturoAmorQ closed 1 year ago

ArturoAmorQ commented 1 year ago

Pandas output via the set_output API has been available since scikit-learn v1.2.

This PR brings that feature to the MOOC.

ogrisel commented 1 year ago

This is still a draft, so I did not merge it. Feel free to mark it as ready and merge.

ogrisel commented 1 year ago

I think we should use set_output(transform="pandas") by default in the notebook titled "Encoding of categorical variables".

ArturoAmorQ commented 1 year ago

> I think we should use set_output(transform="pandas") by default in the notebook titled "Encoding of categorical variables".

The global setting raises a ValueError (Pandas output does not support sparse data) when the model is trained at the end of the notebook.

We can still set the output to a dataframe on the individual instances created in the rest of the notebook, and use fresh instances with the default output for the pipeline.

glemaitre commented 1 year ago

In the linear_model_regularization notebook, I am wondering whether we should advocate getting the feature names from model[:-1].get_feature_names_out(...), or instead call set_output and then read model[-1].feature_names_in_.

ogrisel commented 1 year ago

+1 for model[-1].feature_names_in_, which should make the code even shorter.

glemaitre commented 1 year ago

Otherwise LGTM.