JuliaAI / MLJModels.jl

Home of the MLJ model registry and tools for model queries and model code loading

allow users to specify float types from OneHotEncoder and ContinuousEncoder #565

Open tiemvanderdeure opened 5 days ago

tiemvanderdeure commented 5 days ago

I am using a LinearPipeline consisting of a OneHotEncoder in combination with a NeuralNetworkClassifier, which I think is a basic yet really neat use case for linear pipelines.
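Roughly, the setup looks like this (a minimal sketch with toy data; it assumes MLJ and MLJFlux are installed, and the feature names are just for illustration):

using MLJ
NeuralNetworkClassifier = @load NeuralNetworkClassifier pkg=MLJFlux

X = (color = coerce(rand(["red", "green", "blue"], 20), Multiclass),
     height = 150 .+ 50 .* rand(20))                 # plain Float64 column
y = coerce(rand(["yes", "no"], 20), Multiclass)

pipe = OneHotEncoder() |> NeuralNetworkClassifier()
mach = machine(pipe, X, y)
fit!(mach)    # training emits a warning like the one below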

However, Flux gives this warning:

┌ Warning: Layer with Float32 parameters got Float64 input.
│   The input will be converted, but any earlier layers may be very slow.
│   layer = Dense(11 => 16, relu)  # 192 parameters
│   summary(x) = "11×20 Matrix{Float64}"
└ @ Flux ~/.julia/packages/Flux/CUn7U/src/layers/stateless.jl:60

I know MLJ tries to abstract away from data types to use scitypes instead, but in this case it would be great to be able to specify the data type a OneHotEncoder gives as output.

For OneHotEncoder this happens to be really easy to do (it calls float); for ContinuousEncoder it would be a bit trickier (it calls coerce).

I also don't know if it conflicts too much with the scitype philosophy or if there's an easier way around this, so I thought I'd ask here before making a PR (which I would gladly do).

ablaom commented 4 days ago

This sounds very reasonable. For the ContinuousEncoder, perhaps we can just call Float32.(...) after the coercion. Be great to have a PR, thanks!
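For illustration (just a sketch), the extra step amounts to:

using MLJ                          # provides coerce and the scitypes
v = [1, 2, 3]                      # Count scitype
v64 = coerce(v, Continuous)        # what ContinuousEncoder effectively does now: Vector{Float64}
v32 = Float32.(v64)                # the suggested extra step: Vector{Float32}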

Would it be enough to have a single option Float32 or Float64, or do we want more than that?

@EssamWisam Perhaps we can keep such an option in mind for the encoders you are working on. We should have a consistent name for the hyper-parameter. If it's a flag, how about use_float32?

@tiemvanderdeure What do you think?

EssamWisam commented 4 days ago

I think it's common to train deep learning models at 16-bit precision and sometimes even lower. Thus, one way to approach this may be to add a transformer (e.g., CastingTransformer) that can cast column types (e.g., Float64 to Float32, or Int32 to Int16, and so on) according to user requirements.
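A rough sketch of what I have in mind (purely hypothetical; CastingTransformer is not an existing model, and the type-map field is just one possible design):

using MLJModelInterface, Tables
const MMI = MLJModelInterface

# hypothetical static transformer: casts selected columns to user-specified types
mutable struct CastingTransformer <: MMI.Static
    types::Dict{Symbol,DataType}    # e.g. Dict(:height => Float32, :count => Int16)
end

function MMI.transform(t::CastingTransformer, fitresult, X)   # fitresult is `nothing` for Static models
    cols = Tables.columntable(X)                  # NamedTuple of column vectors
    names = keys(cols)
    casted = map(names) do name
        col = getproperty(cols, name)
        T = get(t.types, name, nothing)
        T === nothing ? col : T.(col)             # broadcast-cast if a type was requested
    end
    return NamedTuple{names}(casted)
end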

Alternatively, I think it would be good for Flux models to do this conversion themselves (e.g., while batching). That would be easier, because one could then try out a Flux model at different parameter precisions without also manually changing the precision of the data.

tiemvanderdeure commented 4 days ago

My initial idea was to be a bit more flexible by adding a datatype hyper-parameter that defaults to Float64.
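For instance (hypothetical usage; no such field exists in the released OneHotEncoder):

hot = OneHotEncoder(datatype=Float32)        # hypothetical hyper-parameter
pipe = hot |> NeuralNetworkClassifier()      # one-hot columns would then already be Float32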

Would a CastingTransformer just transform columns into the same scitype but a different DataType? I can see the use case, but it would make pipelines longer than if this were integrated into existing transformers.

I think doing the conversion in MLJFlux is also reasonable - I can't see any reason why you wouldn't want the input data to be converted to whatever datatype the neural net has. I don't know if there are any other MLJ models that have specific requirements on input data? I can't think of any off the top of my head, but I only know a fraction of all the models.

EssamWisam commented 3 days ago

When I proposed having a separate transformer, the reason was twofold: it would not make sense for the hyper-parameter to take only specific values for the type (e.g., only Float64 or Float32), since the model may also be Float16, or it may use an embedding layer requiring the input to be ordinal-encoded integers; and a separate transformer could also be used for other applications (e.g., casting specific columns of a dataset to Float16 while casting others to Int16 to save memory before feeding the data into some model).

If I understand correctly now, you meant that datatype does not have to be restricted to specific types (i.e., the user can supply the type). While I think this can work (if a little tricky to implement), I still lean towards having Flux or MLJFlux be responsible for this, as it can be argued that, among all the models one can insert in a pipeline, only Flux models need this; the rest depend only on the scientific type. Otherwise, developing something generic that serves Flux models as well as other purposes could also make sense, albeit equally tricky.

tiemvanderdeure commented 3 days ago

Just made a quick PR to show what I had in mind - if it turns out to be better to handle this in (MLJ)Flux instead, we simply don't merge the PR.

ablaom commented 2 days ago

After thinking about this more, my vote would be to add this to MLJFlux, as this is, for now, the only package fussy about float types. In the rare case we want float32 somewhere else, one can always insert an ordinary function into the pipeline. For example, if you have a float64 table, you could insert float32ify, where

using Tables   # Tables.matrix converts any Tables.jl-compatible table to a Matrix

float32ify(X) = Float32.(Tables.matrix(X))

which returns a Float32 matrix, which MLJFlux models will accept. It seems overkill to have a new Static transformer to generalise this kind of simple conversion, no?
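If I remember right, ordinary functions are accepted as pipeline components, so something like this should work (sketch only):

pipe = Pipeline(OneHotEncoder(), float32ify, NeuralNetworkClassifier())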

In MLJFlux, I wouldn't implement the conversion by adding a layer to the Flux model. Rather do it in the data pre-processing step, so that it only needs to happen once.

ablaom commented 2 days ago

> In MLJFlux, I wouldn't implement the conversion by adding a layer to the Flux model. Rather do it in the data pre-processing step, so that it only needs to happen once.

Admittedly, this is a bit of a pain because the reformat methods do not see the model, which would carry the float type hyperparameter. 🤷🏾

EssamWisam commented 1 day ago

> Admittedly, this is a bit of a pain because the reformat methods do not see the model, which would carry the float type hyperparameter. 🤷🏾

In the current state, Flux will behave as reported above (as far as I can tell) if the input is FloatX and the model is FloatY with X≠Y. The most frequent case will have X=64 and Y=32. If we make a reformat method handle just this case, we don't solve the problem for all setups, but we do for by far the most frequent one (i.e., still an improvement over the status quo).
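As a rough illustration of what I mean (not actual MLJFlux code; name and placement are assumed):

# narrow only the common Float64 case during data pre-processing; leave everything else alone
narrow_to_float32(x::AbstractArray{Float64}) = Float32.(x)
narrow_to_float32(x) = x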