Open tiemvanderdeure opened 5 days ago
This sounds very reasonable. For the `ContinuousEncoder`, perhaps we can just call `Float32.(...)` after the coercion. Be great to have a PR, thanks!
Would it be enough to have a single option `Float32` or `Float64`, or do we want more than that?
@EssamWisam Perhaps we can keep such an option in mind for the encoders you are working on. We should have a consistent name for the hyper-parameter. If it's a flag, how about `use_float32`?

@tiemvanderdeure What do you think?
I think it's common to train deep learning models at 16-bit precision and sometimes even lower. Thus, I think one way to approach this may be to add a transformer (e.g., `CastingTransformer`) that can cast column types (e.g., `Float64` to `Float32`, or `Int32` to `Int16`, and so on) according to user requirements.
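A minimal sketch of what such a casting step could look like (the name `cast_columns` and its `Dict`-of-rules interface are purely illustrative, not an existing MLJ API):

```julia
using Tables

# Hypothetical helper: cast selected columns of a table to new element
# types, leaving all other columns untouched.
function cast_columns(X, rules::Dict{Symbol,DataType})
    cols = Tables.columntable(X)
    return NamedTuple(
        name => haskey(rules, name) ? convert.(rules[name], col) : col
        for (name, col) in pairs(cols)
    )
end

X = (a = [1.0, 2.0], b = Int32[1, 2])
Xc = cast_columns(X, Dict(:a => Float32, :b => Int16))
```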
Alternatively, I think it would be good for Flux models to do this conversion themselves (e.g., while batching). That would be easier, because one could then try out a Flux model at different parameter precisions without also manually changing the precision of the data.
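For illustration, this is roughly what one does by hand today (assuming a recent Flux, where `Dense` layers default to `Float32` parameters):

```julia
using Flux

model = Chain(Dense(4 => 2), softmax)  # Dense weights are Float32 by default
x64 = rand(Float64, 4, 8)              # Float64 input triggers Flux's slow-path warning
y = model(Float32.(x64))               # casting per batch keeps types matched
```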
My initial idea was to be a bit more flexible by adding a `datatype` hyper-parameter that defaults to `Float64`.
Would a `CastingTransformer` just transform columns into the same scitype but a different `DataType`? I can see the use case, but it would make pipelines longer than if we can integrate this into existing transformers.
I think doing the conversion in MLJFlux is also reasonable - I can't see any reason why you wouldn't want the input data to be converted to whatever datatype the neural net uses. I don't know if any other MLJ models have specific requirements on their input data types; I can't think of any off the top of my head, but I only know a fraction of all the models.
When I proposed having a separate transformer, the reason was that it would not make sense for the hyper-parameter to take only specific type values (e.g., only `Float64` or `Float32`): the model may also be `Float16`, or it may use an embedding layer requiring the input to be ordinally encoded integers. Moreover, a separate transformer could also be used for other applications (e.g., casting specific columns of a dataset to `Float16` while casting others to `Int16` to save memory, before feeding the data into some model).
If I understand correctly now, you meant that `datatype` does not have to be restricted to specific types (e.g., the user can supply the type). While I think this can work (though it would be tricky to implement), I still lean towards having Flux or MLJFlux be responsible for this, as it can be argued that among all the models one can insert in a pipeline, only Flux models need this; the rest only depend on the scientific type. Otherwise, developing something generic that could be used for Flux models or for other purposes could also make sense, albeit that would be tricky as well.
Just made a quick PR to show what I had in mind - it might still be better to handle this in (MLJ)Flux instead, then we don't merge the PR.
After thinking about this more, my vote would be to add this to MLJFlux, as this is, for now, the only package fussy about float types. In the rare case we want float32 somewhere else, one can always insert an ordinary function into the pipeline. For example, if you have a float64 table, you could insert `float32ify`, where

```julia
float32ify(X) = Float32.(Tables.matrix(X))
```

which returns a `Float32` matrix, which MLJFlux models will accept. It seems overkill to have a new `Static` transformer to generalise this kind of simple conversion, no?
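For example, applied to a small float64 column table (only `Tables` is needed; the table literal is illustrative):

```julia
using Tables

float32ify(X) = Float32.(Tables.matrix(X))

X = (a = [1.0, 2.0], b = [3.0, 4.0])   # a column table with Float64 columns
M = float32ify(X)                      # 2×2 Float32 matrix
```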
In MLJFlux, I wouldn't implement the conversion by adding a layer to the Flux model. Rather do it in the data pre-processing step, so that it only needs to happen once.
Admittedly, this is a bit of a pain because the `reformat` methods do not see the `model`, which would carry the float type hyperparameter. 🤷🏾
In the current state, Flux will behave as reported above (as far as I can tell) if the input is FloatX and the model is FloatY with X ≠ Y. The most frequent case will have X = 64 and Y = 32. If we make a `reformat` method to handle just this case, we don't solve the problem for all setups, but we at least do so for by far the most frequent one (i.e., still an improvement over the status quo).
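The special-cased conversion could be as simple as a dispatch on element type (a sketch only; the actual `reformat` plumbing in MLJ differs and, as noted, does not see the model):

```julia
# Narrow Float64 arrays to Float32; pass any other array through unchanged.
narrow(x::AbstractArray{Float64}) = Float32.(x)
narrow(x::AbstractArray) = x

A = rand(Float64, 3, 5)
B = rand(Float32, 3, 5)
```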
I am using a `LinearPipeline` consisting of a `OneHotEncoder` in combination with a `NeuralNetworkClassifier`, which I think is a basic yet really neat use case for linear pipelines. However, `Flux` gives this warning:

I know MLJ tries to abstract away from data types to use scitypes instead, but in this case it would be great to be able to specify the data type a `OneHotEncoder` gives as output. For `OneHotEncoder` this happens to be really easy to do (it calls `float`); for a `ContinuousEncoder` it would be a bit trickier (it calls `coerce`).

I also don't know if it conflicts too much with the scitype philosophy or if there's an easier way around this, so I thought I'd ask here before making a PR (which I would gladly do).