Add an sklearn-like interface for creating pipelines in terms of fitting, projection, and combining operations on matrices.
Current thoughts:
I deviated from the design docs a little to make the inheritance more sensible. I made a default PipelineBase, with Pipeline inheriting from it. I then made a PipelineStep that also inherits from PipelineBase. Estimator and Transformer both inherit from PipelineStep.
PipelineBase and Pipeline differ because a Pipeline has steps, which isn't true of a single PipelineStep.
PipelineBase and PipelineStep differ to indicate that each step has a step_name associated with it. The split also gives transformers and estimators a shared interface for how they are printed and how they can be concatenated to create a pipeline.
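A minimal sketch of the hierarchy described above, assuming S4 classes. The slot names `fitted` and `steps` and the helper `combine_steps()` are hypothetical and used only for illustration; only the class names and `step_name` come from this PR.

```r
library(methods)

# Shared base class. The `fitted` slot is an assumption, added so the
# class has at least one slot and is not implicitly virtual.
setClass("PipelineBase", slots = c(fitted = "logical"),
         prototype = list(fitted = FALSE))

# A Pipeline holds steps; a single step does not.
setClass("Pipeline", contains = "PipelineBase", slots = c(steps = "list"))

# Every step carries a step_name.
setClass("PipelineStep", contains = "PipelineBase",
         slots = c(step_name = "character"))
setClass("Transformer", contains = "PipelineStep")
setClass("Estimator", contains = "PipelineStep")

# Hypothetical helper: concatenate individual steps into a Pipeline.
combine_steps <- function(...) {
  steps <- list(...)
  stopifnot(all(vapply(steps, is, logical(1), class2 = "PipelineStep")))
  names(steps) <- vapply(steps, function(s) s@step_name, character(1))
  new("Pipeline", steps = steps)
}

normalize <- new("Transformer", step_name = "normalize")
cluster <- new("Estimator", step_name = "cluster")
pipe <- combine_steps(normalize, cluster)
names(pipe@steps)
```

Because both Transformer and Estimator extend PipelineStep, printing and concatenation behavior only needs to be written once at that level.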
I had to rename transform() to project(), because base R already has a generic function with that name. Similarly, predict() already exists in the stats package, so I renamed it to estimate().
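A small sketch of the naming issue, assuming S4 generics. The argument names `object` and `x` and the class `ScaleStep` are hypothetical; the actual signatures in this PR may differ.

```r
library(methods)

# Both original names are already taken in a default R session.
is.function(base::transform)
is.function(stats::predict)

# Fresh generic names avoid masking or extending the existing functions.
setGeneric("project", function(object, x, ...) standardGeneric("project"))
setGeneric("estimate", function(object, x, ...) standardGeneric("estimate"))

# Hypothetical step that scales a matrix by a constant factor.
setClass("ScaleStep", slots = c(factor = "numeric"))
setMethod("project", "ScaleStep", function(object, x, ...) {
  x * object@factor
})

m <- matrix(1:4, nrow = 2)
scaled <- project(new("ScaleStep", factor = 2), m)
```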
Tests will be added in a sister PR, since no concrete transformers/estimators exist yet to test the functionality here.
I'm not sure which methods I should document in detail, given that we are not sure how much of this we want to expose. I documented the generics themselves, to give a high-level view of how to use the methods in both PipelineSteps and Pipeline. However, it isn't clear to me whether I need to keep providing an extensive docstring for every overridden method in child classes.
I'm not sure which classes I should be exposing in the reference either. I found that previous BPCells classes (i.e. IterableMatrix) aren't heavily described in the reference. I provided some information on Pipeline, Estimator, and Transformer, and exposed them on the reference page. I also tried to explain in the docstring how to create a Transformer and an Estimator yourself.
I'm not completely sold on using the show() method as an analog of Python's __repr__() dunder method. I think it could be more useful to make it behave more like what you used to display IterableMatrix, i.e. where we still have information on what steps are in a pipeline, but also higher-level information, like hyperparameters or details on what the step has been fit to. In that case, we would have a __repr__() analog somewhere else.
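A rough sketch of that alternative, assuming S4 classes. The class names `DemoStep` and `DemoPipeline`, the slots `params` and `fitted`, and the helper `short_repr()` are all hypothetical and not part of this PR.

```r
library(methods)

setClass("DemoStep", slots = c(step_name = "character",
                               params = "list",
                               fitted = "logical"))
setClass("DemoPipeline", slots = c(steps = "list"))

# Hypothetical __repr__() analog: a terse, one-line description of a step.
short_repr <- function(step) {
  sprintf("%s(%s)", as.character(class(step))[1], step@step_name)
}

# show() lists the steps plus hyperparameters and fit status.
setMethod("show", "DemoPipeline", function(object) {
  cat(sprintf("DemoPipeline with %d step(s)\n", length(object@steps)))
  for (i in seq_along(object@steps)) {
    step <- object@steps[[i]]
    params <- paste(names(step@params), unlist(step@params),
                    sep = "=", collapse = ", ")
    status <- if (isTRUE(step@fitted)) "fitted" else "not fitted"
    cat(sprintf("  %d. %s [%s] {%s}\n", i, short_repr(step), status, params))
  }
  invisible(object)
})

pca <- new("DemoStep", step_name = "pca", params = list(n_dims = 10),
           fitted = FALSE)
demo <- new("DemoPipeline", steps = list(pca))
demo
```

Keeping the terse form in a separate helper leaves show() free to print the richer, multi-line summary.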
It is probably redundant to have both project() and estimate(). What do you think about combining them into one?