basf / MolPipeline

MIT License

Add stacked Pipelines #19

Open c-w-feldmann opened 2 months ago

c-w-feldmann commented 2 months ago
mol_pipeline = MolPipeline(
    [
        ("smi2mol", SmilesToMolPipelineElement()),
        ("mol2cleanmol", MelloddyPipelineElement()),
        ("mol2phys", MolToRDKitPhysChem(normalize=True)),
    ],
    n_jobs=n_jobs,
    memory=joblib.Memory(),
    none_handling="fill_dummy",
)

rf_pipeline = Pipeline(
    [
        ("molpipeline", mol_pipeline),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=42, oob_score=True)),
    ],
    n_jobs=n_jobs,
    memory=joblib.Memory(),
)

instead of having to list every element again:

mol_pipeline = MolPipeline(
    [
        ("smi2mol", SmilesToMolPipelineElement()),
        ("mol2cleanmol", MelloddyPipelineElement()),
        ("mol2phys", MolToRDKitPhysChem(normalize=True)),
    ],
    n_jobs=n_jobs,
    memory=joblib.Memory(),
    none_handling="fill_dummy",
)

rf_pipeline = Pipeline(
    [
        ("smi2mol", SmilesToMolPipelineElement()),
        ("mol2cleanmol", MelloddyPipelineElement()),
        ("mol2phys", MolToRDKitPhysChem()),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=42, oob_score=True)),
    ],
    n_jobs=n_jobs,
    memory=joblib.Memory(),
)

This could perhaps work if the pipeline checks during init which type each pipeline element has and treats a nested pipeline accordingly. Could nested hyperparameters be an issue? Alternative: a helper function such as molpipeline.stack_pipelines(list_of_pipelines).
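To make the alternative more concrete, here is a rough sketch of what a stack_pipelines helper might do. Only the function name comes from the suggestion above; the behaviour is an assumption. It flattens the steps of several pipelines into one list and prefixes clashing step names. StepContainer is a hypothetical stand-in for any object with a steps attribute, and the string elements are placeholders, so nothing here is existing MolPipeline API.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple

Step = Tuple[str, Any]


@dataclass
class StepContainer:
    """Hypothetical stand-in for any pipeline object exposing a `steps` list."""

    steps: List[Step]


def stack_pipelines(pipelines: List[Any]) -> List[Step]:
    """Flatten several pipelines into one list of uniquely named steps.

    Assumption: each pipeline has a `steps` attribute holding (name, element) pairs.
    """
    flat: List[Step] = []
    seen = set()
    for index, pipeline in enumerate(pipelines):
        for name, element in pipeline.steps:
            # Prefix a clashing name with the pipeline position to keep it unique.
            unique = name if name not in seen else f"p{index}_{name}"
            if unique in seen:
                raise ValueError(f"Cannot derive a unique name for step {name!r}")
            seen.add(unique)
            flat.append((unique, element))
    return flat


# Placeholder strings stand in for real pipeline elements and estimators.
first = StepContainer([("smi2mol", "element_a"), ("mol2phys", "element_b")])
second = StepContainer([("rf", "model")])
stacked_steps = stack_pipelines([first, second])
```

The resulting list could then be passed to a single pipeline constructor. One consequence for the hyperparameter question: scikit-learn addresses step parameters as `<step>__<param>`, so any step that gets renamed while stacking would also change the parameter names used in a grid search.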