This PR mainly introduces `amltk.sklearn.CVEvaluation`. This is something that can create a `Task[[Trial, Node], Trial.Report]` that can be optimized against for a prototypical sklearn setup.
```python
from amltk.sklearn import CVEvaluation
from amltk.pipeline import Component, request
from amltk.optimization import Metric
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import get_scorer

# Any X, y will do here; this dataset is just an illustrative choice,
# not part of the PR itself.
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

pipeline = Component(
    RandomForestClassifier,
    config={"random_state": request("random_state")},
    space={"n_estimators": (10, 100), "criterion": ["gini", "entropy"]},
)

evaluator = CVEvaluation(
    X,
    y,
    cv=3,
    additional_scorers={"f1": get_scorer("f1")},
    store_models=False,
    train_score=True,
)

history = pipeline.optimize(
    target=evaluator,
    metrics=Metric("accuracy", minimize=False, bounds=(0, 1)),
    n_workers=4,
)
print(history.df())
```
Namely, its parameters feature:

- Sensible defaults for splitting based on `strategy: Literal["holdout", "cv"]`, or pass a custom splitter.
- Options to `train_score: bool = False` or `store_models: bool = False`.
- Pass `additional_scorers: dict[str, _Scorer]` for metrics to track other than the optimization ones attached to the `Trial`.
- `params: dict[str, Any]`, which uses sklearn's new metadata routing for things like `sample_weight` and scorer params (see the sketch after this list).
- This uses `scikit-learn>=1.4` and likely means this will enforce a lower bound. I do not want to maintain backwards compatibility.
- `task_hint: bool | None = None` to hint at the task type. This comes up in the AutoML Benchmark, where sklearn incorrectly identifies some targets as regression.
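To make the `params` point concrete, the sketch below shows the plain scikit-learn >= 1.4 metadata-routing mechanics it builds on. Everything here is stock sklearn with made-up data; how `CVEvaluation` forwards `params` internally is not shown.

```python
import numpy as np
from sklearn import set_config
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Metadata routing must be explicitly enabled in scikit-learn >= 1.4.
set_config(enable_metadata_routing=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)
sample_weight = rng.uniform(size=100)

# An estimator must explicitly request the metadata it wants routed to fit().
clf = RandomForestClassifier(n_estimators=10).set_fit_request(sample_weight=True)

# With routing enabled, entries in `params` reach whichever step requested them.
results = cross_validate(clf, X, y, cv=3, params={"sample_weight": sample_weight})
print(results["test_score"])
```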
What this gains is correctly setting up tasks: serializing data to be passed to workers, managing memory (i.e., not holding all splits/models in memory at once), and interacting with sklearn properly, such as setting seeds.
Additional
Move away from `StoredValue` to just having a `Stored[T]`, on which you can call `load()`. Useful for situations where you don't care about where it lives, just give me the `T`.
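As a rough illustration of that contract, here is a minimal conceptual sketch. This is not amltk's actual implementation; the `where`/`read` fields are assumptions purely for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Generic, TypeVar

T = TypeVar("T")

@dataclass
class Stored(Generic[T]):
    """A reference to a T that lives somewhere else (conceptual stand-in)."""

    where: Path                    # callers never need to inspect this
    read: Callable[[Path], T]      # how to deserialize from that location

    def load(self) -> T:
        """Just give me the T, wherever it happens to be stored."""
        return self.read(self.where)
```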
TODOs
- Test parameters like `groups`, sample weights, and scorers. Will move this to its own issue for the sake of marching onwards.
- Test clustering; theoretically this shouldn't be problematic except for the task-type identification part. However, `cross_validate` doesn't care, as the sketch below shows.
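As a quick sanity check of that last point, plain `cross_validate` will happily evaluate a clusterer with no labels at all (random data used here for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_validate

X = np.random.default_rng(0).normal(size=(100, 5))

# No y needed: with no explicit scoring, cross_validate falls back to
# KMeans.score(), i.e. negative inertia on each held-out split.
results = cross_validate(KMeans(n_clusters=3, n_init="auto"), X, cv=3)
print(results["test_score"])
```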