Open PierreGtch opened 9 months ago
Ping @tomMoral, to join the conversation
The proposed pattern couples the code that performs the evaluation (running the code + parallelization) with the process that decides the split. I would recommend decoupling them further, in line with what scikit-learn does, so that the API is similar, making the various concepts easy to grasp.

Basically, the `get_split` serves the same functionality as the `BaseCrossValidator` object in scikit-learn.
The API works with three methods:

- `__init__`: sets up the parameters of the split, if any.
- `get_n_splits`: takes a dataset and returns the number of splits (for instance, with leave-one-subject-out, the number of subjects).
- `split`: a generator that takes the dataset as input and, when iterated on, yields `(train_idx, test_idx)` pairs.

Coming back to the `Evaluation` object, you would have a single one, I guess, such that:
```python
memory = joblib.Memory(location="__cache__")


class Evaluation:
    def __init__(
        self,
        ...,
        n_nodes=1,  # number of data chunks to load in memory in parallel
        n_jobs=1,  # number of jobs per data chunk; one job fits one pipeline on one fold
        cv="intersubject",
    ):
        self.n_nodes = n_nodes
        self.n_jobs = n_jobs
        if isinstance(cv, str):  # make it easy if you want default parameters for cv
            cv = CV_CLASSES[cv]()
        self.cv = cv

    def process(self, pipelines, datasets):
        results = Parallel(n_jobs=self.n_jobs)(
            delayed(self.process_split)(p, d, train_idx, test_idx)
            for p in pipelines
            for d in datasets
            for (train_idx, test_idx) in self.cv.split(d)
        )
        return pd.DataFrame(results)

    @memory.cache
    def process_split(self, clf, dataset, train_idx, test_idx):
        clf = deepcopy(clf)
        X_train, X_test, y_train, y_test, metadata = self.paradigm.get_data(
            dataset, train_idx, test_idx
        )
        clf.fit(X_train, y_train)
        score = clf.score(X_test, y_test)
        return {"metadata": metadata, "clf": clf, "score": score}
```
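For concreteness, a cross-validator following the three-method API described above could look like this. This is a minimal sketch: the `LeaveOneSubjectOut` class and the `dataset.subjects` attribute are hypothetical illustrations, not part of any existing codebase.

```python
import numpy as np


class LeaveOneSubjectOut:
    """Hypothetical splitter following the __init__ / get_n_splits / split API."""

    def __init__(self, shuffle=False):
        # sets up the parameters of the split, if any
        self.shuffle = shuffle

    def get_n_splits(self, dataset):
        # with leave-one-subject-out, the number of splits is the number of subjects
        return len(np.unique(dataset.subjects))

    def split(self, dataset):
        # generator yielding (train_idx, test_idx) pairs, one per held-out subject
        subjects = np.asarray(dataset.subjects)
        for subject in np.unique(subjects):
            test_mask = subjects == subject
            yield np.where(~test_mask)[0], np.where(test_mask)[0]
```

Any object with these three methods could then be passed as the `cv` argument of `Evaluation`, mirroring how scikit-learn estimators accept splitter objects.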
Note that I changed the manual caching to use `joblib.Memory`, which is designed for caching calls to a function, and I flattened the parallelism (`joblib` handles nested parallelism poorly).
Thanks @tomMoral for your feedback!! But I am not sure this would completely work, because we have some quite specific constraints:

- `paradigm.get_data` is expensive because it loads from disk and pre-processes the data, so we would like to call it only once for all the splits and pipelines. Do you think this could be achieved through `joblib.Memory`?
- We cannot compute `test_idx` and `train_idx` before loading the data, because the only info we have about the datasets is the number of subjects they contain. We don't know the number of sessions or the number of examples per session before loading the data. Maybe we should try to change that? @bruAristimunha @sylvchev

This is why I proposed this nested parallelism. Maybe an in-between would be to implement `BaseCrossValidator`s that would receive only the data of one subject as input instead of a whole dataset?
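The "in-between" could look something like the sketch below: a splitter whose `split` takes only one subject's labels, so the indices can be computed after that subject is loaded. The `WithinSubjectCV` class and its fold-assignment scheme are purely illustrative assumptions.

```python
import numpy as np


class WithinSubjectCV:
    """Hypothetical splitter that sees only a single subject's data,
    so train/test indices are computed after that subject is loaded."""

    def __init__(self, n_folds=5):
        self.n_folds = n_folds

    def split(self, y):
        # y: labels of one subject's examples, known only once loaded
        n = len(y)
        fold = np.arange(n) % self.n_folds  # simple interleaved fold assignment
        for k in range(self.n_folds):
            test_mask = fold == k
            yield np.where(~test_mask)[0], np.where(test_mask)[0]
```

Each parallel worker would then load one subject, instantiate the splitter on that subject's data, and iterate over its folds, which keeps the flat parallelism of tomMoral's sketch while deferring index computation until after loading.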
After discussions at the braindecode code sprint, and following up on #460, I think we should break down the evaluations into something like that. This would remove all the for loops we have in the different evaluations and allow for larger parallelisation.