analysiscenter / cardio

CardIO is a library for data science research of heart signals
https://analysiscenter.github.io/cardio/
Apache License 2.0

How to pass train and validation data together? #36

Closed: AnoopRKulkarni closed this issue 4 years ago

AnoopRKulkarni commented 4 years ago

Hello,

Typically in Keras, when using fit_generator(), I can pass both training and validation batch generators and track validation accuracy during training.

Is there a way I can do this using batchflow for example? I see all examples suggest we use (dataset.train >> pipeline).run() only. Is there a way of passing dataset.train and dataset.test to KerasModel to get validation accuracy during training?

thanks in advance, ~anoop

AnoopRKulkarni commented 4 years ago

UPDATE:

One workaround I used was to pass the entire "dataset" to the pipeline and define a new function, "def train_with_validate()", inside the KerasModel class along the lines of "def train()", calling "self.fit(.., validation_split=0.2)" there instead of "self.train_on_batch()".

Or, should I simply use "test_on_batch()" in the pipeline after training?

thanks ~anoop

dpodvyaznikov commented 4 years ago

Hi, @DrAnoopKulkarni!

The workaround you've suggested might give an invalid validation score estimate if you use shuffling, or otherwise change the composition of a batch or the order of items in it. Here is why: on each iteration of run you pass one batch to the self.fit method, and the last 20% of items in that batch are used for validation. But if you use, e.g., run(..., shuffle=True), those items might have been at different positions in previous batches, and thus ended up in the train subset of self.fit.
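To make the leak concrete, here is a minimal, library-free sketch. It does not call Keras or batchflow; `leaked_items` and all of its parameters are hypothetical names used only for illustration. It mimics a positional split (last 20% of every batch held out) on a dataset that is reshuffled each epoch, and reports which items were used for both training and validation.

```python
import random


def leaked_items(n_items=100, batch_size=10, n_epochs=5, val_fraction=0.2, seed=0):
    """Return the items that were used both for training and for validation.

    Hypothetical illustration: every batch is split by position (the last
    `val_fraction` of the batch is held out) and the item order is reshuffled
    before each epoch, as with run(..., shuffle=True).
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    seen_in_train, seen_in_val = set(), set()
    n_val = int(batch_size * val_fraction)

    for _ in range(n_epochs):
        rng.shuffle(indices)  # new batch composition every epoch
        for start in range(0, n_items, batch_size):
            batch = indices[start:start + batch_size]
            split = len(batch) - n_val
            seen_in_train.update(batch[:split])  # first 80% of the batch
            seen_in_val.update(batch[split:])    # last 20% of the batch

    return seen_in_train & seen_in_val


leaked = leaked_items()
print(f"{len(leaked)} of 100 items were used for both training and validation")
```

With a single pass over the data (`n_epochs=1`) the two sets stay disjoint, because each item is seen exactly once. The overlap only appears once items are revisited in a different order.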

There are two ways to track the validation score without changing the library's code:

1) Use the batchflow.research module. It lets you estimate a model's performance during training using additional pipelines, conveniently train models with different hyperparameters, and train models several times to evaluate how stable the results are. See tutorial here.

2) Define train and validation pipelines, and perform run for each of them in a loop, e.g.:

train = ...
validation = ...
for _ in range(K):
    train.run(..., n_epochs=N)
    validation.run(...)
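The same alternating pattern can be sketched end to end without any library. This is only an illustration under stated assumptions: `ToyModel` and `run` below are hypothetical stand-ins, not batchflow or CardIO APIs, and the data is a synthetic noise-free line. The point it shows is that the train and validation splits are fixed once and kept disjoint, so shuffling the training data cannot leak items into validation.

```python
import random


class ToyModel:
    """Hypothetical one-parameter model y = w * x trained by gradient descent."""

    def __init__(self, lr=0.1):
        self.w = 0.0
        self.lr = lr

    def train_on_batch(self, batch):
        # One gradient step on the mean squared error of the batch.
        grad = sum(2 * (self.w * x - y) * x for x, y in batch) / len(batch)
        self.w -= self.lr * grad

    def test_on_batch(self, batch):
        # Mean squared error on the batch, without updating the model.
        return sum((self.w * x - y) ** 2 for x, y in batch) / len(batch)


def run(data, batch_size, step, n_epochs=1, shuffle=False, rng=None):
    """Hypothetical stand-in for pipeline.run: call `step` on every batch."""
    results = []
    for _ in range(n_epochs):
        items = list(data)
        if shuffle:
            rng.shuffle(items)
        for start in range(0, len(items), batch_size):
            results.append(step(items[start:start + batch_size]))
    return results


rng = random.Random(0)
points = [(x / 10, 3 * x / 10) for x in range(-20, 20)]  # synthetic data, y = 3x
rng.shuffle(points)
train_data, val_data = points[:32], points[32:]  # split once, kept disjoint

model = ToyModel()
history = []  # validation loss recorded after each training round
K, N = 5, 2
for _ in range(K):
    run(train_data, 8, model.train_on_batch, n_epochs=N, shuffle=True, rng=rng)
    val_losses = run(val_data, 8, model.test_on_batch)
    history.append(sum(val_losses) / len(val_losses))

print(history)
```

Here `history` plays the role of the validation curve: one entry per outer iteration, computed only on items the model never trained on.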

AnoopRKulkarni commented 4 years ago

Thank you @dpodvyaznikov

After a few runs I realized I wasn't getting the results I was looking for with my approach, and thanks for explaining why that would be so.

I will take a detailed look at the research module, both its philosophy and its usage.

However, for now, given my limited requirements, I guess your second approach will work.

Thanks again for your thoughts. Appreciate them.

Best regards, ~anoop