Quasars / orange-spectroscopy

patient repeats split between training and test sets #407

Open JamesCameron7 opened 4 years ago

JamesCameron7 commented 4 years ago

Hello

I'm trying to split my IR spectral dataset into training and test sets for classification models.

I have nine IR spectra per patient (nine per sample), and when I use the Data Sampler widget to split the dataset, spectra from individual patients end up in both the training and test sets.

Is there a way to prevent this and ensure that all nine spectra from one patient stay together in either the training or the test set? I've attached my workflow in case there is anything I am missing.

Any advice would be great!

Many thanks, James

[Attachments: Training and Test Tables; Quasar workflow]

markotoplak commented 4 years ago

Use the "Cross validation by feature" option of Test and Score. To use it, you will need to convert your ID variable from String to Categorical (use the Edit Domain widget). Also, because "Cross validation by feature" already performs CV, you need to remove the Data Sampler and connect all your data to Test and Score.
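For anyone scripting this outside the widgets, the same leave-one-patient-out idea can be sketched with scikit-learn's `LeaveOneGroupOut`; this is not the Orange widget's implementation, just an illustration, and all names (`X`, `y`, `patient_id`) and the toy data are assumptions.

```python
# Sketch: grouped CV where each fold holds out *all* spectra of one
# patient, so no patient appears on both sides of a fold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_patients, spectra_per_patient, n_wavenumbers = 20, 9, 100

# Toy stand-in for the real table: 9 spectra per patient, one label per patient.
X = rng.normal(size=(n_patients * spectra_per_patient, n_wavenumbers))
y = np.repeat(rng.integers(0, 2, size=n_patients), spectra_per_patient)
patient_id = np.repeat(np.arange(n_patients), spectra_per_patient)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=patient_id, cv=LeaveOneGroupOut())
print(scores.mean())
```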

JamesCameron7 commented 4 years ago

Thanks Marko. In that case, is it not possible to 'test on test data', since you aren't splitting the dataset prior to cross-validation?

markotoplak commented 4 years ago

"Cross validation by feature" is a leave-a-patient-out type of cross-validation, which tests on data that was not used for learning, so it should be methodologically sound. Yes, because you cannot set the train/test set proportions, time complexity can suffer, but I do not see other drawbacks.

Does that suffice? If not, I am interested to hear what you are trying to do.

JamesCameron7 commented 4 years ago

Thanks for clarifying! I don't generally use leave-one-out cross-validation, but it makes sense to me now.

What I normally do is split the dataset into training and test sets (70/30) and randomly resample the splits a number of times (say 50 iterations), then predict on each resampled test set and report the average over the model iterations in terms of sensitivity/specificity/accuracy, etc.
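As a point of comparison, that repeated 70/30 protocol can be kept patient-aware with scikit-learn's `GroupShuffleSplit`; this is a rough sketch under assumed names (`X`, `y`, `patient_id`) and an arbitrary classifier, not the actual Quasar workflow.

```python
# Sketch: 50 random patient-aware 70/30 splits, averaging
# accuracy/sensitivity/specificity over the repeats.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_patients, spectra_per_patient, n_wavenumbers = 40, 9, 100
X = rng.normal(size=(n_patients * spectra_per_patient, n_wavenumbers))
y = np.repeat(rng.integers(0, 2, size=n_patients), spectra_per_patient)
patient_id = np.repeat(np.arange(n_patients), spectra_per_patient)

# test_size=0.3 applies to *groups*, i.e. 30% of patients form the test set.
splitter = GroupShuffleSplit(n_splits=50, test_size=0.3, random_state=0)
acc, sens, spec = [], [], []
for train_idx, test_idx in splitter.split(X, y, groups=patient_id):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    tn, fp, fn, tp = confusion_matrix(
        y[test_idx], model.predict(X[test_idx]), labels=[0, 1]).ravel()
    acc.append((tp + tn) / (tp + tn + fp + fn))
    sens.append(tp / (tp + fn))
    spec.append(tn / (tn + fp))

print(np.mean(acc), np.mean(sens), np.mean(spec))
```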

Therefore it would be cool if there were an option to stratify by patient ID in the Data Sampler widget so I could compare to my previous analysis, but for now I'll try out the cross validation by feature!

markotoplak commented 4 years ago

So, if I understand you, you would need a Data Sampler that is aware of feature groups?

JamesCameron7 commented 4 years ago

Yes, I believe so.

Similar to the 'Cross-validation by feature' option, if there were, say, a 'Split based on feature' option within the Data Sampler, where the feature would be the ID category, then I think this could work?
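For what it's worth, here is a minimal sketch of the logic such a 'Split based on feature' option might implement; `split_by_feature` is a hypothetical helper, not existing Orange API.

```python
# Hypothetical group-aware split: all rows sharing one ID value land
# on the same side. Not existing Orange API, just the core logic.
import numpy as np

def split_by_feature(ids, train_fraction=0.7, seed=0):
    """Return boolean masks (train, test) that never split an ID group."""
    rng = np.random.default_rng(seed)
    unique_ids = np.unique(ids)
    rng.shuffle(unique_ids)
    n_train = int(round(train_fraction * len(unique_ids)))
    train_ids = set(unique_ids[:n_train])
    train_mask = np.array([i in train_ids for i in ids])
    return train_mask, ~train_mask

# Example: 5 patients x 9 spectra each; no patient crosses the split.
patient_id = np.repeat(np.arange(5), 9)
train_mask, test_mask = split_by_feature(patient_id)
assert not (set(patient_id[train_mask]) & set(patient_id[test_mask]))
```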