allenai / allennlp

An open-source NLP research library, built on PyTorch.
http://www.allennlp.org
Apache License 2.0

K fold cross validation #1912

Closed abhishekraok closed 6 years ago

abhishekraok commented 6 years ago

It would be great if we could have k-fold cross validation. As input parameters we could specify the training data, no validation data, and a new parameter for the number of folds, with any k > 1 indicating that k-fold cross validation should be performed.

Perhaps one can use the KFold class from sklearn.model_selection.
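As a rough illustration of the sklearn suggestion above, the sketch below shows how KFold hands back train and validation indices for each fold. The `instances` list is toy data made up for the example, not an AllenNLP dataset, and nothing here is part of AllenNLP itself.

```python
from sklearn.model_selection import KFold

# Toy stand-in for a list of training instances (an assumption for illustration).
instances = [f"example-{i}" for i in range(10)]

# KFold needs at least 2 splits; a fixed seed keeps the shuffled folds reproducible.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

folds = []
for train_idx, val_idx in kfold.split(instances):
    # split() yields integer index arrays, so map them back to the instances.
    train = [instances[i] for i in train_idx]
    val = [instances[i] for i in val_idx]
    folds.append((train, val))
```

Each instance ends up in the validation part of exactly one fold and in the training part of the others.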

abhishekraok commented 6 years ago

I would be happy to work on this with some guidance.

matt-gardner commented 6 years ago

@abhishekraok, thanks for the offer to help! This feels like a big piece of work, and it needs some design thought before anyone should start working on it, though. I'm not sure what the right way is to implement this. One option is to just have something of a wrapper around our train command, where we split the data files for you, run train a bunch of times, and aggregate the results. This would be extremely slow and expensive for a large model and a large dataset, which is why people don't do cross validation very much with neural nets.

Another option is to just assume that if you're using cross validation, you have a small dataset that fits in memory and doesn't take too long to train, so we can get around having to write separate files. Then you would just write a script that loads the data, splits it, then does most of the work from the train command, once for each fold, and compiles the result.

These two options might sound similar, but the first actually writes things to disk and can literally re-use the train command in a shell script, while the second keeps things in memory and needs Python code that's similar to what's in train.py, but a little bit different. Which one you pick really depends on how much data you have.
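A minimal sketch of the first option (write fold files to disk, then re-use the train command) might look like the following. It assumes a data file with one example per line, which will not hold for every dataset reader, and it only builds the commands rather than running them. The helper names `write_folds` and `train_commands` are hypothetical; the `-s` and `-o` flags and the `train_data_path` / `validation_data_path` config keys are the ones the train command already uses.

```python
import json
import random
from pathlib import Path


def write_folds(data_file, out_dir, k, seed=0):
    """Split a line-per-example file into k train/validation file pairs."""
    lines = Path(data_file).read_text().splitlines()
    random.Random(seed).shuffle(lines)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pairs = []
    for fold in range(k):
        # Every k-th line (offset by the fold number) is held out for validation.
        val = lines[fold::k]
        train = [line for i, line in enumerate(lines) if i % k != fold]
        train_path = out / f"fold{fold}_train.txt"
        val_path = out / f"fold{fold}_val.txt"
        train_path.write_text("\n".join(train) + "\n")
        val_path.write_text("\n".join(val) + "\n")
        pairs.append((train_path, val_path))
    return pairs


def train_commands(config_file, pairs, run_dir):
    """Build one `allennlp train` command per fold, overriding the data paths."""
    commands = []
    for fold, (train_path, val_path) in enumerate(pairs):
        overrides = json.dumps({
            "train_data_path": str(train_path),
            "validation_data_path": str(val_path),
        })
        commands.append([
            "allennlp", "train", str(config_file),
            "-s", str(Path(run_dir) / f"fold{fold}"),
            "-o", overrides,
        ])
    return commands
```

Each command list could then be handed to `subprocess.run`, and the per-fold metrics collected from the separate serialization directories afterwards.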

abhishekraok commented 6 years ago

In my case the dataset is quite small and fits into memory, so the second option seems good.
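For the in-memory option, the overall loop could be sketched roughly as below. `train_and_evaluate` is a hypothetical callback, not an AllenNLP function: it is assumed to build a fresh model, train it on the training instances, and return a single validation metric. What goes inside it would be the part that resembles train.py.

```python
import random
from statistics import mean


def cross_validate(instances, k, train_and_evaluate, seed=0):
    """Run train_and_evaluate once per fold and average the returned metric."""
    if k < 2:
        raise ValueError("k-fold cross validation needs k >= 2")
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    scores = []
    for fold in range(k):
        # Hold out every k-th instance (offset by the fold number) for validation.
        val = shuffled[fold::k]
        train = [x for i, x in enumerate(shuffled) if i % k != fold]
        # The callback is expected to start from a fresh model each time.
        scores.append(train_and_evaluate(train, val))
    return mean(scores), scores
```

The function returns both the averaged metric and the per-fold scores, so the spread across folds can be inspected as well.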

ruleGreen commented 5 years ago

Hello, I would like to do k-fold cross validation, but I run the model with a JSON config file. What should I do to get k-fold? I think that if I change the config file each time I split the dataset, I get a separate model for each fold, but I do not know how to ensemble them together at the end. Or is there another way to do that?
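One common way to combine per-fold classifiers, sketched here only as an assumption about what "ensemble" could mean in this question, is to average the class probabilities each fold's model predicts for the same input. The function names are hypothetical and the probabilities are plain Python lists; obtaining them from trained models is left out.

```python
def average_probabilities(per_model_probs):
    """Average class probabilities from several models for a single input."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [
        sum(probs[c] for probs in per_model_probs) / n_models
        for c in range(n_classes)
    ]


def ensemble_predict(per_model_probs):
    """Return the index of the class with the highest averaged probability."""
    averaged = average_probabilities(per_model_probs)
    return max(range(len(averaged)), key=averaged.__getitem__)
```

If the goal is only to estimate how well a configuration performs, averaging the validation metrics across folds may be enough and no ensembling is needed.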