MatchZoo splits the data into train, test, and valid sets; hence, the test data covers only a subset of queries (with their corresponding documents) for the ranking task. I would like to perform 5-fold cross-validation with MatchZoo in order to have every query appear in a test file. Is it possible? Could you please give me some indications? Thanks in advance.
For datasets with a large number of samples, we split the data into train, validation, and test sets, usually with an 8:1:1 ratio. If you have a small dataset and want to perform 5-fold cross-validation, you can split the dataset into 5 folds; the function `split_train_valid_test` in matchzoo/inputs/preparation.py can help you.
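For a small dataset, a minimal sketch of such an 8:1:1 split at the query level could look like the code below (this is not the actual `split_train_valid_test` implementation; it assumes `relations` is a list of `(label, qid, docid)` tuples):

```python
import random

def split_train_valid_test_sketch(relations, ratios=(0.8, 0.1, 0.1)):
    # Split at the query level so no query leaks across the three sets.
    qids = list({q for _, q, _ in relations})
    random.shuffle(qids)
    n_train = int(len(qids) * ratios[0])
    n_valid = int(len(qids) * ratios[1])
    train_q = set(qids[:n_train])
    valid_q = set(qids[n_train:n_train + n_valid])
    test_q = set(qids[n_train + n_valid:])
    pick = lambda qs: [t for t in relations if t[1] in qs]
    return pick(train_q), pick(valid_q), pick(test_q)
```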
Thank you @faneshion, here is the function that I added to the preparation.py file:
```python
import random

def split_n_cross_validation_for_ranking(relations, ratio):
    """Split (label, qid, docid) relations into `ratio` folds by query id."""
    def select_rel_by_qids(qids):
        qids = set(qids)
        rels = []
        for r, q, d in relations:
            if q in qids:
                rels.append((r, q, d))
        return rels

    # Collect the unique query ids and shuffle them.
    qid_group = set()
    for r, q, d in relations:
        qid_group.add(q)
    qid_group = list(qid_group)
    random.shuffle(qid_group)

    total_rel = len(qid_group)
    folds = {}
    num_valid_test = int(total_rel * (1 / ratio))  # same size for the test and valid folds
    for i in range(ratio):
        qid_test = qid_group[i * num_valid_test:(i + 1) * num_valid_test]
        qid_valid = qid_group[(i + 1) * num_valid_test:(i + 2) * num_valid_test]
        if i == ratio - 1:
            # The last fold wraps around: its validation slice is the first one.
            qid_valid = qid_group[:num_valid_test]
        qid_train = list(set(qid_group) - set(qid_test).union(qid_valid))
        rel_train = select_rel_by_qids(qid_train)
        rel_valid = select_rel_by_qids(qid_valid)
        rel_test = select_rel_by_qids(qid_test)
        folds[i] = (rel_test, rel_valid, rel_train)
    return folds
```
The function returns a dictionary of `ratio` folds.
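For illustration, here is how the function could be called on a toy `relations` list (the triples below are made up):

```python
# Toy (label, query id, doc id) relations: 10 queries with 3 docs each.
relations = [(1, "q%d" % q, "d%d_%d" % (q, d)) for q in range(10) for d in range(3)]

folds = split_n_cross_validation_for_ranking(relations, ratio=5)
for i, (rel_test, rel_valid, rel_train) in folds.items():
    test_qids = {q for _, q, _ in rel_test}
    print("fold %d: %d test queries, %d train relations" % (i, len(test_qids), len(rel_train)))
```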
Now, I would like to make MatchZoo perform evaluation on the union of the different test folds. Could you please give me some tips?
Thanks
Why not use the sklearn API with Keras?
I don't know how to use it with the train/test/valid files required by MatchZoo. How can I manage this?
@thiziri Something like:
```python
from sklearn.model_selection import StratifiedKFold  # generates the index splits for k-fold CV

X = ...  # your training data
y = ...  # your labels
model = ...  # your compiled model

# 10-fold CV
seed = 1
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

# train, test are the indices of your kth fold's data
for train, test in k_fold.split(X, y):
    model.fit(X[train], y[train], epochs=10, batch_size=64, verbose=1)
    print(model.evaluate(X[test], y[test]))
```
To integrate with MatchZoo, I would not use `main.py`, but directly import the model from the `models` folder and then customize the inputs. For instance:
```python
from matchzoo.models import DSSM

dssm = DSSM(conf)
model = dssm.build()
# Compile the model and use k-fold CV.
# You need to read the code in main.py, i.e. how to create conf and compile the model.
```
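Combining the two snippets, a hedged sketch, assuming the built DSSM behaves as a plain Keras model (the optimizer/loss choice and the `X`, `y` placeholders are assumptions, and `conf` still has to be created the way main.py does):

```python
from sklearn.model_selection import StratifiedKFold
from matchzoo.models import DSSM

conf = ...  # hypothetical: create it the way main.py does
X = ...     # model inputs, preprocessed as in main.py
y = ...     # labels

k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train, test in k_fold.split(X, y):
    # Rebuild per fold so weights don't leak from one fold to the next.
    model = DSSM(conf).build()
    model.compile(optimizer="adam", loss="binary_crossentropy")  # assumed settings
    model.fit(X[train], y[train], epochs=10, batch_size=64, verbose=1)
    print(model.evaluate(X[test], y[test]))
```

Rebuilding the model inside the loop matters: reusing one fitted model across folds would let earlier folds' training leak into later evaluations.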
Thanks @bwanglzu, it's helpful.
You're welcome. I haven't tried it out yet, so it's just pseudocode. Personally, I believe n-fold cross-validation is a legacy of statistical learning; for deep learning (say, neural networks), a train/test split is enough.
Yeah, but when we don't have enough data for training, it's better to split the data into n folds. Also, to enable comparison with other models, I have to follow the experimental process from the literature.
@thiziri Did it work?
Hi, in fact I didn't use your pseudocode. I split my data into n folds in advance, ran MatchZoo n times (once per test fold), and then put all the predictions together for evaluation with the trec_eval tool. Is that correct?
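For reference, a hedged sketch of that merge-then-evaluate step (the fold file names are hypothetical, and it assumes the per-fold predictions are already in TREC run format):

```python
import subprocess

# Concatenate the per-fold run files into one run that covers all queries.
fold_runs = ["predict.test.fold%d.txt" % i for i in range(5)]  # hypothetical names
with open("predict.all_folds.txt", "w") as merged:
    for path in fold_runs:
        with open(path) as f:
            merged.write(f.read())

# Score the merged run against the full qrels with trec_eval.
subprocess.run(["trec_eval", "qrels.all.txt", "predict.all_folds.txt"], check=True)
```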
@thiziri Yes, apparently that's basically the same idea. Good to know it has been solved; I'll close this issue for now :)
Given a pack object, what is the best way to split it into train/validation/test sets?
@matthew-z
```python
num_train = int(len(data_pack) * 0.8)  # 80% of the pack for training
data_pack.shuffle(inplace=True)        # shuffle in place before slicing
train_slice = data_pack[:num_train]
test_slice = data_pack[num_train:]
```
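Since the question also asks for a validation set, the same slicing pattern extends to three slices (the 8:1:1 ratio here is just an example):

```python
data_pack.shuffle(inplace=True)
num_train = int(len(data_pack) * 0.8)
num_valid = int(len(data_pack) * 0.1)
train_slice = data_pack[:num_train]
valid_slice = data_pack[num_train:num_train + num_valid]
test_slice = data_pack[num_train + num_valid:]
```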