NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

n-fold cross-validation with MatchZoo #79

Closed: thiziri closed this issue 6 years ago

thiziri commented 6 years ago

MatchZoo splits the data into train, test, and valid sets; hence the test data covers only a subset of queries (with their corresponding documents) for the ranking task. I would like to perform 5-fold cross-validation with MatchZoo so that every query appears in a test file. Is this possible? Could you please give me some pointers? Thanks in advance.

faneshion commented 6 years ago

For datasets with a large number of samples, we split the data into train, validation, and test sets, usually with an 8:1:1 ratio. If you have a small dataset and want to perform 5-fold cross-validation, you can split the dataset into 5 folds; the function split_train_valid_test in matchzoo/inputs/preparation.py can help you.
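
For reference, the 8:1:1 idea can also be done by hand on a list of (label, qid, docid) relations. This is only a minimal sketch; the helper name below is made up, and it is not the actual signature of split_train_valid_test:

import random

def split_relations_8_1_1(relations):
    # shuffle, then cut the relation list at the 80% and 90% boundaries
    relations = list(relations)
    random.shuffle(relations)
    n = len(relations)
    n_train, n_valid = int(n * 0.8), int(n * 0.1)
    train = relations[:n_train]
    valid = relations[n_train:n_train + n_valid]
    test = relations[n_train + n_valid:]
    return train, valid, test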

thiziri commented 6 years ago

Thank you @faneshion, here is the function that I added to the preparation.py file:

import random

def split_n_cross_validation_for_ranking(relations, ratio):
    """Split (label, qid, docid) relations into `ratio` folds by query id.

    Returns a dict mapping fold index to a (test, valid, train) triple of
    relation lists, so that every query appears in exactly one test fold.
    """
    def select_rel_by_qids(qids):
        qids = set(qids)
        return [(r, q, d) for r, q, d in relations if q in qids]

    # collect the unique query ids and shuffle them
    qid_group = list({q for _, q, _ in relations})
    random.shuffle(qid_group)

    total_qids = len(qid_group)
    fold_size = total_qids // ratio  # test and valid folds have the same size
    folds = {}
    for i in range(ratio):
        qid_test = qid_group[i * fold_size:(i + 1) * fold_size]
        qid_valid = qid_group[(i + 1) * fold_size:(i + 2) * fold_size]
        if i == ratio - 1:
            # last fold: take all remaining queries as test, so that no query
            # is dropped when len(qid_group) is not divisible by ratio, and
            # wrap the validation fold around to the start of the list
            qid_test = qid_group[i * fold_size:]
            qid_valid = qid_group[:fold_size]
        qid_train = list(set(qid_group) - set(qid_test) - set(qid_valid))
        folds[i] = (select_rel_by_qids(qid_test),
                    select_rel_by_qids(qid_valid),
                    select_rel_by_qids(qid_train))
    return folds
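
For illustration, here is a quick sanity check on made-up (label, qid, docid) tuples; with 5 folds, each query should land in exactly one test fold:

relations = [(1, 'q1', 'd1'), (0, 'q1', 'd2'), (1, 'q2', 'd3'),
             (0, 'q3', 'd4'), (1, 'q4', 'd5'), (0, 'q5', 'd6')]
folds = split_n_cross_validation_for_ranking(relations, 5)
for i, (rel_test, rel_valid, rel_train) in folds.items():
    print(i, sorted(q for _, q, _ in rel_test))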

The function returns a dictionary of ratio folds. Now, I would like MatchZoo to perform the evaluation on the union of the different test folds. Could you please give me some tips? Thanks

bwanglzu commented 6 years ago

Why not use sklearn's cross-validation API with your Keras model?

thiziri commented 6 years ago

I don't know how to use it with the train/test/valid files required by MatchZoo. How can I manage this?

bwanglzu commented 6 years ago

@thiziri Something like:

from sklearn.model_selection import StratifiedKFold  # generates index splits for k-fold CV

X = ...  # your training data
y = ...  # your labels
model = ...  # your compiled model

# 10-fold CV
seed = 1
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
# train, test are the index arrays of the k-th fold's data
for train, test in k_fold.split(X, y):
    model.fit(X[train], y[train], epochs=10, batch_size=64, verbose=1)
    print(model.evaluate(X[test], y[test]))
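
One caveat for the ranking setup discussed above: StratifiedKFold stratifies by label, so pairs belonging to the same query can end up in different folds. If each query's pairs should stay together, as in the split_n_cross_validation_for_ranking function above, sklearn's GroupKFold can split by query id instead. A small self-contained sketch with toy data:

import numpy as np
from sklearn.model_selection import GroupKFold

# toy data: X holds pair features, y the labels, groups the query id of each pair
X = np.arange(12).reshape(6, 2)
y = np.array([1, 0, 1, 0, 1, 0])
groups = np.array(['q1', 'q1', 'q2', 'q2', 'q3', 'q3'])

k_fold = GroupKFold(n_splits=3)
for train, test in k_fold.split(X, y, groups):
    print(sorted(set(groups[test])))  # each query appears in exactly one test fold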

To integrate with MatchZoo, I would not use main.py, but would directly import the model from the models folder and then customize the inputs. For instance:

from matchzoo.models import DSSM

dssm = DSSM(conf)
model = dssm.build()
# compile the model and use k-fold CV;
# read the code in main.py to see how conf is created and how the model is compiled

thiziri commented 6 years ago

Thanks @bwanglzu, that's helpful.

bwanglzu commented 6 years ago

You're welcome. I haven't tried it out yet, so it's just pseudocode. Personally, I believe n-fold cross-validation is a legacy of statistical learning; for deep learning (say, neural networks), a train/test split is usually enough.

thiziri commented 6 years ago

Yeah, but when we don't have enough data for training it's better to split the data into n folds. Also, to compare my approach with other models, I have to follow the experimental process used in the literature.

bwanglzu commented 6 years ago

@thiziri Did it work?

thiziri commented 6 years ago

Hi, in fact I didn't use your pseudocode. I split my data into n folds in advance, ran MatchZoo n times (once per test fold), and then put all the predictions together for evaluation with the trec_eval tool. Is that correct?
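
For example, the per-fold run files (file names below are made up) can be concatenated into a single TREC run file; running trec_eval qrels.txt predict.all.txt on the result then scores the union of the test folds:

# merge the per-fold TREC run files into one run file for trec_eval
# (each line: "qid Q0 docid rank score run_tag")
fold_files = ['predict.fold%d.txt' % i for i in range(1, 6)]
with open('predict.all.txt', 'w') as merged:
    for path in fold_files:
        with open(path) as fold_file:
            merged.write(fold_file.read())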

bwanglzu commented 6 years ago

@thiziri Yes, that's basically the same idea. Good to know it has been solved; I'll close this issue for now :)

matthew-z commented 5 years ago

Given a DataPack object, what is the best way to split it into train/validation/test sets?

uduse commented 5 years ago

@matthew-z

# shuffle the whole DataPack in place, then slice off the first 80% for training
num_train = int(len(data_pack) * 0.8)
data_pack.shuffle(inplace=True)
train_slice = data_pack[:num_train]
test_slice = data_pack[num_train:]
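
For a three-way split as asked above, the same slicing idiom extends naturally (the 8:1:1 ratio is just an example):

# shuffle once, then slice the DataPack into 80% train, 10% valid, 10% test
data_pack.shuffle(inplace=True)
n = len(data_pack)
num_train, num_valid = int(n * 0.8), int(n * 0.1)
train_slice = data_pack[:num_train]
valid_slice = data_pack[num_train:num_train + num_valid]
test_slice = data_pack[num_train + num_valid:]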