The input data format of corpus_preprocessed.txt

NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.

Apache License 2.0

The main function in preprocess.py is just an example showing how to use the Preprocess class. For each dataset, preprocessing is done in the MatchZoo/data directory. For example, MatchZoo/data/WikiQA/prepare_mz_data.py records each document's length when it writes corpus_preprocessed.txt, as follows:

    # One line per document: <doc id> <document length> <space-separated tokens>
    fout = open(dstdir + 'corpus_preprocessed.txt', 'w')
    for inum, did in enumerate(dids):
        fout.write('%s %s %s\n' % (did, len(docs[inum]), ' '.join(map(str, docs[inum]))))
    fout.close()
    print('Preprocess finished ...')
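
To make the resulting line layout concrete, here is a minimal sketch of reading such a line back. It assumes only what the snippet above shows: a document id, the document length, then the tokens, all separated by single spaces. The helper name `parse_corpus_line` and the sample ids and tokens are hypothetical and are not part of MatchZoo.

```python
def parse_corpus_line(line):
    """Split one corpus_preprocessed.txt line into (doc_id, length, tokens).

    Hypothetical helper for illustration; not part of MatchZoo.
    """
    fields = line.split()
    if len(fields) < 2:
        raise ValueError('expected at least a doc id and a length: %r' % line)
    doc_id, length, tokens = fields[0], int(fields[1]), fields[2:]
    # The second field should equal the number of tokens that follow it.
    if length != len(tokens):
        raise ValueError('length field %d does not match %d tokens'
                         % (length, len(tokens)))
    return doc_id, length, tokens


# Made-up sample data, formatted the same way as in the snippet above.
dids = ['D0', 'D1']
docs = [[12, 7, 305], []]
lines = ['%s %s %s\n' % (did, len(doc), ' '.join(map(str, doc)))
         for did, doc in zip(dids, docs)]
parsed = [parse_corpus_line(line) for line in lines]
```

Checking the length field against the token count is a design choice in this sketch: it catches lines that were truncated or written in a different layout.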

The input data format of corpus_preprocessed.txt #258