NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

The input data format of corpus_preprocessed.txt #258

EVASHINJI closed this issue 6 years ago

EVASHINJI commented 6 years ago

In MatchZoo/data/toy_example/readme.md, corpus_preprocessed.txt is described as follows: each line corresponds to a document. The first column is the document id. The second column is the document length, followed by the ids of the words in this document.

But in MatchZoo/matchzoo/inputs/preprocess.py, line 510:

    fout = open(basedir + 'corpus_preprocessed.txt', 'w')
    for inum, did in enumerate(dids):
        fout.write('%s\t%s\n' % (did, ' '.join(map(str, docs[inum]))))
    fout.close()

preprocess.py saves corpus_preprocessed.txt without the document length.

So is there a mistake in preprocess.py?
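
For reference, here is a minimal sketch of reading one line in the format the readme describes (document id, document length, then word ids). It assumes whitespace-separated fields and integer word ids. The name `parse_corpus_line` is hypothetical and not part of MatchZoo. A line written without the length column would usually fail the length check, which is the mismatch this issue is about.

```python
def parse_corpus_line(line):
    """Parse 'did length w1 w2 ...' (hypothetical helper, not MatchZoo API)."""
    fields = line.split()
    did = fields[0]
    length = int(fields[1])          # second column: document length
    # Assumption: word ids are integers.
    word_ids = [int(w) for w in fields[2:]]
    if length != len(word_ids):
        raise ValueError(
            'length column %d does not match %d word ids' % (length, len(word_ids))
        )
    return did, word_ids
```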

faneshion commented 6 years ago

The main function in preprocess.py is just an example showing how to use the Preprocess class. For each dataset, the preprocessing is done in the MatchZoo/data directory. For example, in MatchZoo/data/WikiQA/prepare_mz_data.py the length is recorded as follows:

    fout = open(dstdir + 'corpus_preprocessed.txt', 'w')
    for inum, did in enumerate(dids):
        fout.write('%s %s %s\n' % (did, len(docs[inum]), ' '.join(map(str, docs[inum]))))
    fout.close()
    print('Preprocess finished ...')
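
The same write step could also be sketched with a context manager, so the file is closed even if an error occurs. This is only an illustration of the line format shown above. The names `format_corpus_line` and `write_corpus` are hypothetical and do not exist in MatchZoo.

```python
def format_corpus_line(did, word_ids):
    """Build one 'did length w1 w2 ...' line (hypothetical helper)."""
    return '%s %s %s\n' % (did, len(word_ids), ' '.join(map(str, word_ids)))


def write_corpus(path, dids, docs):
    """Write one line per document, including the length column."""
    with open(path, 'w') as fout:
        for did, doc in zip(dids, docs):
            fout.write(format_corpus_line(did, doc))
```

For example, `format_corpus_line('D1', [5, 9, 2])` yields `'D1 3 5 9 2\n'`.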