fwd4 / dssm


How is your train_data for dssm organized? Or: what is the data format? #2

Open chouchou1988 opened 7 years ago

chouchou1988 commented 7 years ago

I have seen your demo dssm/single/dssm_v3.py and want to know how your data is organized. For example, what is the format of query.train.pickle?

RominYue commented 7 years ago

Yeah, I also wonder what the data format is.

BinQuake commented 7 years ago

Same question here

RominYue commented 7 years ago

In the author's post http://liaha.github.io/models/2016/06/21/dssm-on-tensorflow.html, he says that the model input is 46238:1 24108:1 24016:1 5618:1 8818:1, which stands for tri-letter index: num_occur. But I am confused: do we have to pre-process all the tri-letters to build their indexes? That seems time-consuming.
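In case it helps, here is a rough sketch (my own guess, not the author's preprocessing code) of how such a tri-letter index:count representation could be built with plain Python word hashing. The corpus and queries are made up for illustration, and the vocabulary only needs to be built once over the training data.

```python
from collections import Counter

def letter_trigrams(text):
    """Split text into letter trigrams with word boundary markers,
    e.g. 'cat' -> ['#ca', 'cat', 'at#']."""
    grams = []
    for word in text.lower().split():
        padded = '#' + word + '#'
        grams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

# Build the trigram -> index vocabulary once over the whole corpus.
corpus = ["deep structured semantic model", "semantic search query"]
vocab = {g: i for i, g in enumerate(sorted({g for q in corpus for g in letter_trigrams(q)}))}

def encode(text):
    """Return one query as {trigram_index: num_occur}."""
    counts = Counter(letter_trigrams(text))
    return {vocab[g]: c for g, c in counts.items() if g in vocab}

print(encode("semantic model"))
```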

sehoi commented 7 years ago

https://github.com/liaha/dssm/blob/master/single/dssm_v3.py

I think the pull_batch function accepts already pre-processed data as input (query/title -> n-gram vector -> one-hot encoded vector). The input is simply a one-hot encoded vector like [[1, 0, 1, ..., 0], ..., [1, 0, 1, ..., 0]].

So you need to convert the query and document title to one-hot encoded vectors before feeding them to the tensor.

In my case, I used scikit-learn for it. http://scikit-learn.org/stable/modules/feature_extraction.html
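Roughly like this, with character trigrams and binary counts (a sketch, not the exact code I ran; the example queries and titles are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

queries = ["how to train dssm", "tensorflow dssm example"]
titles = ["training a dssm model", "dssm on tensorflow"]

# Fit one shared trigram vocabulary so queries and titles live in the same feature space.
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3), binary=True)
vectorizer.fit(queries + titles)

query_vecs = vectorizer.transform(queries)  # scipy.sparse CSR matrix
title_vecs = vectorizer.transform(titles)
print(query_vecs.shape, title_vecs.shape)
```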

BinQuake commented 7 years ago

Actually, after reading the code, I think the original file format doesn't matter. If you convert your tri-gram data into a sparse matrix, it should be fine. Just change the lines that handle the input data in the code; the training model works on matrices anyway.
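For example (just a sketch of what I mean, not the repo's actual pull_batch), once the data is a scipy sparse matrix you can slice out a batch and convert it to the (indices, values, shape) triple that a TF1 sparse placeholder takes:

```python
import numpy as np
import scipy.sparse as sp

def pull_batch_sketch(data, batch_idx, batch_size):
    """Slice one batch out of a scipy sparse matrix and convert it to
    the (indices, values, shape) triple for a tf.sparse_placeholder."""
    batch = data[batch_idx * batch_size:(batch_idx + 1) * batch_size].tocoo()
    indices = np.stack([batch.row, batch.col], axis=1)
    values = batch.data.astype(np.float32)
    shape = np.array(batch.shape, dtype=np.int64)
    return indices, values, shape

# Example: 1000 "queries" over a 50k tri-gram vocabulary, 1% density.
data = sp.random(1000, 50000, density=0.01, format='csr')
indices, values, shape = pull_batch_sketch(data, batch_idx=0, batch_size=128)
```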

zhongyunuestc commented 7 years ago

Yeah, I also have a question about what the data format looks like.

RobbLang commented 5 years ago

I am saving my data as a matrix, where the rows are the query/document sentences and the columns are the vocabulary. I put a '1' wherever a word in the sentence matches a word in the vocab.

For example, if my query file contains "this is a cat. hello cat" and my vocab comprises "this, is, a, cat, hello", then my query matrix looks like:

1, 1, 1, 1, 0
0, 0, 0, 1, 1

I am creating a sparse matrix out of it using scipy.sparse.csr_matrix().

Am I doing this right?
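For concreteness, this is a minimal sketch of what I am doing, using the toy example above (built densely first and then converted, so not how I would handle a large vocabulary):

```python
import numpy as np
from scipy.sparse import csr_matrix

vocab = ["this", "is", "a", "cat", "hello"]
sentences = ["this is a cat", "hello cat"]

# Binary term-document matrix: rows are sentences, columns are vocab words.
dense = np.zeros((len(sentences), len(vocab)), dtype=np.float32)
for i, sent in enumerate(sentences):
    for word in sent.split():
        if word in vocab:
            dense[i, vocab.index(word)] = 1.0

query_matrix = csr_matrix(dense)
print(query_matrix.toarray())
# [[1. 1. 1. 1. 0.]
#  [0. 0. 0. 1. 1.]]
```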