Open chouchou1988 opened 7 years ago
Yeah, I also wonder what the data format is......
Same question here
In the author's post http://liaha.github.io/models/2016/06/21/dssm-on-tensorflow.html , he says that the model input is 46238:1 24108:1 24016:1 5618:1 8818:1
, which stands for tri-letter index: num_occur.
But I'm confused: do we have to pre-process all the tri-letters to build their indexes first? That seems time-consuming......
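For what it's worth, the index building only has to happen once over the corpus. A minimal sketch of what that pre-processing might look like (the `#` boundary mark and the helper names are my own assumptions, not from the author's code):

```python
from collections import Counter

def tri_letters(word):
    # Pad with '#' boundary marks, then slide a 3-char window:
    # "cat" -> "#cat#" -> ["#ca", "cat", "at#"]
    w = '#' + word + '#'
    return [w[i:i + 3] for i in range(len(w) - 2)]

# Build the tri-letter -> index mapping once over the whole corpus.
corpus = ["hello cat", "this is a cat"]
vocab = {}
for sent in corpus:
    for word in sent.split():
        for tri in tri_letters(word):
            vocab.setdefault(tri, len(vocab))

# Encode one query as "index:count" pairs, like the format in the post.
counts = Counter(t for w in "hello cat".split() for t in tri_letters(w))
encoded = ' '.join('%d:%d' % (vocab[t], c) for t, c in counts.items())
print(encoded)
```

The tri-letter vocabulary is small (bounded by the number of distinct 3-letter combinations), so building it is cheap compared to training.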
https://github.com/liaha/dssm/blob/master/single/dssm_v3.py
I think the pull_batch function accepts already pre-processed data (query/title -> n-gram vector -> one-hot encoded vector) as input. The input is simply a one-hot encoded matrix like [[1, 0, 1, ...., 0], ..., [1, 0, 1, ...., 0]].
So you need to convert each query and document title to a one-hot encoded vector before feeding it to the tensor.
In my case, I used scikit-learn for it. http://scikit-learn.org/stable/modules/feature_extraction.html
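As a rough sketch of the scikit-learn route (my own example, assuming letter trigrams and binary features; the exact analyzer settings are an assumption, not the author's):

```python
from sklearn.feature_extraction.text import CountVectorizer

# 'char_wb' builds character n-grams padded with spaces at word boundaries,
# so "cat" yields " ca", "cat", "at " -- similar to DSSM's word hashing.
# binary=True gives 1/0 features instead of counts.
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3), binary=True)

queries = ["hello cat", "this is a cat"]
X = vectorizer.fit_transform(queries)  # scipy sparse matrix, shape (2, vocab_size)
print(X.shape)
```

fit_transform already returns a scipy sparse matrix, so it can be sliced into batches and fed to the model directly.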
Actually, after reading the code, I think the original file format doesn't matter. If you convert your tri-gram data into a sparse matrix, it should be fine; just change the lines that handle input data in the code. The training model operates on matrices anyway.
Yeah, I also have a question about what the data format is like.......
I am saving my data as a matrix where the rows are the query/document sentences and the columns are the vocabulary. I put a '1' wherever a word in the sentence matches a word in the vocab.
For example, if my query file contains "this is a cat. hello cat" and my vocab comprises "this, is, a, cat, hello", then my query matrix is:
1, 1, 1, 1, 0
0, 0, 0, 1, 1
I am creating a sparse matrix out of it using scipy.sparse.csr_matrix().
Am I doing this right?
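That looks right to me. Here is the same example spelled out end to end (a sketch of what you described, using your vocab and sentences):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Rows = sentences, columns = vocab, 1 where the word occurs.
vocab = ["this", "is", "a", "cat", "hello"]
sentences = ["this is a cat", "hello cat"]

dense = np.zeros((len(sentences), len(vocab)), dtype=np.float32)
for i, sent in enumerate(sentences):
    for word in sent.split():
        if word in vocab:
            dense[i, vocab.index(word)] = 1.0

sparse = csr_matrix(dense)
print(dense)
# [[1. 1. 1. 1. 0.]
#  [0. 0. 0. 1. 1.]]
```

Note this is binary presence, not counts, so the second "cat" in "hello cat" still contributes a single 1; if you want occurrence counts (like the num_occur in the author's format), use += 1.0 instead.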
I have seen your demo dssm/single/dssm_v3.py, and I want to know how your data is organized. For example, what is the format of query.train.pickle?