Closed — denisb411 closed this issue 5 years ago
@denisb411 take a look at this image, you'll get the point:
Basically:
@bwanglzu Thanks, but I didn't get it. What are x and y supposed to be?
@denisb411 It's a bit tricky to explain machine learning for information retrieval in short, but x usually refers to the training data, i.e. (query, document) pairs, and y usually refers to the labels (the relevance degrees annotated by human beings).
Given x -> (queries, documents) and y -> labels, we usually employ a machine learning (or deep learning) algorithm to fit a model that minimizes the loss. To this end, we get an optimized trained model. At the very least, it should work well on the training data (and we believe it should also work well on unseen/new data).
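The setup above can be sketched as a tiny pointwise learning-to-rank example. This is illustrative only, not MatchZoo code: the features, model (logistic regression on term overlap), and training loop are all stand-ins I chose for the sketch.

```python
import numpy as np

# Illustrative pointwise learning-to-rank sketch, NOT MatchZoo code.
# x: one feature vector per (query, document) pair; y: human relevance labels.
def overlap_features(query, document):
    """Toy features for a (query, document) pair: term overlap and doc length."""
    q, d = set(query.split()), set(document.split())
    return np.array([len(q & d) / max(len(q), 1), len(d) / 10.0])

def train(pairs, labels, lr=0.5, epochs=200):
    """Fit logistic-regression weights by minimizing log loss on (x, y)."""
    X = np.array([overlap_features(q, d) for q, d in pairs])
    y = np.array(labels, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted relevance
        w -= lr * X.T @ (p - y) / len(y)    # gradient of the log loss
    return w

pairs = [("cat food", "best cat food brands"),      # relevant pair
         ("cat food", "weather report for monday")] # irrelevant pair
labels = [1, 0]                                     # human-annotated y
w = train(pairs, labels)
scores = [1.0 / (1.0 + np.exp(-overlap_features(q, d) @ w)) for q, d in pairs]
```

After training, the model assigns the relevant pair a higher score than the irrelevant one, which is all "fitting a model to minimize the loss" buys you here.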
From the engineering perspective, once we believe the model is "optimized", we persist the model and integrate it into the search engine. When a user submits a new query, we use the model to predict the similarity score between the query and all documents.
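The persist-then-serve step might look like the following sketch. The model here is a hypothetical stand-in (a term-overlap scorer), and pickling is just one possible persistence choice; real deployments would use the framework's own save/load mechanism.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model; real systems would persist the
# actual trained network, not this toy scorer.
class OverlapModel:
    def score(self, query, document):
        q, d = set(query.split()), set(document.split())
        return len(q & d) / max(len(q), 1)

# Persist the "optimized" model once training is done.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(OverlapModel(), f)

# At query time, inside the search engine: load the persisted model and
# score the user's new query against every document.
with open(path, "rb") as f:
    model = pickle.load(f)
docs = ["learning to rank for search", "cooking pasta at home"]
ranked = sorted(docs, key=lambda d: model.score("learning to rank", d),
                reverse=True)
```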
However, to my knowledge, neural-network-based trained models are not ready for production yet. Microsoft has already integrated a statistical machine learning based (learning-to-rank) model into Bing. (I don't know the case for Google, but Google usually works well because they have a huge amount of user feedback.)
If you're interested, you can take a look at the online course Text Retrieval and Search Engines to get a deeper understanding of this.
@bwanglzu Thanks a lot for your explanation. It was just a little bit hard for me that it covers a supervised scenario. I thought that we could train a model with just the documents, something like x, y = document, document, and later on predict with x, y = document, query. Don't know if it works though.
- For each new query, do pre-processing.
- Use the trained model to predict the similarity score between the pre-processed query and each document, resulting in a score list.
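The two steps above can be sketched as follows. The `preprocess` and `predict` functions are stand-ins I made up for illustration (simple tokenization and token overlap); a real MatchZoo model would replace `predict`.

```python
# Hypothetical sketch of the two steps; `predict` is a stand-in for the
# trained model's similarity function, not a real MatchZoo call.
def preprocess(text):
    # Step 1: pre-process the raw text (here: lowercase + whitespace split).
    return text.lower().split()

def predict(query_tokens, doc_tokens):
    # Stand-in similarity score: fraction of query tokens found in the doc.
    return len(set(query_tokens) & set(doc_tokens)) / max(len(query_tokens), 1)

corpus = {"D1": "Deep Relevance Matching Model for ad-hoc retrieval",
          "D2": "A recipe for tomato soup"}
query = preprocess("relevance matching model")

# Step 2: score the query against every document -> a (doc_id, score) list.
score_list = [(doc_id, predict(query, preprocess(text)))
              for doc_id, text in corpus.items()]
score_list.sort(key=lambda pair: pair[1], reverse=True)
```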
@bwanglzu But prediction relies on a relation file? I just tried DRMM_Listgenerator to predict a new query. Suppose I have preprocessed the query. Now, how do I get past the relation IDs in the batch generator below?
```python
def get_batch(self):
    while self.point < self.num_list:
        currbatch = []
        if self.point + self.batch_list <= self.num_list:
            currbatch = self.list_list[self.point: self.point + self.batch_list]
            self.point += self.batch_list
        else:
            currbatch = self.list_list[self.point:]
            self.point = self.num_list
        # total number of (query, doc) pairs in this batch
        bsize = sum([len(pt[1]) for pt in currbatch])
        list_count = [0]
        ID_pairs = []
        X1 = np.zeros((bsize, self.data1_maxlen), dtype=np.int32)
        X1_len = np.zeros((bsize,), dtype=np.int32)
        X2 = np.zeros((bsize, self.data1_maxlen, self.hist_size), dtype=np.float32)
        X2_len = np.zeros((bsize,), dtype=np.int32)
        Y = np.zeros((bsize,), dtype=np.int32)
        X1[:] = self.fill_word
        j = 0
        for pt in currbatch:
            # print(pt)  # debug
            d1, d2_list = pt[0], pt[1]
            print(d1)          # debug
            print(self.data1)  # debug
            d1_cont = list(self.data1[d1])
            d1_len = min(self.data1_maxlen, len(d1_cont))
            list_count.append(list_count[-1] + len(d2_list))
            for l, d2 in d2_list:
                X1[j, :d1_len], X1_len[j] = d1_cont[:d1_len], d1_len
                d2_cont = list(self.data2[d2])
                d2_len = len(d2_cont)
                X2[j], X2_len[j] = self.cal_hist(d1, d2, self.data1_maxlen, self.hist_size), d2_len
                ID_pairs.append((d1, d2))
                Y[j] = l
                j += 1
        yield X1, X1_len, X2, X2_len, Y, ID_pairs, list_count
```
@datistiquo Are you using matchzoo 1.0?
yes
I thought such predictions are handled by MatchZoo? But I think you need to customize it so that you bypass the doc IDs and manually add word IDs from the word dict to each word in the query. But then the question is: can you easily feed the score matrix for each query with each doc?
@datistiquo Yes, the predictions are handled by MatchZoo; you can consider MatchZoo a research platform for comparing which algorithm has the best performance for text matching.
I recommend you try out the latest version of MatchZoo (2.1); it takes over all the preprocessing tasks and makes it much easier for you to build your model.
> Yes, the predictions are handled with matchzoo,
I don't know if you understand, or maybe I have misunderstood. If it were easy, there wouldn't be so many issues here? Predictions do not come out of the box. The list generator relies on a relation file. Same in v2. How should I format my new query, since I need the structure of left and right text in v2...
There is no guidance on the structure format for new predictions! So I struggle reading the code to customize it. But you say it is out of the box?
Out of the box would be:
`model.predict(new_query.txt)`
with the new query.
It would be very nice if you could write just a word about the way I need to do it. I am in a bubble right now, because I get no clear answer.
The format requires a relation file with the query together with positive and negative documents. But how should I know that for a new query? You see my problem? And that is the same in v2, as I saw.
- Use the trained model to predict the similarity score between the pre-processed query and each document, resulting in a score list.
- Sort the score list.
To do this I need a format like `('T34', [(1, 'T35'), (0, 'T37'), (0, 'T36')])`. So you need to hack around the doc IDs somehow.
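One workaround for that structure, since a new query has no human labels: register the query under a fresh ID and pair it with every candidate document using dummy labels. This is an assumption on my part, not an official MatchZoo API; the dummy labels are ignored at prediction time, only the predicted scores matter.

```python
# Build the (query_id, [(label, doc_id), ...]) structure the list generator
# expects, with placeholder labels for a new, unlabeled query.
# `make_relation` is a hypothetical helper, not part of MatchZoo.
def make_relation(query_id, doc_ids, dummy_label=0):
    return (query_id, [(dummy_label, doc_id) for doc_id in doc_ids])

relation = make_relation("Q_new", ["T35", "T36", "T37"])
```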
Could you please sketch how I would do your points 4 and 5 with MatchZoo?
@bwanglzu What I thought I need to do is the following:
For a single new query, I manually create a file where each line has the same query with its word IDs together with two docs (each also with its word IDs). The lines with the same query and two documents are filled simply from the beginning of my docs, in steps of two. In this way I would fill the curr_batch from the code above, customizing it to get around the doc IDs?
So, is this the way to do new predictions, and is it how you meant to create the score list?
This is very hacky, and you need to adapt it for every model you test.
Is this like the way you described in your answer above?
@datistiquo I have no clue what a relation file is or what the IDs you mentioned above are; can you answer, @faneshion @pl8787 @yangliuy ?
@bwanglzu Seriously? 😄 I am sorry, but I thought you knew more than me...
OK, then maybe you can explain the point of creating the score list like you did (I assume in v2)? Just a few words so that I can find out myself. Because what this is all about, of course, is a new query with no ID...
@datistiquo see #684
@datistiquo :) I didn't participate in the development of 1.0, that's the trick...
@datistiquo I think your question may be about how to generate the relation files for a new query. First, you can refer to the script and related functions for generating the relation files here. Then you should understand the differences between the training process and the prediction/testing process. I think what you mean by "new query" is a query in the testing process, i.e. one of the queries in the testing data. Thus the relation files and IDs for them can be generated with the same functions here. Finally, the text IDs in v1 are automatically generated with the help of a hashmap, to make sure different texts with the same content get the same IDs. But you can modify the code to fit your requirement. For example, you can reuse the text IDs in your raw data. Steps 4 and 5 can be done with just the textual content of the queries and candidate documents; text IDs just help process the data in a more convenient way.
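The hashmap-based ID assignment described above might look like this sketch. It is my reading of the described behaviour, not the actual MatchZoo v1 code: identical texts share one ID, and IDs are handed out in order of first appearance.

```python
# Sketch of content-keyed ID assignment (an assumption, not MatchZoo's code):
# the hashmap maps text content -> ID, so two texts with the same content
# always receive the same ID.
def build_ids(texts, prefix="T"):
    ids = {}                                  # content -> ID hashmap
    for text in texts:
        if text not in ids:                   # first appearance: mint an ID
            ids[text] = "%s%d" % (prefix, len(ids))
    return ids

ids = build_ids(["query one", "doc one", "query one"])
```

Note the third input is a duplicate of the first, so only two IDs are minted, which is exactly the "same content, same ID" property described above.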
The statement "x, y = document, document and later on predict with x, y = document, query" may have some problems. Where is the label information? Why do you want to train a ranking model with doc-doc pairs? Are you interested in matching between long texts?
Is the idea of text matching like DRMM to get the most relevant documents for a new query? 😄
What confused me is that prediction is actually equal to testing... and in the config file the prediction phase is tied to the training phase. Prediction is done using text1_corpus and text2_corpus! The intuition would be to put your new data there.
I now have a good trick for doing real prediction. You just use the relation file under prediction to create your batches. You use the relation format where each line has the same new query (with its word ID instead of a doc ID!) and, in the last column, a doc ID taken from your doc ID list. I suppose you can hack something out this way...
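The trick above can be sketched as writing such a prediction relation file. The exact format is an assumption on my part (one "label query_id doc_id" line per pair, as in MatchZoo v1 sample data), and `write_prediction_relations` is a hypothetical helper: repeat the new query's ID on every line, pair it with each existing doc ID, and use a dummy label, since a new query has no real labels.

```python
import io

# Hypothetical helper: emit one "label query_id doc_id" line per candidate
# document (the assumed MatchZoo v1 relation-file layout). The dummy label
# is a placeholder; only the predicted scores matter at prediction time.
def write_prediction_relations(fh, query_id, doc_ids, dummy_label=0):
    for doc_id in doc_ids:
        fh.write("%d %s %s\n" % (dummy_label, query_id, doc_id))

buf = io.StringIO()  # stands in for the relation file on disk
write_prediction_relations(buf, "Q_new", ["D12", "D45", "D78"])
lines = buf.getvalue().splitlines()
```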
I see that the models usually need a text1 and a text2 to perform training and predictions. Usually, in search engines, I just need text2 (the documents) to perform the indexing step (training).
How can I train the model like a search engine? I.e., I don't have the text1 information (query/question) and I want to index my documents.
Does using the same text for text1 and text2 work for training?