NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0
3.84k stars 897 forks

Using a model as a search engine #426

Closed denisb411 closed 5 years ago

denisb411 commented 5 years ago

I see that the models usually need both text1 and text2 to perform training and prediction. On search engines I usually only have text2 (the documents) at indexing time (training).

How can I train the model like a search engine? That is, I don't have the text1 information (query/question) and I just want to index my documents.

Does using the same text for both text1 and text2 work for training?

bwanglzu commented 5 years ago

@denisb411 take a look at this image, you'll get the point:

(image: learning-to-rank)

Basically:

  1. Train your model on training data (the training data consists of query-document-label triples).
  2. Load the trained model.
  3. For each new query, do pre-processing.
  4. Use the trained model to predict a similarity score between the pre-processed query and each document, producing a score list.
  5. Sort the score list.
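The steps above can be sketched in plain Python. This is only a toy stand-in: a word-overlap scorer plays the role of the trained model, and "preprocessing" is simple lowercasing and tokenisation; all names and data here are made up for illustration.

```python
# Toy stand-in for steps 2-5 of the learning-to-rank pipeline above.
documents = {
    "d1": "reset your password from the settings page",
    "d2": "opening hours are monday to friday",
    "d3": "contact support to reset a forgotten password",
}

def preprocess(text):
    """Step 3: pre-process a text into a set of lowercase terms."""
    return set(text.lower().split())

def predict(query_terms, doc):
    """Step 4: stand-in for the trained model's similarity score
    (fraction of query terms that also appear in the document)."""
    doc_terms = preprocess(doc)
    return len(query_terms & doc_terms) / len(query_terms)

query_terms = preprocess("how to reset my password")
scores = {doc_id: predict(query_terms, doc) for doc_id, doc in documents.items()}
# Step 5: sort the score list, highest score first.
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In a real setup `predict` would be the persisted neural model's scoring function; only the surrounding plumbing (preprocess, score every document, sort) stays the same.
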
denisb411 commented 5 years ago

@bwanglzu Thanks, but I didn't get it. What are x and y supposed to be?

bwanglzu commented 5 years ago

@denisb411 It's a bit tricky to explain machine learning for information retrieval briefly, but x usually refers to the training data, i.e. (query, document) pairs, and y usually refers to the labels (the relevance degrees annotated by human assessors).

Given x -> (queries, documents) and y -> labels, we usually employ a machine learning (or deep learning) algorithm to fit a model that minimizes a loss function. In the end, we get an optimized trained model. At minimum, it should work well on the training data (and we hope it also works well on unseen/new data).
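As a toy illustration of this x / y structure (made-up data; the word-overlap "model" below only stands in for a real learned scorer):

```python
# x is a list of (query, document) pairs; y holds the human-annotated labels.
x = [
    ("how to reset password", "go to settings and reset your password"),
    ("how to reset password", "our office is open monday to friday"),
]
y = [1, 0]  # relevance labels: 1 = relevant, 0 = not relevant

def score(query, document):
    """Fraction of query words that also appear in the document."""
    q = set(query.split())
    return len(q & set(document.split())) / len(q)

# A fitted model should give the relevant pair a higher score than the
# irrelevant one; the toy scorer happens to do so on this example.
scores = [score(q, d) for q, d in x]
```

A real learning-to-rank model is fitted so that its scores agree with y across many such triples, rather than being hand-written like this.
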

From an engineering perspective, once we believe the model is "optimized", we persist it and integrate it into the search engine. When a user submits a new query, we use the model to predict a similarity score between the query and each document.

However, to my knowledge, neural-network-based ranking models are not widely used in production yet. Microsoft has already integrated statistical machine-learning (learning-to-rank) models into Bing. (I don't know the situation at Google, but Google usually works well because they have a huge amount of user feedback.)

If you're interested, you can take a look at the online course Text Retrieval and Search Engines to get a deeper understanding of this.

denisb411 commented 5 years ago

@bwanglzu Thanks a lot for your explanation. It was just a little hard for me to see that this covers a supervised scenario. I thought we could train a model with just the documents, something like x, y = document, document, and later on predict with x, y = document, query. I don't know if that works, though.

datistiquo commented 5 years ago
  • For each new query, do pre-processing.
  • Use the trained model to predict the similarity score between pre-processed query and each document, result in a score list.

@bwanglzu But doesn't prediction rely on a relation file? I just tried DRMM_Listgenerator to predict a new query, after preprocessing it. Now, how do I handle the relation IDs in the batch generator below?

    def get_batch(self):
        while self.point < self.num_list:
            # Take the next slice of query lists for this batch.
            if self.point + self.batch_list <= self.num_list:
                currbatch = self.list_list[self.point: self.point + self.batch_list]
                self.point += self.batch_list
            else:
                currbatch = self.list_list[self.point:]
                self.point = self.num_list
            bsize = sum(len(pt[1]) for pt in currbatch)
            list_count = [0]
            ID_pairs = []
            X1 = np.zeros((bsize, self.data1_maxlen), dtype=np.int32)
            X1_len = np.zeros((bsize,), dtype=np.int32)
            X2 = np.zeros((bsize, self.data1_maxlen, self.hist_size), dtype=np.float32)
            X2_len = np.zeros((bsize,), dtype=np.int32)
            Y = np.zeros((bsize,), dtype=np.int32)
            X1[:] = self.fill_word
            j = 0
            for pt in currbatch:
                # pt is (query ID, [(label, doc ID), ...]).
                d1, d2_list = pt[0], pt[1]
                d1_cont = list(self.data1[d1])
                d1_len = min(self.data1_maxlen, len(d1_cont))
                list_count.append(list_count[-1] + len(d2_list))
                for l, d2 in d2_list:
                    X1[j, :d1_len], X1_len[j] = d1_cont[:d1_len], d1_len
                    d2_cont = list(self.data2[d2])
                    d2_len = len(d2_cont)
                    X2[j], X2_len[j] = self.cal_hist(d1, d2, self.data1_maxlen, self.hist_size), d2_len
                    ID_pairs.append((d1, d2))
                    Y[j] = l
                    j += 1
            yield X1, X1_len, X2, X2_len, Y, ID_pairs, list_count
bwanglzu commented 5 years ago

@datistiquo Are you using matchzoo 1.0?

datistiquo commented 5 years ago

yes

datistiquo commented 5 years ago

I thought such predictions were handled by MatchZoo? But I think you need to customize it so that you work around the doc IDs and manually map each word in the query to its word ID from the word dict. But then the question is: can you easily get the score matrix for each query against each doc?

bwanglzu commented 5 years ago

@datistiquo Yes, the predictions are handled with MatchZoo. You can think of MatchZoo as a research platform for comparing which algorithm performs best for text matching.

I recommend you try out the latest version of MatchZoo (2.1); it takes over all the preprocessing tasks and makes it much easier to build your model.

datistiquo commented 5 years ago

Yes, the predictions are handled with matchzoo,

I don't know if you understand, or maybe I have a misunderstanding. If it were easy, there wouldn't be so many issues here. Predictions do not come out of the box: the list generator relies on a relation file, and the same holds in v2. How should I format my new query, given that I need the left/right text structure in v2?

There is no guidance on the structure/format for new predictions, so I'm struggling to read the code in order to customize it. But you say it works out of the box?

Out of the box would be:

model.predict(new_query.txt)

with the new query.

datistiquo commented 5 years ago

It would be very nice if you could write just a few words on how I need to do this. I'm going in circles right now, because I'm not getting a clear answer.

The format requires a relation file with the query together with positive and negative documents. But how should I know those for a new query? Do you see my problem? And it's the same in v2, as far as I saw.

  1. Use the trained model to predict the similarity score between pre-processed query and each document, result in a score list.
  2. Sort the score list.

To do this I need a format like ('T34', [(1, 'T35'), (0, 'T37'), (0, 'T36')]). So you need to hack around the doc IDs somehow.

Could you please sketch how I would do your points 4 and 5 with MatchZoo?
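One way to build such a relation structure for a new query (a sketch with made-up IDs, assuming the labels are mere placeholders at prediction time, since the model's predicted scores, not these labels, determine the ranking):

```python
# Hedged sketch: build a relation entry of the shape ('T34', [(1, 'T35'), ...])
# for a *new* query. True labels are unknown at prediction time, so dummy
# zeros are filled in; they are ignored when you only read the scores.
doc_ids = ["T35", "T36", "T37"]  # IDs of the already-indexed documents
new_query_id = "Q_new"           # hypothetical ID assigned to the new query

relation = (new_query_id, [(0, doc_id) for doc_id in doc_ids])
```
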

datistiquo commented 5 years ago

@bwanglzu What I thought I need to do is the following:

For a single new query I would manually create a file where each line has the same query (with its word IDs) together with two docs (each also with its word IDs). Each line pairs the same query with the next two documents, stepping through my docs two at a time from the beginning. In this way I would fill curr_batch in the code above, customizing it to work around the doc IDs?

So, is this the way to do new predictions, and is this how you meant to create the score list?

This is very hacky, and you'd need to adapt it for every model you test.

Is this like the way you described in your above answer?

bwanglzu commented 5 years ago

@datistiquo I have no clue what the relation file and the IDs you mentioned above are. Can you answer, @faneshion @pl8787 @yangliuy?

datistiquo commented 5 years ago

@bwanglzu Seriously? 😄 I'm sorry, but I thought you knew more than me...

OK, then maybe you can explain how you create the score list (I assume in v2)? Just a few words so that I can figure it out myself. Because, of course, this is all about a new query that has no ID...

bwanglzu commented 5 years ago

@datistiquo see #684

bwanglzu commented 5 years ago

@datistiquo :) I didn't participate in the development of 1.0, that's the trick...

yangliuy commented 5 years ago

@datistiquo I think your question may be related to how to generate the relation files for a new query. First, you can refer to the script and related functions for generating the relation files here. Then you should understand the differences between the training process and the prediction/testing process. I think what you mean by "new query" is a query in the testing process, i.e. a query in the testing data, so the relation files and IDs for it can be generated with the same functions here. Finally, the text IDs in v1 are generated automatically with the help of a hashmap, to make sure that different texts with the same content get the same ID. But you can modify the code to fit your requirements; for example, you can reuse the text IDs from your raw data. Steps 4 and 5 can be done with just the textual content of the queries and candidate documents; text IDs just help process the data in a more convenient way.
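The hashmap-based ID assignment described above can be sketched as follows (illustrative names, not MatchZoo's actual functions):

```python
# A hashmap (dict) guarantees that identical texts always get the same ID.
text_to_id = {}

def get_id(text, prefix="T"):
    """Return a stable ID for `text`, assigning a fresh one on first sight."""
    if text not in text_to_id:
        text_to_id[text] = f"{prefix}{len(text_to_id)}"
    return text_to_id[text]

a = get_id("reset my password")
b = get_id("opening hours")
c = get_id("reset my password")  # same content -> same ID as `a`
```
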

The statement "x, y = document, document and later on predict with x, y = document, query" may be problematic. Where is the label information? Why would you want to train a ranking model with doc-doc pairs? Are you interested in matching between long texts?

datistiquo commented 5 years ago

Isn't the idea of text matching models like DRMM to get the most relevant documents for a new query? 😄

What confused me is that prediction is actually equal to testing... and in the config file the prediction phase is tied to the training phase. Prediction is done using text1_corpus and text2_corpus! The intuition would be to put your new data there.

I now have a decent trick for doing real prediction. You just use the relation file under prediction to create your batches: use the relation format where each line has the same new query (with its word IDs, instead of a doc ID!) and, in the last column, a doc ID taken over from your doc ID list. I suppose you can hack something together this way...