NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0
3.84k stars 898 forks

v2: datapack IDs after preprocessing change #697

Closed. datistiquo closed this issue 5 years ago

datistiquo commented 5 years ago

Hi,

I created my datapack and printed the first 5 lines to check the results after preprocessing with BasicPreprocessor. I noticed that the order of the lines, and so the IDs, changes (the preprocessed lines correspond to different IDs than before), but I cannot find the reason. Does preprocessing somehow shuffle the order?

I would like to have the same order.

uduse commented 5 years ago

The index id of a datapack is subject to change during any preprocessing step. If you're printing the entire datapack with datapack.frame(), a change is even more likely, since there are a lot of table manipulations in it. It is also possible that preprocessing filters out some of the data, so the data you're looking for is actually not in the result.

datistiquo commented 5 years ago

Thanks. Let's say my preprocessing does not delete any line: how do the ID changes happen, and where? I cannot find anything.

Also, do the vocab IDs change on each run? But I think this is correlated with the change of the IDs from before preprocessing.

If the IDs change, it is not so easy at prediction time to map the predicted documents back to the originals via ID. How should I do that, then? Since I rely heavily on stemming and removing words, just printing out the preprocessed doc is a bit unreadable.

uduse commented 5 years ago

I meant that the "index id", which denotes the number of a specific row, is subject to change, and you shouldn't assume its order. id_left and id_right are the real identifiers of samples, and datapack.relation always manages them correctly. If you're asking about the detailed behavior of how index ids and id_left/id_right are managed, then you have to read the source code. There are reset_index and groupby calls here and there, and exactly what changes really depends on the preprocessor configuration. We don't care about this very much because it doesn't matter for training, as we always shuffle before training.
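
To make the difference concrete, here is a minimal sketch in plain pandas. It does not use the MatchZoo API. The tables, ids, texts and scores are all made up for illustration, and the filter is only a stand-in for whatever a preprocessor might do. It shows that the row number can point at a different pair after a filter plus reset_index, while id_left and id_right still let you join predictions back to the original, readable text.

```python
import pandas as pd

# Hypothetical raw data, shaped loosely like a left table, a right table
# and a relation table that links them by id_left / id_right.
left = pd.DataFrame({
    "id_left": ["Q1", "Q2", "Q3"],
    "text_left": ["how to reset password", "", "shipping cost"],
})
right = pd.DataFrame({
    "id_right": ["D1", "D2", "D3"],
    "text_right": ["Reset your password here", "We ship worldwide", "Shipping is 5 EUR"],
})
relation = pd.DataFrame({
    "id_left": ["Q1", "Q2", "Q3", "Q3"],
    "id_right": ["D1", "D2", "D2", "D3"],
    "label": [1, 0, 0, 1],
})

# Stand-in for a preprocessing filter (an assumption, not MatchZoo code):
# drop pairs whose left text is empty, then renumber the rows.
kept_ids = left.loc[left["text_left"].str.len() > 0, "id_left"]
processed = relation[relation["id_left"].isin(kept_ids)].reset_index(drop=True)

# The same row number now refers to a different pair.
print(relation.loc[1, ["id_left", "id_right"]].tolist())
print(processed.loc[1, ["id_left", "id_right"]].tolist())

# Hypothetical model scores, one per row of the processed relation.
processed["score"] = [0.9, 0.2, 0.8]

# Map predictions back to the original (unstemmed) text by id, not by row number.
readable = (
    processed
    .merge(left, on="id_left", how="left")
    .merge(right, on="id_right", how="left")
)
print(readable[["id_left", "id_right", "score", "text_left", "text_right"]])
```

So for prediction, keeping a copy of the original texts keyed by id_left and id_right and joining on those ids should be enough, whatever happens to the row order.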