Closed datistiquo closed 5 years ago
The index id of a datapack is subject to change by any preprocessing step. If you're printing the entire datapack with `datapack.frame()`, a change is even more likely, since that call involves a lot of table manipulation. It is also possible that preprocessing filters out some of the data, so the row you're looking for may not be in the result at all.
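To check whether rows were filtered rather than just renumbered, one option is to compare the identifier pairs before and after preprocessing. This is a minimal pandas-only sketch, not MatchZoo code: the two frames are hypothetical stand-ins for the output of `datapack.frame()` before and after preprocessing, the `text_right` column name is an assumption, and the filter simply simulates a preprocessor that drops empty documents.

```python
import pandas as pd

# Hypothetical stand-in for the frame before preprocessing.
before = pd.DataFrame({
    "id_left": ["Q1", "Q2", "Q3"],
    "id_right": ["D1", "D2", "D3"],
    "text_right": ["first document", "", "third document"],
})

# Simulated preprocessing: drop empty documents, then renumber the rows.
# After reset_index the row numbers no longer match the original ones.
after = before[before["text_right"].str.len() > 0].reset_index(drop=True)

# Compare by identifier pair, never by row number.
before_pairs = set(zip(before["id_left"], before["id_right"]))
after_pairs = set(zip(after["id_left"], after["id_right"]))
dropped = sorted(before_pairs - after_pairs)

print(dropped)
```

Any pair listed in `dropped` was removed by the filter; pairs that are still present may simply sit at a different row number.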
Thanks. Let's say my preprocessing does not delete any line: how do the ID changes happen, and where? I cannot find anything.
Also, does each run change the vocab IDs? I think this is correlated with the change of IDs before preprocessing.
If the IDs change, it is not easy at prediction time to map the predicted documents back to the originals via ID. How should I do that then? Since I rely heavily on stemming and word removal, just printing out the preprocessed doc is rather unreadable.
I meant that the "index id", which denotes the number of a specific row, is subject to change, and you shouldn't assume its order. `id_left` and `id_right` are the real identifiers of samples, and `datapack.relation` always manages them correctly. If you're asking about the detailed behavior of how index ids and `id_left`/`id_right` are managed, then you have to read the source code. There are `reset_index` and `groupby` calls here and there, and exactly what changes really depends on the preprocessor configuration. We don't care much about this because it doesn't matter for training, as we always shuffle before training.
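For mapping predictions back to readable originals, a rough sketch of the idea is to keep the raw texts keyed by `id_right` and join on that key instead of relying on row position. This uses plain pandas only. The frames, the `text_right` column name, and the scores are hypothetical, and it assumes the scores are row-aligned with the relation table used for prediction.

```python
import pandas as pd

# Raw, unprocessed texts keyed by their stable identifier (hypothetical data).
raw_right = pd.DataFrame(
    {"text_right": ["The cats are sleeping.", "Dogs were barking loudly."]},
    index=pd.Index(["D1", "D2"], name="id_right"),
)

# Relation table as it might look after preprocessing: rows reordered and
# the row index renumbered, but id_left / id_right still intact.
relation = pd.DataFrame({
    "id_left": ["Q1", "Q1"],
    "id_right": ["D2", "D1"],
})

# Hypothetical model scores, assumed to be in the same row order as `relation`.
scores = [0.9, 0.2]

# Attach the scores, then look up the original text through id_right.
result = relation.assign(score=scores).join(raw_right, on="id_right")
result = result.sort_values("score", ascending=False).reset_index(drop=True)

print(result[["id_left", "id_right", "score", "text_right"]])
```

Because the lookup goes through `id_right`, it keeps working no matter how the row numbers were reshuffled, and the stemmed text never needs to be printed.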
Hi,
I created my datapack. I printed the first 5 lines to check the results after preprocessing with BasicPreprocessor. I noticed that the order of the lines, and so the IDs, changes (the preprocessed lines correspond to different IDs than before). But I cannot find the reason. Does preprocessing somehow shuffle the order?
I would like to keep the same order.