Closed: UsamaI000 closed this issue 3 years ago
See our dataset creation notebook. This ordering was our decision; you can use a custom mapping. The back conversion can be done like this:
target_token2id = {t: tokenizer.encode(t)[-2] for t in ".?,"}
# target_token2id = {'.': 1012, '?': 1029, ',': 1010}
# -2 indexing to eliminate BOS, EOS tokens
target_ids = list(target_token2id.values())
# target_ids = [1012, 1029, 1010]
id2target = {
    0: 0,    # used for empty targets
    -1: -1,  # used for interword tokens, will be masked
}
for i, ti in enumerate(target_ids):
    id2target[ti] = i + 1
target2id = {value: key for key, value in id2target.items()}
Note: in this example we were using another tokenizer, so the ids are different. Of course, you'd better serialize these ids as well so you don't mix them up later; however, we hardcoded these values in the evaluation step.
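For completeness, here is a minimal sketch of the decoding side once id2target is built; tokens and predictions are hypothetical model outputs, and cls2mark is a helper name introduced just for this example:

# hypothetical per-token class predictions, aligned with the word tokens
tokens = ["hello", "world", "how", "are", "you"]
predictions = [1, 0, 0, 0, 2]

# class index -> punctuation mark, in the same ".?," order as above
cls2mark = {i + 1: t for i, t in enumerate(".?,")}

# append the predicted mark (if any) to each token
restored = " ".join(t + cls2mark.get(p, "") for t, p in zip(tokens, predictions))
# restored == "hello. world how are you?"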
Thanks
How did you decide id2target for converting back from the predictions to the original punctuation?

id2target = {
    -1: 0,
    9: 1,   # .
    60: 2,  # ?
    15: 3,  # ,
    -2: -1, # will be masked
}
I have class 1 as ',', class 2 as '?', class 3 as '.', class 4 as '!', and class 0 for everything else. How should I use this?
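For that layout, a sketch of adapting the snippet above; the actual token ids depend on your tokenizer, so these lines assume the same encode-based recipe as before:

custom_marks = ",?.!"  # iteration order sets the classes: 1 -> ',', 2 -> '?', 3 -> '.', 4 -> '!'
target_token2id = {t: tokenizer.encode(t)[-2] for t in custom_marks}
id2target = {0: 0, -1: -1}  # 0 for empty targets, -1 for masked interword tokens
for i, ti in enumerate(target_token2id.values()):
    id2target[ti] = i + 1
target2id = {value: key for key, value in id2target.items()}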