attilanagy234 / neural-punctuator

Complementary code for our paper Automatic punctuation restoration with BERT models
MIT License

id2target and target2id #10

Closed: UsamaI000 closed this issue 3 years ago

UsamaI000 commented 3 years ago

How did you decide on the id2target mapping for converting predictions back to the original punctuation?

id2target = {
    -1: 0,
    9: 1,    # .
    60: 2,   # ?
    15: 3,   # ,
    -2: -1,  # will be masked
}

In my setup, class 1 is ",", class 2 is "?", class 3 is ".", class 4 is "!", and class 0 is everything else. How should I use this mapping?
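
For clarity, my label scheme written out as a plain dict (class id to punctuation mark):

class2punct = {
    0: '',   # all other tokens
    1: ',',
    2: '?',
    3: '.',
    4: '!',
}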

bana513 commented 3 years ago

See our dataset creation notebook. This ordering was our own choice; you can use any custom mapping. The conversion back can be done like this:

# Assuming a HuggingFace BERT tokenizer here; the ids shown below
# are for bert-base-uncased, other vocabs give different ids.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

target_token2id = {t: tokenizer.encode(t)[-2] for t in ".?,"}
# target_token2id = {'.': 1012, '?': 1029, ',': 1010}
# [-2] indexing drops the BOS/EOS special tokens added by encode()

target_ids = list(target_token2id.values())
# target_ids = [1012, 1029, 1010]

id2target = {
    0: 0,    # used for empty targets (no punctuation)
    -1: -1,  # used for interword tokens, will be masked
}
for i, ti in enumerate(target_ids):
    id2target[ti] = i + 1
target2id = {value: key for key, value in id2target.items()}

Note: this example uses a different tokenizer than the repo, so the ids differ. You should serialize these ids as well so you don't mix them up later; we, however, hardcoded these values in the evaluation step.
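
For the five-class scheme from the question, the same recipe adapts directly. A minimal sketch, assuming a HuggingFace bert-base-uncased tokenizer (the tokenizer name and the example predictions are illustrative, not from the repo):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# class order from the question: 1 = "," , 2 = "?" , 3 = "." , 4 = "!"
target_token2id = {t: tokenizer.encode(t)[-2] for t in ",?.!"}

id2target = {0: 0, -1: -1}
for i, ti in enumerate(target_token2id.values()):
    id2target[ti] = i + 1
target2id = {v: k for k, v in id2target.items()}

# convert predicted class ids back to punctuation strings
pred_classes = [0, 1, 0, 3]  # e.g. argmax of the model output per word
pred_marks = [tokenizer.decode([target2id[c]]) if c > 0 else "" for c in pred_classes]
# pred_marks == ['', ',', '', '.']

Serializing target_token2id next to the model checkpoint avoids the hardcoding pitfall mentioned above.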

UsamaI000 commented 3 years ago

Thanks