dreasysnail / POINTER

MIT License
112 stars 19 forks

about constructing data #9

Open NLPCode opened 3 years ago

NLPCode commented 3 years ago

I think there is an error at line 444 of generate_training_data.py. It should be: tokens = tokenizer.tokenize(line)

dreasysnail commented 3 years ago

Thanks for pointing that out! We will look into it.

guoyinwang commented 3 years ago

Thanks for pointing this out. Since the POS tagging step requires whole words rather than subwords, we took a shortcut here and used split instead of the tokenizer, to avoid having to match word indices against subword indices. We will try to correct this in a later version.
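To illustrate the mismatch described above, here is a minimal sketch with a toy WordPiece-style tokenizer. The vocabulary and `toy_tokenize` helper are hypothetical stand-ins for BERT's real tokenizer, not POINTER's code: a word-level split yields one token per POS tag, while subword tokenization can yield more tokens than tags.

```python
# Toy WordPiece-style tokenizer (hypothetical, for illustration only):
# out-of-vocabulary words are split greedily into a stem plus
# "##"-prefixed continuation pieces, mimicking BERT's subword output.
VOCAB = {"the", "model", "token", "##izer", "##ization", "splits", "words"}

def toy_tokenize(word):
    """Greedy longest-match subword split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            # No piece matched: keep the remainder as a single unknown chunk.
            pieces.append(word[start:] if start == 0 else "##" + word[start:])
            break
    return pieces

line = "the tokenization splits words"
words = line.split()                                   # 4 word tokens, 1 POS tag each
subwords = [p for w in words for p in toy_tokenize(w)]  # 5 subword tokens

print(words)     # ['the', 'tokenization', 'splits', 'words']
print(subwords)  # ['the', 'token', '##ization', 'splits', 'words']
```

With `split`, the four POS tags line up one-to-one with the four word tokens; after subword tokenization there are five tokens, so the tags no longer align without an explicit word-to-subword index mapping.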

etrigger commented 3 years ago

@dreasysnail @guoyinwang If we split the text into words when preparing the training data, but want to encode the text with subwords during training, how do we align the two tokenizations?
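One common way to handle this (a sketch of a standard technique, not necessarily what POINTER implements) is to tokenize each word separately and record, for every word index, the half-open span of subword positions it produced; any per-word annotation such as a POS tag can then be broadcast over that span. The `subword_split` helper below is a hypothetical stand-in for calling `tokenizer.tokenize` on a single word:

```python
def subword_split(word):
    # Hypothetical stand-in for tokenizer.tokenize(word); a real BERT
    # tokenizer would produce "##"-prefixed continuation pieces.
    if word == "tokenization":
        return ["token", "##ization"]
    return [word]

def align(words):
    """Map each word index to the half-open span of its subword tokens."""
    subwords, spans = [], []
    for w in words:
        pieces = subword_split(w)
        spans.append((len(subwords), len(subwords) + len(pieces)))
        subwords.extend(pieces)
    return subwords, spans

words = "the tokenization splits words".split()
subwords, spans = align(words)

print(subwords)  # ['the', 'token', '##ization', 'splits', 'words']
print(spans)     # [(0, 1), (1, 3), (3, 4), (4, 5)]
```

Here `spans[i]` gives the subword range for `words[i]`, so a word-level label can be repeated across every subword position in that range (or assigned only to the first piece, as is common in token-classification setups).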