Are start_pos and end_pos computed incorrectly for each word in mixed Chinese-English text?

NLPInBLCU / BiaffineDependencyParsing

BERT+Self-attention Encoder ; Biaffine Decoder ; Pytorch Implement

MIT License


Are start_pos and end_pos computed incorrectly for each word in mixed Chinese-English text? #13

Open erichuazhou opened 4 years ago

erichuazhou commented 4 years ago

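In the `_get_words_start_end_pos` function, start_pos and end_pos are advanced by accumulating len(w). However, when w is English, BERT tokenizes it with the WordPiece algorithm rather than character by character. As a result, the length of w in `_get_words_start_end_pos` (e.g. len('15.9mm') = 6) disagrees with its length in `convert_examples_to_features` (e.g. len(['15', '.', '9', '##mm']) = 4). So start_pos and end_pos are already wrong at the data-processing stage. I may have misread the code, so please take a look. @LiangsLi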

LiangsLi commented 4 years ago

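Hi @erichuazhou, I checked, and this is indeed a bug. The current input pipeline is overly complicated and ugly. I plan to pass the tokenizer in so that word boundaries and the maximum sentence length are computed correctly. However, I am busy with my graduation project right now, and this part needs a lot of changes, so I can't say when it will be finished. Thanks for reading through the code!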
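
For readers following the thread, here is a minimal sketch of the mismatch and of the tokenizer-based fix described above. It is not the repository's code: `toy_tokenize`, `char_based_spans`, and `token_based_spans` are hypothetical names, the toy tokenizer only imitates WordPiece-style splitting for this one example, and half-open `[start, end)` spans are an assumption about the intended convention.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical stand-in for a WordPiece tokenizer: digit runs, ASCII letter
# runs, and every other character become separate pieces.
_PIECE = re.compile(r"\d+|[A-Za-z]+|\S")


def toy_tokenize(word: str) -> List[str]:
    pieces: List[str] = []
    prev_alnum = False
    for m in _PIECE.finditer(word):
        text = m.group()
        is_alnum = text.isascii() and text.isalnum()
        # Mark a piece as a continuation when it directly follows another
        # ASCII alphanumeric piece, imitating the "##" prefix.
        if is_alnum and prev_alnum:
            text = "##" + text
        pieces.append(text)
        prev_alnum = is_alnum
    return pieces


def char_based_spans(words: List[str]) -> List[Tuple[int, int]]:
    """Spans advanced by len(w), as the issue describes (the buggy variant)."""
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w)))
        pos += len(w)
    return spans


def token_based_spans(
    words: List[str],
    tokenize: Callable[[str], List[str]],
    offset: int = 0,  # assumption: set to 1 if a [CLS] token is prepended
) -> List[Tuple[int, int]]:
    """Spans advanced by the number of sub-tokens the tokenizer produces."""
    spans, pos = [], offset
    for w in words:
        n = len(tokenize(w))
        spans.append((pos, pos + n))
        pos += n
    return spans


words = ["长度", "为", "15.9mm"]
print(toy_tokenize("15.9mm"))                  # ['15', '.', '9', '##mm']
print(char_based_spans(words))                 # [(0, 2), (2, 3), (3, 9)]
print(token_based_spans(words, toy_tokenize))  # [(0, 2), (2, 3), (3, 7)]
```

The Chinese words get the same spans either way because each character is one piece, while the English word shrinks from 6 positions to 4. In a real pipeline the `tokenize` argument would presumably be the same tokenizer used in `convert_examples_to_features`, so that both stages count sub-tokens identically.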