SVAIGBA / WMSeg

About splitting long sentences in train/dev/test at line 541 of "wmseg_model.py" #9

Closed vuraemon closed 3 years ago

vuraemon commented 3 years ago

Excuse me!

Can you tell me why you used the line "if char in [',', '。', '?', '!', ':', ';', '(', ')', '、'] and len(sentence) > 64:" on all of the train/dev/test sets to split long sentences? Is this process valid for evaluating Chinese Word Segmentation?

Thanks for your answer.

yuanheTian commented 3 years ago

Thanks for asking.

The reason we split long sentences into short ones is to make the code run faster. You don't have to do this.

Given that punctuation marks are always natural word boundaries, we think this is valid for evaluating Chinese Word Segmentation.
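
For readers wondering what such a cut looks like in practice, here is a minimal sketch of splitting a character/label sequence at punctuation once it exceeds 64 characters. The function name, the label handling, and the demo input are illustrative assumptions, not the repository's exact code from wmseg_model.py.

```python
# Sketch only: illustrates the splitting idea discussed above, not the exact
# implementation in wmseg_model.py.

SPLIT_CHARS = [',', '。', '?', '!', ':', ';', '(', ')', '、']
MAX_LEN = 64  # threshold from the quoted condition


def split_long_sentence(chars, labels):
    """Cut a character/label sequence at punctuation once it exceeds MAX_LEN.

    Because the cut points are punctuation marks, which are themselves word
    boundaries in Chinese, no gold word is ever split across two pieces, so
    segmentation evaluation is unaffected.
    """
    pieces = []
    sentence, sentence_labels = [], []
    for char, label in zip(chars, labels):
        sentence.append(char)
        sentence_labels.append(label)
        if char in SPLIT_CHARS and len(sentence) > MAX_LEN:
            pieces.append((sentence, sentence_labels))
            sentence, sentence_labels = [], []
    if sentence:  # keep any trailing piece shorter than the threshold
        pieces.append((sentence, sentence_labels))
    return pieces


if __name__ == '__main__':
    text = list('今天天气很好,我们去公园散步。' * 5)
    tags = ['S'] * len(text)  # dummy labels just for the demo
    for piece_chars, _ in split_long_sentence(text, tags):
        print(len(piece_chars), ''.join(piece_chars))
```

Since each piece ends at a punctuation mark, concatenating the per-piece predictions reproduces the segmentation of the original long sentence.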