SVAIGBA / WMSeg

About splitting long sentences in train/dev/test at line 541 of "wmseg_model.py" #9

Closed vuraemon closed 3 years ago

vuraemon commented 3 years ago

Excuse me!

Can you tell me why you used the line "if char in [',', '。', '?', '!', ':', ';', '(', ')', '、'] and len(sentence) > 64:" on all of the train/dev/test sets to split long sentences? Is this process valid for evaluating Chinese Word Segmentation?

Thanks for your answer.

yuanheTian commented 3 years ago

Thanks for asking.

The reason we split long sentences into short ones is to make the code run faster. You don't have to do this.

Given that punctuation marks are always natural word boundaries, we think this is valid for evaluating Chinese Word Segmentation.
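
For readers wondering what such a cut looks like in practice, here is a minimal sketch of splitting a character/label sequence at punctuation once it exceeds 64 characters. The function name, the label handling, and the demo input are illustrative assumptions, not the repository's exact code from wmseg_model.py.

```python
# Sketch only: illustrates the splitting idea discussed above, not the exact
# implementation in wmseg_model.py.

SPLIT_CHARS = [',', '。', '?', '!', ':', ';', '(', ')', '、']
MAX_LEN = 64  # threshold from the quoted condition


def split_long_sentence(chars, labels):
    """Cut a character/label sequence at punctuation once it exceeds MAX_LEN.

    Because the cut points are punctuation marks, which are themselves word
    boundaries in Chinese, no gold word is ever split across two pieces, so
    segmentation evaluation is unaffected.
    """
    pieces = []
    sentence, sentence_labels = [], []
    for char, label in zip(chars, labels):
        sentence.append(char)
        sentence_labels.append(label)
        if char in SPLIT_CHARS and len(sentence) > MAX_LEN:
            pieces.append((sentence, sentence_labels))
            sentence, sentence_labels = [], []
    if sentence:  # keep any trailing piece shorter than the threshold
        pieces.append((sentence, sentence_labels))
    return pieces


if __name__ == '__main__':
    text = list('今天天气很好,我们去公园散步。' * 5)
    tags = ['S'] * len(text)  # dummy labels just for the demo
    for piece_chars, _ in split_long_sentence(text, tags):
        print(len(piece_chars), ''.join(piece_chars))
```

Since each piece ends at a punctuation mark, concatenating the per-piece predictions reproduces the segmentation of the original long sentence.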