WSJ corpus preprocessing

LiyuanLucasLiu / LM-LSTM-CRF

Empower Sequence Labeling with Task-Aware Language Model

http://arxiv.org/abs/1709.04109

Apache License 2.0

WSJ corpus preprocessing #26

Closed ZhixiuYe closed 6 years ago

ZhixiuYe commented 6 years ago

Hi, I have obtained the treebank_3\tagged\pos\wsj corpus. But after converting it to CoNLL format, I get 37544, 5642 and 6540 sentences for train, dev and test respectively, which is inconsistent with your paper. I wonder what you did to preprocess the WSJ corpus. Thank you!

LiyuanLucasLiu commented 6 years ago

Hi, thanks for asking. I guess you might have got the wrong version of the WSJ-PTB corpus. I just counted the number of sentences in the training set, and I believe the reported number is right (it is also consistent with other papers).
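
For anyone who wants to repeat the count on their own converted files, here is a minimal sketch. It is not the repository's preprocessing code. It assumes a CoNLL-style layout with one token per line, blank lines between sentences, and optional `-DOCSTART-` marker lines that should not be counted as sentences. The function name `count_conll_sentences` is hypothetical.

```python
from typing import Iterable


def count_conll_sentences(lines: Iterable[str]) -> int:
    """Count sentences in CoNLL-style input (assumed layout, see above)."""
    count = 0
    in_sentence = False
    for line in lines:
        stripped = line.strip()
        # Blank lines and -DOCSTART- markers act as sentence boundaries.
        if not stripped or stripped.startswith("-DOCSTART-"):
            if in_sentence:
                count += 1
                in_sentence = False
            continue
        # Any other line is a token belonging to the current sentence.
        in_sentence = True
    # Count the final sentence if the input does not end with a blank line.
    if in_sentence:
        count += 1
    return count


# Small in-memory example: a document marker followed by two sentences.
sample = [
    "-DOCSTART- -X- -X- O\n",
    "\n",
    "The DT\n",
    "cat NN\n",
    "\n",
    "sat VBD\n",
]
sample_count = count_conll_sentences(sample)
```

To count a file, pass an open file handle instead of the list, for example `count_conll_sentences(open(path, encoding="utf-8"))`, where `path` points to your own converted train, dev or test file. If the counts still differ from the paper, the difference is more likely to come from the corpus version or the section split than from the counting itself.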