Jong-Won closed this issue 6 years ago
@Jong-Won Hi, I encountered the same problem. After preprocessing, I can't reproduce the token and sentence counts reported in the paper. Could you tell me how you did it? Thank you!
I found that the token count matches Max's paper, but the sentence count does not. Also, I use the merged data, since the tagged files do not include all sections.
@Jong-Won @ZhixiuYe Thanks for your questions. Yes, I used "parsed/mrg" for the POS tagger. I am not sure why your sentence count differs from the one reported in my paper. As I remember, I followed previous work to process the WSJ data and got the same stats.
Hi, @XuezheMax
Thanks for sharing your code here. The POS tagger is trained on WSJ data from PTB, but I could not find 23-24 in package/treebank_3/tagged/pos/wsj, whereas I found 0-24 in treebank_3/parsed/mrg/wsj.
Did you use "parsed/mrg" for the POS tagger?
Another question: I computed statistics on the merged WSJ data. The token count matches your paper, but the sentence count is different. Do you have any idea why?
Sentence counts I get for WSJ:
train: 36386 test: 5104
Thanks
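
For anyone comparing counts, here is a minimal sketch of one way to count sentences and tokens in PTB-style bracketed ("parsed/mrg") text. It is not the preprocessing used in the paper. It assumes that each top-level bracketed tree is one sentence and that `-NONE-` leaves (empty elements) are not counted as tokens; the function name `count_trees_and_tokens` and the sample text are made up for illustration.

```python
import re

# Matches a leaf such as "(DT The)": a tag and a word with no nested brackets.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")

def count_trees_and_tokens(mrg_text):
    """Return (sentences, tokens) for PTB-style bracketed text.

    Assumption: every bracket opened at depth 0 starts a new sentence,
    and leaves tagged -NONE- are skipped when counting tokens.
    """
    depth = 0
    sentences = 0
    for ch in mrg_text:
        if ch == "(":
            if depth == 0:
                sentences += 1  # a new top-level tree begins here
            depth += 1
        elif ch == ")":
            depth -= 1
    tokens = sum(1 for tag, _ in LEAF.findall(mrg_text) if tag != "-NONE-")
    return sentences, tokens

# Hypothetical two-sentence sample in the .mrg bracketing style.
sample = """
( (S (NP-SBJ (DT The) (NN cat)) (VP (VBD sat)) (. .)) )
( (S (NP-SBJ (-NONE- *)) (VP (VB Run)) (. !)) )
"""

if __name__ == "__main__":
    print(count_trees_and_tokens(sample))
```

Running this over each file of a split and summing the results gives per-split totals. If the token total matches but the sentence total does not, the difference probably lies in how sentence boundaries are derived rather than in tokenization.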