XuezheMax / NeuroNLP2

Deep neural models for core NLP tasks (Pytorch version)

GNU General Public License v3.0

WSJ data problem #5

Closed. Jong-Won closed this issue 6 years ago.

Jong-Won commented 7 years ago

Hi, @XuezheMax

Thanks for sharing your code here. The POS tagger is trained on WSJ data from the PTB, but I could not find sections 23-24 in package/treebank_3/tagged/pos/wsj, whereas I did find sections 0-24 in treebank_3/parsed/mrg/wsj.

Did you use the "parsed/mrg" data for the POS tagger?

Another question: I computed statistics on the merged WSJ data. The token count matches the one in your paper, but the sentence count is different. Do you have any idea why?

Sentence counts I get for WSJ:

train: 36386 test: 5104

Thanks

ZhixiuYe commented 6 years ago

@Jong-Won Hi, I encountered the same problem. After preprocessing, I can't get the same number of tokens and sentences as in the paper. Could you tell me how you did it? Thank you!

Jong-Won commented 6 years ago

I found that the token count is the same as in Max's paper, while the sentence count is not. Also, I use the merged data since the tagged files do not include all sections.

XuezheMax commented 6 years ago

@Jong-Won @ZhixiuYe Thanks for your questions. Yes, I used "parsed/mrg" for the POS tagger. I am not sure why your number of sentences differs from the one reported in my paper. As I remember, I followed previous work to process the WSJ data and got the same stats.
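For anyone comparing counts against the paper, here is a minimal sketch of how sentences and tokens could be counted in "parsed/mrg"-style bracketed text. It is not the preprocessing used for the paper. The names `count_mrg`, `LEAF`, and `skip_tag` are hypothetical, and the sketch assumes that each sentence is one top-level bracketed tree and that empty elements tagged `-NONE-` should be left out of the token count. Whether a given preprocessing pipeline drops them is exactly the kind of choice that can shift the numbers.

```python
import re

# A leaf looks like "(TAG word)": a tag and a word with no nested parentheses.
# Literal brackets in PTB text are written as -LRB-/-RRB-, so this is safe.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def count_mrg(text, skip_tag="-NONE-"):
    """Return (num_sentences, num_tokens) for bracketed treebank text.

    Assumption: every top-level balanced bracket group is one sentence.
    Leaves whose tag equals skip_tag are not counted as tokens;
    pass skip_tag=None to count every leaf.
    """
    sentences = 0
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                # Back at the top level: one complete tree has been read.
                sentences += 1
    tokens = sum(1 for tag, _ in LEAF.findall(text) if tag != skip_tag)
    return sentences, tokens


# Two hand-written trees in the bracketed style, for illustration only.
sample = """
( (S (NP-SBJ (NNP Pierre) (NNP Vinken)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)))) (. .)) )
( (S (NP-SBJ (-NONE- *)) (VP (VB Stop)) (. .)) )
"""

print(count_mrg(sample))                 # traces excluded
print(count_mrg(sample, skip_tag=None))  # traces included
```

Running the same function over the files of each section and summing per split makes it easy to see whether a mismatch comes from the sentence segmentation or from the token filter.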