bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit

https://bnosac.github.io/udpipe/en

Mozilla Public License 2.0

English tokenizer issues #25

Closed cainesap closed 6 years ago

cainesap commented 6 years ago

Hello,

Firstly, thank you for the great resource! I've found it very useful in my work.

I'm writing because I'm seeing some strange tokenization errors with the UD model for English (english-ud-2.0-170801.udpipe).

For instance, "gets" and "figures" are being split as if they ended in a possessive apostrophe-s (get + s, figure + s).

I know I can build my own models, so I will look at the documentation in order to do that. But I wondered if this error type rings any bells: do you think I should be looking at the training data or parameter settings first?

Thank you! Andrew

jwijffels commented 6 years ago

Possibly the same question as #17. Can you provide an example sentence where you see this happening during annotation?

cainesap commented 6 years ago

Yes, here's the example where "figures" is split in two (it's a transcript of spoken English, hence the odd word order and spelled-out numbers):

Concerning fax machines the sales figures decreased between nineteen ninety five and twenty ten.

cainesap commented 6 years ago

Sorry, no, my bad: the sentence where this happens is longer than that. It has a second occurrence of "sales figures" (the one that gets tokenized as figure + s) and lots of hesitation markers ("er", "um", etc.). If I break the sentence down into smaller chunks, tokenization is OK, so that's a way forward. Thank you for your time!
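For anyone landing here with the same problem, the chunking workaround could be sketched roughly as below. This is only an illustration in plain Python, not the code used in this thread: the helper name `chunk_words` and the chunk size of 6 are assumptions, and the thread does not say how the chunks were actually chosen. It reuses the shorter example sentence quoted above, since the longer problem sentence was never posted.

```python
def chunk_words(text, max_words=12):
    """Split text on whitespace into chunks of at most max_words words.

    Hypothetical helper: the size limit is arbitrary and the cut points
    ignore syntax, so a chunk boundary may fall in the middle of a phrase.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# The short example sentence from the thread (14 whitespace-separated words).
utterance = (
    "Concerning fax machines the sales figures decreased "
    "between nineteen ninety five and twenty ten."
)

# Each chunk would then be passed to the tokenizer as a separate input.
chunks = chunk_words(utterance, max_words=6)
for chunk in chunks:
    print(chunk)
```

Cutting at a fixed word count is the simplest option. Because it can separate words that belong together, splitting at pauses or hesitation markers may be a better fit if the downstream tagging and parsing matter.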