EducationalTestingService / rstfinder

Fast Discourse Parser to find latent Rhetorical STructure (RST) in text.

tokenization issues for non-ascii texts #28

Open mheilman opened 10 years ago

mheilman commented 10 years ago

The NLTK tokenizer used in the code doesn't handle fancy quotation marks very well. They just end up attached to words rather than being separate tokens.

We should probably either preprocess the input that is passed to the tokenizer, find another tokenizer, or fix the current one.

There may be some issues related to other types of symbols as well.
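For reference, here is a minimal sketch of the behavior being described, assuming NLTK's `word_tokenize` (the exact tokens depend on the NLTK version and on having the `punkt` models downloaded):

```python
# Sketch of the reported behavior: ASCII quotes get split off, but the
# curly (non-ASCII) quotes tend to stay attached to the neighboring word.
from nltk import word_tokenize  # requires nltk.download('punkt') beforehand

plain = 'He said "hello" to her.'
fancy = 'He said \u201chello\u201d to her.'  # same sentence with curly quotes

print(word_tokenize(plain))  # straight quotes come out as separate `` and '' tokens
print(word_tokenize(fancy))  # curly quotes may remain glued to "hello"
```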

dan-blanchard commented 10 years ago

And this is why we don't use that internally...

mheilman commented 10 years ago

indeed.

dmnapolitano commented 10 years ago

:+1: Is the issue just the quotation marks or all non-ASCII characters?

mheilman commented 10 years ago

probably a lot of them

mheilman commented 10 years ago

I guess another option would be to use unidecode...

dan-blanchard commented 10 years ago

> I guess another option would be to use unidecode...

That's probably unnecessary. Other than quotes, dashes, and ellipses, there aren't many characters that end up next to words and that you wouldn't want to stay attached.

We could probably just do replacements with a dict like this:

_NON_ASCII = [("\u2026", "..."),  # horizontal ellipsis
              ("\u2018", "`"),    # left single quotation mark
              ("\u2019", "'"),    # right single quotation mark
              ("\u201c", "``"),   # left double quotation mark
              ("\u201d", "''"),   # right double quotation mark
              ("\u2013", "-"),    # en dash
              ("\u00a0", " ")]    # no-break space

dmnapolitano commented 10 years ago

Yeah, that's what I was thinking. If it's choking on "김수진", however, then yeah. 😕

dan-blanchard commented 10 years ago

Well, it's not that it's choking on things, it's just not splitting quotes off.

Handling non-English/non-ASCII characters is a different issue, since it's not very well defined what the tokenizer should do in those cases.

dan-blanchard commented 10 years ago

It's also relevant that if we ran it through unidecode, it would turn Chinese characters into Pinyin, which may yield English-looking words (e.g., "fan") that could throw off parsing features.
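A quick sketch of that concern (the exact transliterations depend on the unidecode version installed):

```python
# unidecode maps curly quotes to plain ASCII, but it also transliterates
# CJK text into pinyin-like ASCII, so non-English input can come out
# looking like English tokens.
from unidecode import unidecode

print(unidecode("\u201csmart quotes\u201d"))  # -> plain ASCII double quotes
print(unidecode("北京"))                       # -> roughly "Bei Jing"
```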

mheilman commented 10 years ago

Hmm, the simpler dictionary approach sounds good, but I think that dict above is missing a few things (http://en.wikipedia.org/wiki/Apostrophe).
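For example, a few more code points along those lines could be added to the mapping (this particular selection is illustrative, drawn from that article, and wasn't agreed on in the thread):

```python
# Possible additions to the mapping above; illustrative only.
_MORE_NON_ASCII = [("\u02bc", "'"),   # modifier letter apostrophe
                   ("\u00b4", "'"),   # acute accent sometimes typed as an apostrophe
                   ("\u2032", "'"),   # prime, occasionally used as an apostrophe
                   ("\u2014", "--")]  # em dash (the list above only covers the en dash)
```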

ghost commented 4 years ago

@desilinguist here is an issue related to the tokenizer, but not exactly what we thought.

The NLTK tokenizer does not find the sentence boundaries correctly. For example, one of the EDUs output by this parser looks like this: ['or', 'maybe', 'a', 'guy', 'never', 'ask', 'a', 'her', 'out.in', 'case', 'of', 'a', 'guy', 'probably', 'the', 'same', 'comments'], where a new sentence should start at 'in case', but the boundary was not detected.

It would be better if we could pass pre-tokenized input to rst_parse so that it does not do the tokenization itself.
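For what it's worth, here is a small preprocessing sketch (purely illustrative, not part of rstfinder) that inserts the missing space after a period glued to the next word, so the sentence splitter at least has a chance to find the boundary:

```python
import re

def add_missing_sentence_spaces(text):
    """Insert a space after ., !, or ? when a letter follows immediately,
    e.g. "ask her out.in case" -> "ask her out. in case".
    Note: this is naive and will also split abbreviations like "e.g.".
    """
    return re.sub(r"([.!?])(?=[A-Za-z])", r"\1 ", text)

print(add_missing_sentence_spaces("or maybe a guy never ask her out.in case of a guy"))
```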