mheilman opened this issue 10 years ago
And this is why we don't use that internally...
indeed.
:+1: Is the issue just the quotation marks or all non-ASCII characters?
probably a lot of them
I guess another option would be to use unidecode...
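A minimal sketch of that option, assuming the third-party unidecode package (which this project doesn't currently depend on):

```python
# Minimal sketch of the unidecode option; assumes the third-party
# "unidecode" package (pip install unidecode), not a current dependency.
from unidecode import unidecode

print(unidecode("\u201cHello\u201d \u2013 she said\u2026"))
# -> '"Hello" - she said...'
```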
That's probably unnecessary. There aren't many characters other than quotes, dashes, and ellipses that end up next to words and shouldn't stay attached to them.
We could probably just do replacements with a dict like this:
```python
_NON_ASCII = [("\u2026", "..."),   # horizontal ellipsis
              ("\u2018", "`"),     # left single quotation mark
              ("\u2019", "'"),     # right single quotation mark
              ("\u201c", "``"),    # left double quotation mark
              ("\u201d", "''"),    # right double quotation mark
              ("\u2013", "-"),     # en dash
              ("\u00a0", " ")]     # no-break space
```
Yeah, that's what I was thinking. If it's choking on Korean characters, however, then yeah.
Well, it's not that it's choking on things, it's just not splitting quotes off.
Handling non-English/non-ASCII characters is a different issue, since it's not very well defined what the tokenizer should do in those cases.
It's also relevant that if we ran it through unidecode, it would turn Chinese characters into Pinyin, which may yield English words (e.g., "fan") that could throw off parsing features.
Hmm, the simpler dictionary approach sounds good, but the dict above is missing a few things (http://en.wikipedia.org/wiki/Apostrophe).
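For example, entries along these lines might be worth adding (a rough guess based on that page, not a vetted list):

```python
# Rough guess at additional entries, based on the apostrophe/quote
# variants on that Wikipedia page; mappings should be double-checked.
_NON_ASCII += [("\u02bc", "'"),    # modifier letter apostrophe
               ("\u00b4", "'"),    # acute accent (often misused as apostrophe)
               ("\u2032", "'"),    # prime
               ("\u2014", "--")]   # em dash
```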
@desilinguist here is an issue related to the tokenizer, but not exactly what we thought.
The NLTK tokenizer does not find sentence boundaries correctly. For example, one of the EDUs output by this parser looks like this:

```
['or', 'maybe', 'a', 'guy', 'never', 'ask', 'a', 'her', 'out.in', 'case', 'of', 'a', 'guy', 'probably', 'the', 'same', 'comments']
```

A new sentence should start at "in case" (note the unsplit "out.in"), but the boundary was not detected.
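If I'm reading the Punkt behavior right, it only proposes boundaries where the period is followed by whitespace, so the missing space after "out." hides the boundary completely. Something like this reproduces it (the input string is a rough reconstruction from the EDU above):

```python
from nltk.tokenize import sent_tokenize

# Rough reconstruction of the input behind the EDU above.
text = ("or maybe a guy never ask a her out.in case of a guy "
        "probably the same comments")
print(sent_tokenize(text))
# -> the whole string comes back as a single "sentence": with no
#    whitespace after the period, "out.in" is one token to Punkt.
```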
It would be better if we could pass pre-tokenized input to rst_parse so that it does not do the tokenization itself.
The NLTK tokenizer used in the code doesn't handle fancy quotation marks very well. They just end up attached to words rather than being separate tokens.
We should probably either preprocess the input that is passed to the tokenizer, find another tokenizer, or fix the current one.
There may be some issues related to other types of symbols as well.
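For concreteness, the symptom looks like this (with the NLTK version in use at the time; newer releases may handle curly quotes differently):

```python
from nltk.tokenize import word_tokenize

print(word_tokenize("\u201cHello,\u201d she said."))
# With the NLTK version in use at the time, the curly quotes stay glued
# to adjacent tokens (e.g. '\u201cHello' comes out as a single token)
# instead of being split off the way ASCII quotes are.
```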