MycroftAI / lingua-franca

Mycroft's multilingual text parsing and formatting library
Apache License 2.0

Normalizer mishandles "X%.", returns "X %." #196

Open ChanceNCounter opened 3 years ago

ChanceNCounter commented 3 years ago

normalize("Set Volume to 50%.") -> "Set Volume to 50 %."

This is bad. It should probably, at worst, return "Set Volume to 50 % ."

Badboy-16 commented 3 years ago

Hi @ChanceNCounter I would like to work on this issue. As this would be my first contribution to this project, I'll complete the steps required to become a contributor and submit a PR shortly. :)

ChanceNCounter commented 3 years ago

Sounds good! I think it should ideally maintain the percentage as such, meaning that when the normalized phrase is passed to a tokenizer, one of the tokens should be "50%". But that's my opinion.
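To make the expected token stream concrete, here is a minimal sketch of a tokenizer that keeps a percentage in one piece and splits off the trailing period. The `tokenize` function and its regex are hypothetical illustrations, not part of lingua_franca or any Mycroft tokenizer.

```python
import re

# Hypothetical tokenizer for illustration only (not lingua_franca's API).
# Alternatives are tried in order:
#   1. a number with an optional decimal part and an optional trailing "%"
#   2. a run of word characters
#   3. any single character that is neither a word character nor whitespace
TOKEN_RE = re.compile(r"\d+(?:\.\d+)?%?|\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping "50%" together as one token."""
    return TOKEN_RE.findall(text)


print(tokenize("Set Volume to 50%."))
# ['Set', 'Volume', 'to', '50%', '.']
```

With this approach the sentence-final period becomes its own token, so neither "50 %." nor "%." ever appears downstream.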

In the long run, the oddness of the current behavior aside, there might be a design choice to be made here: @krisgesling, what are your thoughts on the extractors and percentages?

krisgesling commented 3 years ago

Yeah, agreed - the % is inherently tied to the number, e.g. it's not the same as "50 apples"; if anything, it's closer to "0.5".

Thanks for digging into this @Badboy-16 :)

JarbasAl commented 3 years ago

since the point of normalize was to make intent parsing etc. easier, this just makes it harder to detect numbers or percentages; e.g., a voc file containing "percent" and "%" will no longer match in Adapt, and any downstream code that depends on tokens being number words might also suddenly fail

this change was intentionally part of the normalization process

ChanceNCounter commented 3 years ago

> this change was intentionally part of the normalization process

Okay but the current state of affairs is unacceptable.

JarbasAl commented 3 years ago

then normalize the symbol into a word
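A rough sketch of what that could look like, assuming English only. `percent_to_word` is a hypothetical helper for illustration, not an existing lingua_franca function.

```python
import re


def percent_to_word(text):
    """Rewrite "<number>%" as "<number> percent" (hypothetical, English only)."""
    # Capture the number (with an optional decimal part), allow optional
    # whitespace before the "%" sign, and replace the sign with a word.
    return re.sub(r"(\d+(?:\.\d+)?)\s*%", r"\1 percent", text)


print(percent_to_word("Set Volume to 50%."))
# Set Volume to 50 percent.
```

Any punctuation after the sign is left untouched, so the trailing period stays attached to a word, not to a bare symbol.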

ChanceNCounter commented 3 years ago

I think we might be talking about different things here. The periods in the issue title are literal.

The normalizer handles "5%" correctly. It mishandles "5%.", returning "5 %."

"%." is nothing.