Oktai15 closed 4 weeks ago
I also encountered similar behavior:
text="Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und Pluralität in allen Sinnesbereichen.in allen Sinnen."
normalized_text="Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und Pluralität in allen S i n n e s b e r e i c h e n punkt in allen Sinnen."
The above is expected behavior. The normalizer assumes that consecutive sentences are separated by a period followed by at least one whitespace character. The string quoted above contains two clauses separated by a period with no whitespace after it, so the period is treated as part of a single token. Adding a space after the period yields the correct normalization.
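One way to apply that workaround before normalization is a small preprocessing step. The following is only a sketch: `add_space_after_period` is a hypothetical helper and not part of the normalizer, and the regular expression is an assumption about what counts as a missing sentence break.

```python
import re

# A period directly between a lowercase letter and another letter,
# with no whitespace after it (assumed to be a missing sentence break).
_MISSING_SPACE = re.compile(r"(?<=[a-zäöüß])\.(?=[A-Za-zÄÖÜäöüß])")


def add_space_after_period(text: str) -> str:
    """Insert a space after periods that are immediately followed by a letter."""
    return _MISSING_SPACE.sub(". ", text)


text = (
    "Das gibt uns Perspektive, Flexibilität, Optimismus, Engagement und "
    "Pluralität in allen Sinnesbereichen.in allen Sinnen."
)
fixed = add_space_after_period(text)
print(fixed)
```

Note that this heuristic cannot tell a missing space from a domain name: it would also turn `mail.nasa.gov` into `mail. nasa. gov`, so it is only suitable for text that contains no URLs.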
Hi!
I found a bug in English normalization. The following input is used:
text=Here is mail.nasa.gov.
norm_text=Here is mail dot nasa dot gov dot
expected output=Here is mail dot nasa dot gov.
A similar bug can be reproduced in German normalization. The following input is used:
text=Here is brettspielversand.de.
norm_text=Here is b r e t t s p i e l v e r s a n d punkt de punkt
expected output=Here is brettspielversand punkt de.
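Until this is fixed, a possible workaround is to detach the sentence-final period before normalizing and re-attach it afterwards. The sketch below is an assumption, not the library's behavior: `normalize_keeping_final_period` is a hypothetical wrapper, and `toy_normalize` is only a stand-in for the real normalizer call.

```python
from typing import Callable


def normalize_keeping_final_period(text: str, normalize: Callable[[str], str]) -> str:
    """Normalize text, keeping a single trailing period out of the normalizer's view."""
    stripped = text.rstrip()
    # Only handle one final period; leave ellipses such as ".." or "..." alone.
    if stripped.endswith(".") and not stripped.endswith(".."):
        return normalize(stripped[:-1]) + "."
    return normalize(stripped)


def toy_normalize(s: str) -> str:
    """Stand-in for the real normalizer: verbalizes every period as ' dot '."""
    return s.replace(".", " dot ")


result = normalize_keeping_final_period("Here is mail.nasa.gov.", toy_normalize)
print(result)
```

This only protects the last period of the input, so multi-sentence text would have to be split into sentences first.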
A similar problem occurs with:
text=KIM.com-Specials.
I got the same problem with websites in Spanish and Italian text.
I also found a specific bug in Spanish normalization. The following input is used:
text=El texto de Li Qin en este libro ahora está disponible en forma de libro electrónico.
norm_text=El texto de quincuagésimo primero Qin en este libro ahora está disponible en forma de libro electrónico.
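I'm not sure what the expected output is, but the current norm_text does not look right. It appears that "Li" is being read as the Roman numeral LI (51) and verbalized as an ordinal.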