dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

strip_accents should be None by default in WordPiece #1528

Open sxjscience opened 3 years ago

sxjscience commented 3 years ago

Description

@leezu @szha @xinyual I noticed that we may need to set strip_accents to None in https://github.com/dmlc/gluon-nlp/blob/223f1f6f8e267d258abd2f299ec6fc4a9b2f1cf8/src/gluonnlp/data/tokenizers/huggingface.py#L564 so that accent stripping is turned on whenever lowercase is True.

This may impact the performance.
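
To illustrate the proposed default, here is a minimal sketch of the "None follows lowercase" rule, written with the standard library only. The function names `strip_accent_marks` and `bert_style_normalize` are hypothetical and are not part of gluon-nlp or of the underlying tokenizer library; the sketch only assumes that an unspecified strip_accents falls back to the value of lowercase, as described above.

```python
import unicodedata
from typing import Optional


def strip_accent_marks(text: str) -> str:
    """Drop combining marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def bert_style_normalize(text: str,
                         lowercase: bool = True,
                         strip_accents: Optional[bool] = None) -> str:
    """Hypothetical normalizer: strip_accents=None defers to `lowercase`."""
    # None means "not specified", so accent stripping follows `lowercase`.
    if strip_accents is None:
        strip_accents = lowercase
    if strip_accents:
        text = strip_accent_marks(text)
    if lowercase:
        text = text.lower()
    return text


if __name__ == "__main__":
    word = "M\u00f6chte"  # "Möchte"
    # strip_accents left unspecified: it follows lowercase=True.
    print(bert_style_normalize(word))
    # strip_accents explicitly disabled: the umlaut survives lowercasing.
    print(bert_style_normalize(word, lowercase=True, strip_accents=False))
    # lowercase=False and strip_accents unspecified: text is left unchanged.
    print(bert_style_normalize(word, lowercase=False))
```

With a default of False instead of None, the first call would behave like the second one, which is the mismatch this issue points at.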

sxjscience commented 3 years ago

However, accents may have certain meanings for lots of languages, e.g., mochte vs. möchte. Thus, we may try to turn it off in nlp_process.
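
The concern can be made concrete with a small sketch that groups words which become identical once accents are stripped. This is only an illustration using Python's standard library; `strip_accent_marks` and `find_collisions` are hypothetical helper names, not functions from gluon-nlp.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def strip_accent_marks(text: str) -> str:
    """Drop combining marks (category Mn) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def find_collisions(words: Iterable[str]) -> Dict[str, List[str]]:
    """Return groups of distinct words that share one accent-stripped form."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        groups[strip_accent_marks(word)].append(word)
    # Keep only the stripped forms that more than one input word maps to.
    return {key: group for key, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    vocabulary = ["mochte", "m\u00f6chte", "schon", "sch\u00f6n", "Haus"]
    for stripped, originals in find_collisions(vocabulary).items():
        print(stripped, "<-", originals)
```

Each reported group is a pair of words that a tokenizer with accent stripping enabled could no longer tell apart.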

leezu commented 3 years ago

Thus, we may try to turn it off in nlp_process.

Do you mean exposing an option in nlp_process or changing the defaults in nlp_process? As English is a special case that doesn't care much about accents, I suggest we keep the option to keep accents in nlp_process.

sxjscience commented 3 years ago

Yes, I mean to always explicitly set “strip_accents” to False in nlp_process.
