chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

Preprocessing: replace em/en/doubled dash to single dash #274

Closed saippuakauppias closed 4 years ago

saippuakauppias commented 5 years ago

context

Texts often contain several different types of dashes, and sometimes they need to be normalized to a single form.

proposed solution

Replace em dashes, en dashes, and doubled hyphens (--) with a single hyphen (-).

bdewilde commented 4 years ago

Hi @saippuakauppias, pardon the late reply. I'm not totally sure this is a good idea, since different dashes mean different things and the right choice may depend on context. What's your use case? Are there concerns about changing a text's meaning by normalizing dashes?

saippuakauppias commented 4 years ago

My use case is minimizing the number of distinct symbols in the text to train an ML model better. Maybe it's only needed by me, I don't know :)

bdewilde commented 4 years ago

Hi @saippuakauppias, on reflection, I think there's no good, general-purpose solution here, since the precise meaning of dashes depends so much on context and personal preference, and forcing a standard form could easily mangle meanings. So I'd rather leave it to users, depending on their particular needs. Yours might be met by re.sub(r"(—|–|-{2,})", "-", text).
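For anyone landing here with the same need, the one-liner above can be wrapped in a small helper. This is only a sketch, not part of textacy: the name normalize_dashes is hypothetical, and the pattern is the one suggested above, with the em dash and en dash written as Unicode escapes (U+2014 and U+2013) so they survive copy and paste.

```python
import re

# Same pattern as suggested in the thread: an em dash (U+2014),
# an en dash (U+2013), or a run of two or more ASCII hyphens.
DASH_RE = re.compile("(\u2014|\u2013|-{2,})")


def normalize_dashes(text: str) -> str:
    """Hypothetical helper (not a textacy API): collapse the dash
    variants matched by DASH_RE into a single ASCII hyphen."""
    return DASH_RE.sub("-", text)


sample = "pages 10\u201312 -- see notes \u2014 and more"
print(normalize_dashes(sample))
```

Single hyphens are left untouched, so hyphenated words are unaffected. Each match is replaced independently, so an em dash directly followed by an en dash would still yield two hyphens; whether that matters depends on the corpus.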