saippuakauppias closed this issue 4 years ago
Hi @saippuakauppias, pardon the late reply. I'm not totally sure this is a good idea, since different dashes mean different things and may depend on context. What's your use case? Are there concerns about changing a text's meaning by normalizing dashes?
My use case is minimizing the number of distinct symbols in the text so that an ML model trains better. Maybe it's only needed by me, I don't know :)
Hi @saippuakauppias, on reflection, I think there's no good, general-purpose solution here, since the precise meaning of dashes depends so much on context and personal preference, and forcing a standard form could easily mangle meanings. So, I'd rather leave it to users depending on their particular needs. Yours might be met by re.sub(r"(—|–|-{2,})", "-", text).
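For anyone landing here with the same need, the one-liner above can be wrapped in a small helper. This is only a sketch, not part of any library: the name normalize_dashes and the repl parameter are hypothetical, and it assumes you also want to cover the horizontal bar (U+2015) listed in the issue description, which the regex above does not match.

```python
import re

# Dash-like sequences to collapse: en dash (U+2013), em dash (U+2014),
# horizontal bar (U+2015), or a run of two or more ASCII hyphens.
# Escapes are used so the pattern survives copy/paste and re-encoding.
_DASH_RE = re.compile(r"[\u2013\u2014\u2015]|-{2,}")


def normalize_dashes(text: str, repl: str = "-") -> str:
    """Replace each dash-like sequence in `text` with `repl`.

    Hypothetical helper for illustration; single ASCII hyphens
    (as in "well-known") are left untouched.
    """
    return _DASH_RE.sub(repl, text)


if __name__ == "__main__":
    sample = "pages 3\u20135 \u2014 see notes -- or well-known cases"
    print(normalize_dashes(sample))
```

Whether collapsing everything to a plain hyphen is acceptable depends on the text, as discussed above: a range like 3–5 and a parenthetical break become indistinguishable afterwards, so this is best applied only as a preprocessing step where that loss does not matter.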
context
Texts often contain different types of dashes, but you need to bring them to one form.
proposed solution
Replace —/―/–/-- with -