ageitgey / node-unfluff

Automatically extract body content (and other cool stuff) from an html document
Apache License 2.0

Purpose of removing periods immediately followed by letters? #69

Open knod opened 7 years ago

knod commented 7 years ago

This line has been giving me unexpected results, and I'm thinking of removing it: https://github.com/ageitgey/node-unfluff/blob/master/src/formatter.coffee#L71 (txt = txt.replace(/(\w+\.)([A-Z]+)/, '$1 $2')). What is its purpose? What side effects might I get from removing it?

The problem I'm hitting right now is with initialisms such as C.R.T.: they get split into separate words, so I end up with "C.", "R.", and "T.". I'm not sure how else to solve this.
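
To make the behaviour concrete, here is a minimal JavaScript sketch of what the quoted replacement does to a plain string. The function names and sample strings are made up for illustration, and how often the library applies the line to a given piece of text is not shown in this issue, so calling it twice below is an assumption. My guess (not confirmed by the source) is that the line is meant to restore the missing space when two sentences end up glued together, as in the first example. The last function is a hypothetical narrower pattern, not something from node-unfluff.

```javascript
// The replacement exactly as quoted in the issue. There is no "g" flag,
// so each call rewrites only the first match in the string.
function addSpaceAfterPeriod(txt) {
  return txt.replace(/(\w+\.)([A-Z]+)/, '$1 $2');
}

// Two sentences glued together: a space is inserted after the period.
const sentences = addSpaceAfterPeriod('First sentence.Second sentence.');
// 'First sentence. Second sentence.'

// An initialism also matches, because \w+ accepts a single capital letter.
const once = addSpaceAfterPeriod('C.R.T. monitor');
// 'C. R.T. monitor'

// Assumption: if the same text passes through the replacement again,
// the next letter is split off too, giving the "C.", "R.", "T." pieces.
const twice = addSpaceAfterPeriod(once);
// 'C. R. T. monitor'

// Hypothetical alternative: only split when the period follows at least
// two lowercase letters, which leaves dotted initialisms untouched.
function addSpaceBetweenSentences(txt) {
  return txt.replace(/([a-z]{2,}\.)([A-Z])/g, '$1 $2');
}

const fixed = addSpaceBetweenSentences('First sentence.Second sentence.');
// 'First sentence. Second sentence.'
const kept = addSpaceBetweenSentences('C.R.T. monitor');
// 'C.R.T. monitor'
```

The narrower pattern has its own trade-off: it would no longer insert a space after a sentence that ends in a capitalised word, a digit, or an initialism, so removing or changing the line depends on which of these cases matters more for the pages being parsed.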