RadLikeWhoa / Countable

Add live paragraph-, word- and character-counting to an HTML element.
https://sacha.me/Countable
MIT License

Remove punctuation before counting words #1

Closed: freqdec closed this issue 11 years ago

freqdec commented 11 years ago

Hi, great script!

It would be great if text like "Bonjour !" weren't counted as two words. You would have to pass the string through a regExp that removes common punctuation characters before splitting it into words.

Also, line 66 can be rewritten without the .split i.e. from this:

characters: str ? str.replace(/\s/g, '').split('').length : 0

to this:

characters: str ? str.replace(/\s/g, '').length : 0

Again, great script - apologies for me being pedantic about details like this!
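A quick way to check that dropping the .split is safe: the minimal sketch below compares both expressions on a few sample strings. The function names and sample inputs are hypothetical and not taken from Countable itself. Both variants count UTF-16 code units, so the results should always agree.

```javascript
// Hypothetical helpers wrapping the two expressions quoted above.
function countCharsWithSplit(str) {
  return str ? str.replace(/\s/g, '').split('').length : 0;
}

function countChars(str) {
  return str ? str.replace(/\s/g, '').length : 0;
}

// Illustrative inputs: empty string, spaced punctuation, mixed whitespace.
const samples = ['', 'Bonjour !', 'two  words\n', 'a\tb c'];

// split('') yields one array entry per UTF-16 code unit, so its length
// equals the string's own length.
const allEqual = samples.every(
  (s) => countCharsWithSplit(s) === countChars(s)
);

console.log(allEqual);
console.log(countChars('Bonjour !'));
```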

RadLikeWhoa commented 11 years ago

Hi Brian,

thanks for the nice feedback, great to hear. I've already replaced line 66 (don't know what went through my head there), but I can't quite figure out the correct RegEx to get rid of the punctuation.

Which characters would be affected, anyway?

And don't worry about being pedantic, it's great to have another eye on the code to spot the little things. :)

freqdec commented 11 years ago

Hi Sacha,

Thinking further, the punctuation regExp will have to change according to language, e.g. the regExp for Spanish will need characters that are not necessary for English (the inverted question mark, for example).

It may be possible to create an uber regExp that covers most languages but you will never keep everyone happy! Here's a most terrible attempt at something that might work:

/['";:,.\/?¿\-!¡]/g

Good Luck!

epmatsw commented 11 years ago

If you want to remove punctuation, I think a better regex would be something like str.replace(/[^A-Za-z0-9 ]/g, ''). That should remove anything that's not a space, number, or letter.

On the other hand, I don't think that this is something that would be desirable. It's not very intuitive, and it makes it so that count.js output doesn't match Microsoft Word's count, which would probably be the standard you'd want to follow.

freqdec commented 11 years ago

Hi Will, your regExp will fail dramatically on any language that has accented characters.
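To illustrate the failure, here is a minimal sketch comparing the ASCII-only pattern with a Unicode-aware alternative. The alternative relies on Unicode property escapes (\p{L}, \p{N}), which need an ES2018-capable engine and were not available when this thread was written. The variable names and the sample sentence are assumptions made for the example.

```javascript
// The ASCII-only pattern suggested above.
const asciiOnly = /[^A-Za-z0-9 ]/g;

// Assumed alternative: keep any Unicode letter, number, or whitespace.
// Requires the "u" flag and an ES2018+ engine.
const unicodeAware = /[^\p{L}\p{N}\s]/gu;

// "Où est le café ?" written with escapes so the accented letters are
// guaranteed to be precomposed code points.
const sample = 'O\u00F9 est le caf\u00E9 ?';

const asciiResult = sample.replace(asciiOnly, '');
const unicodeResult = sample.replace(unicodeAware, '');

console.log(JSON.stringify(asciiResult));   // accented letters are lost
console.log(JSON.stringify(unicodeResult)); // only the "?" is removed
```

One caveat: text in decomposed form (a base letter followed by a combining accent) would also need \p{M} in the class, or a call to normalize('NFC') first.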

RadLikeWhoa commented 11 years ago

Not entirely sure, but I think it would only matter if a punctuation character is preceded by a space (e.g. a question mark or exclamation point in French). In that case, wouldn't it actually be safe to just remove those characters (plus the space), wherever needed?

freqdec commented 11 years ago

You are right! So this might work - looks for a space before a punctuation character and replaces them both...

.replace(/\s['";:,.\/?¿\-!¡]/g, '').split(/\s/).length
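A runnable sketch of this idea follows, with one deliberate change that is not from the thread: the punctuation run is only removed when it stands alone, i.e. when whitespace or the end of the string follows it. Without that lookahead, a space followed by an opening quote or an inverted question mark would be removed too, gluing two words together (say "hello" would collapse into one word). The function name countWords is hypothetical, not Countable's API.

```javascript
// Hypothetical helper, not Countable's actual implementation.
function countWords(str) {
  const cleaned = str
    // Drop punctuation runs that stand alone: preceded by the start of
    // the string or whitespace, followed by whitespace or the end.
    .replace(/(?:^|\s+)['";:,.\/?¿\-!¡]+(?=\s|$)/g, '')
    .trim();

  // Split on runs of whitespace so double spaces do not add empty words.
  return cleaned ? cleaned.split(/\s+/).length : 0;
}

console.log(countWords('Bonjour !'));
console.log(countWords('say "hello" now'));
console.log(countWords('Hola \u00BFqu\u00E9 tal?'));
```

Punctuation attached to a word (as in "tal?") stays part of that word, so it never changes the count.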

epmatsw commented 11 years ago

Ah yeah, don't know what I was thinking really. Still, I think the Microsoft Word question is valid. A solitary punctuation mark is also treated as a word by wc. I don't think getting different results from both of those is a good idea.

[Screenshots: "wordcount"; "Screen Shot 2013-03-14 at 8 05 49 AM"]
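For comparison, the wc-style behaviour described here can be approximated as counting maximal runs of non-whitespace characters. The sketch below assumes that rule; the function name is hypothetical, and JavaScript's \s may not match wc's locale-dependent notion of whitespace exactly.

```javascript
// Hypothetical helper approximating a whitespace-delimited word count.
function wcStyleWordCount(str) {
  const trimmed = str.trim();
  // Every run of non-whitespace counts as a word, punctuation included.
  return trimmed ? trimmed.split(/\s+/).length : 0;
}

console.log(wcStyleWordCount('Bonjour !'));
console.log(wcStyleWordCount('Bonjour!'));
```

Under this rule the spaced French form counts as two words and the unspaced form as one, which is the discrepancy under discussion.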

RadLikeWhoa commented 11 years ago

Just tested how some other tools treat this situation. Google Docs, Drafts for iOS and iA Writer all ignore the punctuation and count your example as three words. I think it would be better to follow the lead of more recent projects like the aforementioned. I'll look into it later.

epmatsw commented 11 years ago

Well, that makes sense. I guess it's up to you haha.