Suggestion: Refine Word Matching Regex

davidlday commented 6 years ago

The current regex uses a pretty liberal expression:

/\S+/g

I propose a more constrained expression that only matches on word characters + apostrophes:

/[\w’']+/g

I believe this more closely represents what an editor / publisher means by word count. It may also resolve #88 on excluding spaces from character count without adding a config item, as well as #55 on counting markdown syntax as words. And maybe this is what was meant by #2 as well?
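
As a rough illustration of the difference on Markdown input, here is a minimal sketch comparing the two patterns. The `countMatches` helper and the sample text are made up for this example and are not part of the package.

```javascript
// Current pattern: any run of non-whitespace characters counts as a word.
const CURRENT_PATTERN = /\S+/g;
// Proposed pattern: word characters plus curly and straight apostrophes.
const PROPOSED_PATTERN = /[\w’']+/g;

// Hypothetical helper: number of matches of `pattern` in `text`.
function countMatches(text, pattern) {
  const matches = text.match(pattern);
  return matches === null ? 0 : matches.length;
}

const sample = "## Title\n\n- a list item";

// The liberal pattern also counts "##" and "-".
console.log(countMatches(sample, CURRENT_PATTERN));  // 6
// The constrained pattern counts only "Title", "a", "list", "item".
console.log(countMatches(sample, PROPOSED_PATTERN)); // 4
```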

Be happy to submit a PR but wanted to run this by you first.

OleMchls commented 6 years ago

I like the idea, and yes, this is what was meant by #2. Could you give me an example of how the apostrophes make this regex more accurate than simply /\w+/g?

davidlday commented 6 years ago

@OleMchls - Cool, and sure! The apostrophes account for contractions (at least in English). The \w expression expands to [A-Za-z0-9_], which will still count words like don't as two words. In some academic cases, contractions count as two words. IIRC, a couple of NLP tokenizers I've worked with in the past behave this way, but I believe the intent here is more along the lines of how word processors behave. Contractions count as one word, not two. What do you think?
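
A small sketch of the contraction case, using a made-up sample sentence that contains one straight and one curly apostrophe:

```javascript
// Sample sentence: "don't" uses a straight apostrophe, "it’s" a curly one (U+2019).
const text = "I don't think it\u2019s broken";

// Plain \w+ splits each contraction at the apostrophe.
const plain = text.match(/\w+/g);
// Adding both apostrophe characters keeps each contraction in one piece.
const withApostrophes = text.match(/[\w’']+/g);

console.log(plain);           // [ 'I', 'don', 't', 'think', 'it', 's', 'broken' ]
console.log(withApostrophes); // [ 'I', "don't", 'think', 'it’s', 'broken' ]
```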

keelanfh commented 6 years ago

If \w does just expand to [A-Za-z0-9_] then surely this would cause problems for accented characters, etc.

For instance, the French word à would not be counted at all, and fête would be counted as two words.

Not sure if that's what \w actually expands to. Would be good to test. I just tried it out in Atom's find and replace interface and it suggests that is what it does.
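
For what it's worth, this can be checked directly in JavaScript, where \w is ASCII-only. A quick sketch using the two French words mentioned above (written with escapes so the precomposed characters are unambiguous):

```javascript
// "à" as the single precomposed character U+00E0.
const matchesA = "\u00E0".match(/\w+/g);
// "fête" with the precomposed "ê" (U+00EA).
const matchesFete = "f\u00EAte".match(/\w+/g);

console.log(matchesA);    // null: the word is not counted at all
console.log(matchesFete); // [ 'f', 'te' ]: counted as two words
```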

davidlday commented 6 years ago

@keelanfh - Excellent point. I found a couple of references to patterns that might work better:

Maybe this will work:

/[\w’'\u00C0-\u017F]+/g

I haven't tested yet, but will try to do so this weekend.

Thoughts?

keelanfh commented 6 years ago

Then that would run into the problem of other non-Latin scripts (e.g. Arabic) not being counted.
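
A sketch of that limitation, assuming the pattern proposed above. The last pattern, built on Unicode property escapes, is not something proposed in this thread. It is only one possible direction, it needs the `u` flag, and it still would not segment scripts that don't separate words with spaces.

```javascript
// Pattern proposed earlier: \w, apostrophes, and the U+00C0..U+017F Latin range.
const LATIN_EXTENDED = /[\w’'\u00C0-\u017F]+/g;
// Assumption (not from the thread): any Unicode letter or number, plus _ and apostrophes.
const UNICODE_LETTERS = /[\p{L}\p{N}_’']+/gu;

// Hypothetical helper: list of matches, or an empty array when there are none.
function words(text, pattern) {
  return text.match(pattern) || [];
}

const french = "f\u00EAte \u00E0 Paris";           // "fête à Paris"
const arabic = "\u0645\u0631\u062D\u0628\u0627"; // one Arabic word

console.log(words(french, LATIN_EXTENDED).length);  // 3
console.log(words(arabic, LATIN_EXTENDED).length);  // 0: Arabic is outside the range
console.log(words(arabic, UNICODE_LETTERS).length); // 1
```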

I'm just not sure exactly what the problem is that we're trying to solve. If there's an issue with Markdown syntax maybe it should just be fixed with something like

/[^\s#]+/g

?
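
A quick sketch of what this pattern would and wouldn't change, with made-up sample strings. It drops heading markers, but other Markdown syntax such as list bullets is still counted, and a `#` inside a token splits it.

```javascript
// Runs of characters that are neither whitespace nor "#".
const NO_HASH = /[^\s#]+/g;

// Hypothetical helper: list of matches, or an empty array when there are none.
function tokens(text) {
  return text.match(NO_HASH) || [];
}

console.log(tokens("## Title"));  // [ 'Title' ]
console.log(tokens("- item"));    // [ '-', 'item' ]: the bullet still counts
console.log(tokens("C# rocks"));  // [ 'C', 'rocks' ]
```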

davidlday commented 6 years ago

The problems are listed above: #88 and #55. The goal is to increase accuracy without adding any new settings.

If we can assume Markdown only, then we could use something like the Remove Markdown package and use the original regex to count what's left. I kind of like that idea, but I don't think the intent was to be Markdown-specific either.

What counts as a word isn't always so simple, apparently. How accurate does this need to be? Maybe it's accurate enough as is and if a user needs something more accurate then they'll have to do something else.

keelanfh commented 6 years ago

Yeah, I think it’s quite complicated. In order to make it more accurate we’d need to decide what we actually want the word count to look like.

davidlday commented 6 years ago

Maybe a better approach would be to adopt an existing word count package and implement filters for the various types of files (markdown, html, etc) and then run the results through an existing word count package. There's no shortage of them: https://www.npmjs.com/search?q=word%20count

This would relieve this package from the responsibility of defining what a word is and developing tests to validate. Thoughts?
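
Roughly, the shape I have in mind is sketched below. Everything here is hypothetical: the filter table, the very crude filters themselves, and `countWords`, which stands in for whichever third-party counter would be chosen.

```javascript
// Hypothetical per-extension filters: each takes raw text and returns prose only.
const filters = {
  // Crude Markdown filter: drops leading heading markers and list bullets,
  // then emphasis and code marks.
  md: (text) =>
    text.replace(/^[ \t]*(?:#+|[-*+])[ \t]+/gm, "").replace(/[*_`]/g, ""),
  // Crude HTML filter: replaces every tag with a space.
  html: (text) => text.replace(/<[^>]*>/g, " "),
};

// Stand-in for an existing word count package (assumption, not a real API).
function countWords(text) {
  return (text.match(/\S+/g) || []).length;
}

// Pick the filter for the file type, fall back to no filtering, then count.
function countFile(text, extension) {
  const filter = filters[extension] || ((t) => t);
  return countWords(filter(text));
}

const markdown = "# Title\n\n- **bold** item";
console.log(countFile(markdown, "md"));  // 3
console.log(countFile(markdown, "txt")); // 5: no filter, syntax is counted
```

The point of the split is that this package would only own the filters, while the definition of a word would live in the counting package.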

OleMchls commented 6 years ago

@davidlday @keelanfh first of all, thanks for your involvement <3 And special sorry to @davidlday for forgetting to follow up on your Feb. comment.

> Maybe a better approach would be to adopt an existing word count package and implement filters for the various types of files (markdown, html, etc) and then run the results through an existing word count package. There's no shortage of them: https://www.npmjs.com/search?q=word%20count
>
> This would relieve this package from the responsibility of defining what a word is and developing tests to validate. Thoughts?

Do you have a specific one you would recommend? I was scanning the list, but none of them really stood out to me. But generally, I do like the idea, given how complex the realm of word counting actually is.

Maybe in the meantime go with a more refined regex as you suggested.

For #55 there is another idea discussed in https://github.com/OleMchls/atom-wordcount/issues/65 which I also like; having different count functions per language extension.

davidlday commented 6 years ago

@OleMchls no worries! wordcount caught my eye because it supports English, CJK, and Cyrillic. Digging down through its dependencies to word-regex, the pattern it uses is:

/[a-zA-Z0-9_\u0392-\u03c9\u0400-\u04FF]+|[\u4E00-\u9FFF\u3400-\u4dbf\uf900-\ufaff\u3040-\u309f\uac00-\ud7af\u0400-\u04FF]+|[\u00E4\u00C4\u00E5\u00C5\u00F6\u00D6]+|\w+/g

So maybe the place to start is by leveraging word-regex?
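
A quick sketch of how that pattern behaves, with the regex copied from above rather than imported from the package, so treat it as an approximation of what word-regex actually ships. Note that it contains no apostrophes, so contractions would still be split.

```javascript
// Pattern as quoted above (not loaded from the word-regex package itself).
const WORD_REGEX = /[a-zA-Z0-9_\u0392-\u03c9\u0400-\u04FF]+|[\u4E00-\u9FFF\u3400-\u4dbf\uf900-\ufaff\u3040-\u309f\uac00-\ud7af\u0400-\u04FF]+|[\u00E4\u00C4\u00E5\u00C5\u00F6\u00D6]+|\w+/g;

// Hypothetical helper: list of matches, or an empty array when there are none.
function matchWords(text) {
  return text.match(WORD_REGEX) || [];
}

// "Hello" followed by a Cyrillic word, written with escapes.
console.log(matchWords("Hello \u043C\u0438\u0440").length); // 2
// The apostrophe is not in any character class, so the contraction splits.
console.log(matchWords("don't"));                           // [ 'don', 't' ]
```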

OleMchls / atom-wordcount

Suggestion: Refine Word Matching Regex #91