Emurasoft / WordCount

Displays the number of characters, words, lines, or other items in the document or selection.
https://www.emeditor.com/
MIT License

error with such pattern #2

Closed: wisim closed this issue 4 years ago

wisim commented 4 years ago

[\x{20000}-\x{2A6D6}\x{2A700}-\x{2B734}\x{2B820}-\x{2CEA1}\x{2CEB0}-\x{2EBE0}\x{30000}-\x{3134A}]

MakotoE commented 4 years ago

Thank you for the issue. I don't understand what the error with this pattern is. I copied and pasted the string into EmEditor, ran WordCount, and saw no issue. Please elaborate.

wisim commented 4 years ago

This pattern is meant to count characters in the Unicode CJK extension blocks.

The issue is that it doesn't work with these code points.

Characters like 𠀁 are not matched, and WordCount counts all characters of the selected text instead.
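
For reference, here is a minimal sketch of what the pattern is expected to do, written in Python rather than in the regex syntax WordCount uses. Python's `re` module has no `\x{...}` escape, so the same ranges are written with `\UXXXXXXXX` escapes. The names `CJK_EXT` and `count_cjk_ext` are hypothetical and only serve this illustration.

```python
import re

# Same code point ranges as the pattern in this issue, using \UXXXXXXXX
# escapes because Python's re module does not accept \x{...}.
CJK_EXT = re.compile(
    "[\U00020000-\U0002A6D6\U0002A700-\U0002B734\U0002B820-\U0002CEA1"
    "\U0002CEB0-\U0002EBE0\U00030000-\U0003134A]"
)


def count_cjk_ext(text: str) -> int:
    """Count the characters of text that fall inside the ranges above."""
    return len(CJK_EXT.findall(text))


print(count_cjk_ext("\U00020000\U0002A6D6 jkjlkj"))  # 2 (U+20000 and U+2A6D6)
print(count_cjk_ext("\U00020001"))                   # 1 (U+20001, the character 𠀁)
```

With this reading of the pattern, only the two CJK extension characters in "𠀀𪛖 jkjlkj" would be counted, not the whole selection.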

MakotoE commented 4 years ago

I'm having trouble understanding because I don't know what programming language this is.

[\x{20000}-\x{2A6D6}\x{2A700}-\x{2B734}\x{2B820}-\x{2CEA1}\x{2CEB0}-\x{2EBE0}\x{30000}-\x{3134A}]

If I enter "𠀀𪛖", I get

Characters  2
Width       4
Words       0
Lines       1
View Lines  1
Pages       1

Which is correct.

Also, which EmEditor version did you use?

wisim commented 4 years ago

Oh, sorry, a correction: the problem is that if I enter "𠀀𪛖 jkjlkj", I get

Characters  9   

My version is 20.1.0.

MakotoE commented 4 years ago

That should be correct, though EmEditor gives the wrong character count on the status bar.

I asked Yutaka and he said that EmEditor counts UTF-16 surrogate pairs as two characters. WordCount counts a surrogate pair as one character. As of now, EmEditor's behavior will not change.
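
The difference between the two counting rules can be sketched in Python. This is only an illustration of the arithmetic, not EmEditor's or WordCount's actual code, and the function names are hypothetical.

```python
def count_code_points(text: str) -> int:
    """Count Unicode code points: a surrogate pair counts as one character."""
    return len(text)


def count_utf16_units(text: str) -> int:
    """Count UTF-16 code units: a surrogate pair counts as two characters."""
    return len(text.encode("utf-16-le")) // 2


sample = "\U00020000\U0002A6D6 jkjlkj"  # "𠀀𪛖 jkjlkj"
print(count_code_points(sample))  # 9
print(count_utf16_units(sample))  # 11
```

Both 𠀀 (U+20000) and 𪛖 (U+2A6D6) lie outside the Basic Multilingual Plane, so each takes two UTF-16 code units. Counting by code point gives 9 for this input, while counting by UTF-16 code unit gives 11.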