CharsetDetector / UTF-unknown

Character set detector built in C# - .NET 5+, .NET Core 2+, .NET Standard 1+ & .NET 4+

UTF-8 with emojis detected as pure ascii with 100% confidence #161

Open piranna opened 1 year ago

piranna commented 1 year ago

I think there are two bugs here:

  1. A pure ASCII string (bytes 0x00-0x7F) is also a valid UTF-8 string, so the detector should report both encodings. If UTF-8 can't also get 100% confidence, it could get 99% so that ASCII keeps priority.
  2. If the text contains emojis or any other byte sequence outside the pure ASCII range, it is definitely NOT a pure ASCII string.
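
To make the two points concrete, here is a minimal, language-agnostic sketch in Python of the byte-level rule being described. It is not the library's code or API: the `classify` helper is hypothetical and only illustrates that every all-ASCII input is also valid UTF-8, while any byte above 0x7F rules out pure ASCII.

```python
def classify(data: bytes) -> list:
    """Return the candidate encodings consistent with the raw bytes.

    Hypothetical helper for illustration only; it does not mirror
    how UTF-unknown scores or ranks encodings.
    """
    candidates = []

    # Point 2: pure ASCII means every byte is in the range 0x00-0x7F.
    if all(b <= 0x7F for b in data):
        candidates.append("ascii")

    # Point 1: ASCII is a strict subset of UTF-8, so any input that passed
    # the check above also decodes here. Multi-byte sequences (accents,
    # emojis) decode here but fail the ASCII check.
    try:
        data.decode("utf-8", errors="strict")
        candidates.append("utf-8")
    except UnicodeDecodeError:
        pass

    return candidates


print(classify(b"hello"))                                 # ['ascii', 'utf-8']
print(classify("hello \U0001F600".encode("utf-8")))       # ['utf-8'] (U+1F600 is an emoji)
print(classify(b"\xff\xfe"))                              # []
```

A detector following this rule would never report ASCII for the emoji case, and could list UTF-8 as a second, slightly lower-ranked candidate for the all-ASCII case.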