CharsetDetector / UTF-unknown

Character set detector built in C# - .NET 5+, .NET Core 2+, .NET Standard 1+ & .NET 4+

Add detect encoding with BOM: UTF-7 and GB-18030 #98

Closed rstm-sf closed 4 years ago

rstm-sf commented 4 years ago

Resolve #79

Add detect

rstm-sf commented 4 years ago

Simplified the checks in FindCharSetByBom, because len <= buf.Length always holds.

In the end, it is called from the following places:

https://github.com/CharsetDetector/UTF-unknown/blob/cb3dca2a51d4ccf769687cd18eebd12d60e8874b/src/CharsetDetector.cs#L142

https://github.com/CharsetDetector/UTF-unknown/blob/cb3dca2a51d4ccf769687cd18eebd12d60e8874b/src/CharsetDetector.cs#L199-L201

via

https://github.com/CharsetDetector/UTF-unknown/blob/cb3dca2a51d4ccf769687cd18eebd12d60e8874b/src/CharsetDetector.cs#L269

https://github.com/CharsetDetector/UTF-unknown/blob/cb3dca2a51d4ccf769687cd18eebd12d60e8874b/src/CharsetDetector.cs#L283

https://github.com/CharsetDetector/UTF-unknown/blob/cb3dca2a51d4ccf769687cd18eebd12d60e8874b/src/CharsetDetector.cs#L297-L299
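
For readers following the thread, here is a rough sketch of what BOM detection with a single up-front length guard could look like. It is not the code from this PR: the class `BomSketch`, the method `FindCharsetByBom`, and the returned labels are hypothetical names chosen for illustration. It assumes the caller guarantees `0 <= len <= buf.Length`, which is the invariant mentioned above, so no per-signature check against `buf.Length` is needed. The byte signatures themselves are the standard ones: `84 31 95 33` for GB-18030 and `2B 2F 76` followed by `38`, `39`, `2B` or `2F` for UTF-7.

```csharp
using System;

public static class BomSketch
{
    // Hypothetical helper, not the library's FindCharSetByBom.
    // Returns an illustrative charset label when buf starts with a known BOM,
    // otherwise null. Assumes the caller guarantees 0 <= len <= buf.Length.
    public static string FindCharsetByBom(byte[] buf, int len)
    {
        if (buf == null || len < 2)
            return null;

        // Four-byte signatures are checked first, so that UTF-32 LE
        // (FF FE 00 00) is not mistaken for UTF-16 LE (FF FE).
        if (len >= 4)
        {
            if (buf[0] == 0x84 && buf[1] == 0x31 && buf[2] == 0x95 && buf[3] == 0x33)
                return "gb18030";

            if (buf[0] == 0x2B && buf[1] == 0x2F && buf[2] == 0x76 &&
                (buf[3] == 0x38 || buf[3] == 0x39 || buf[3] == 0x2B || buf[3] == 0x2F))
                return "utf-7";

            if (buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
                return "utf-32le";

            if (buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
                return "utf-32be";
        }

        // Three-byte signature: UTF-8.
        if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
            return "utf-8";

        // Two-byte signatures: UTF-16.
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return "utf-16be";
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return "utf-16le";

        return null;
    }
}
```

Because `len` is trusted to be no larger than `buf.Length`, each branch only compares `len` against the signature length. If that invariant did not hold, every branch would also need its own bounds check against `buf.Length`.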

rstm-sf commented 4 years ago

Is it a bug or a feature?

rstm-sf commented 4 years ago

At the very least, adding a verification would be more efficient.

rstm-sf commented 4 years ago

It seemed better this way, because otherwise you need to add a length check every time.

304NotModified commented 4 years ago

Thanks for the refactor also :)