CharsetDetector / UTF-unknown

Character set detector built in C# - .NET 5+, .NET Core 2+, .NET Standard 1+ & .NET 4+

notepad-plus-plus revert "Update uchardet to 0.0.6 ..." #80

Open rstm-sf opened 4 years ago

rstm-sf commented 4 years ago

Hello!

notepad-plus-plus reverted "Update uchardet to 0.0.6 to improve UTF-8 detection quality": notepad-plus-plus/notepad-plus-plus#5414

We need to look at the corresponding changes and fix them on our side, since after #52 we have a lot in common with that code (#74 now looks like the beginning of a solution).

rstm-sf commented 4 years ago

"in short words: it works as expected on all operating systems, except windows" (@MetaChuh, https://github.com/notepad-plus-plus/notepad-plus-plus/pull/5414#issuecomment-472548342)

304NotModified commented 4 years ago

I don't see how #74 helps with this, but it would be nice if we didn't have this issue.

Reverting the uchardet changes (#52) sounds like a bad idea anyway.

in short words: it works as expected on all operating systems, except windows

That's horrible! .NET is 90% Windows?

Any idea how to start fixing this?

rstm-sf commented 4 years ago

Reverting the uchardet changes (#52) sounds like a bad idea anyway.

I didn’t mean that we should revert, but that we should try to improve based on the knowledge gained :)

.NET is 90% Windows?

I think some of the changes broke the encodings obtained from the Win32 API (sorry, but I don’t know how they get the encoding). For example, https://github.com/alberto-dev/notepad-plus-plus/commit/a504ebba54c41309f42006f8d82ecea435085731#diff-ada290d05258a2a91d5a3e19690f89acL340

Perhaps there were other changes, in addition to the encoding names, that also had a negative effect.

Any idea how to start fixing this?

Start by correcting the encoding names: #75. Then we'll see how it goes :)
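As a rough illustration of why the names matter: the same charset can be reported under several spellings, so labels usually have to be normalized before they are compared with what the platform expects. The sketch below uses Python's standard `codecs.lookup` purely for illustration. The `canonical_name` and `same_encoding` helpers are hypothetical and are not part of UTF-unknown, uchardet, or Notepad++.

```python
import codecs


def canonical_name(label):
    """Return Python's canonical codec name for a label, or None if unknown."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        # The label is not a known codec name or alias.
        return None


def same_encoding(a, b):
    """True when both labels resolve to the same known codec."""
    ca, cb = canonical_name(a), canonical_name(b)
    return ca is not None and ca == cb


if __name__ == "__main__":
    # Print how a few differently spelled labels are resolved.
    for label in ("UTF8", "windows-1252", "no-such-charset"):
        print(label, "->", canonical_name(label))
```

A detector that emits a label the consumer cannot resolve (the `None` case here) behaves like a wrong detection even when the underlying guess was right, which may be part of what went wrong on Windows.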

rstm-sf commented 4 years ago

Maybe this issue will be resolved by #76 (see https://github.com/alberto-dev/notepad-plus-plus/commit/a504ebba54c41309f42006f8d82ecea435085731#diff-18d581d96114cd69e207975bf1c4fa43L249)

rstm-sf commented 4 years ago

Take a look: this is how the new encoding detections were deleted (https://github.com/notepad-plus-plus/notepad-plus-plus/pull/5414/commits/9a39fafd335f2e1e5af4b5a3251c7cd961ee5fe9#diff-7c6715d4fafa723d6682f3b295c32875L82)

This allowed us to discard cases where the same metrics arise (https://github.com/CharsetDetector/UTF-unknown/issues/77#issuecomment-573397518)
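To make the "same metrics" case concrete: when two candidate encodings end up with the same confidence, there is no principled way to prefer one over the other. A minimal sketch of detecting that situation, assuming candidates are simple (name, confidence) pairs. `pick_best` and its tolerance are hypothetical, and this is not the logic used in either project.

```python
import math


def pick_best(candidates, tol=1e-9):
    """Return the name of the highest-confidence candidate.

    Returns None when there are no candidates, or when the two best
    candidates have confidences within `tol` of each other (a tie).
    """
    if not candidates:
        return None
    # Sort by confidence, highest first.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if len(ranked) > 1 and math.isclose(
        ranked[0][1], ranked[1][1], rel_tol=0.0, abs_tol=tol
    ):
        return None  # ambiguous: the top two candidates are tied
    return ranked[0][0]


if __name__ == "__main__":
    print(pick_best([("utf-8", 0.9), ("windows-1252", 0.5)]))
    print(pick_best([("windows-1251", 0.5), ("windows-1252", 0.5)]))
```

Removing a prober, as in the commit linked above, avoids the tie by never producing the competing candidate. Flagging the tie instead keeps the extra detection but pushes the decision to the caller.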