CharsetDetector / UTF-unknown

Character set detector build in C# - .NET 5+, .NET Core 2+, .NET standard 1+ & .NET 4+
307 stars 46 forks source link

Update README, resolve some todo and add tests #99

Closed rstm-sf closed 4 years ago

rstm-sf commented 4 years ago

Resolve #78

What do you need to do for this?

rstm-sf commented 4 years ago

See SO for X-ISO-10646-UCS-4-3412 and X-ISO-10646-UCS-4-2143

rstm-sf commented 4 years ago

VISCII is an unofficially-defined modified ASCII character encoding for using the Vietnamese language with computers.

rstm-sf commented 4 years ago

euc-tw, iso-8859-10, iso-8859-16

Perhaps help answer in dotnet/corefx#42623

rstm-sf commented 4 years ago

euc-tw, iso-8859-10, iso-8859-16

Perhaps help answer in dotnet/corefx#42623

It's not supported

rstm-sf commented 4 years ago

@304NotModified, hello!

* (Optional) to get rid of `try / catch` in ctor DetectionDetail

It looks like it will have to be left. Because, I was so glad that you can support other encodings in Core that I forgot about Linux :( And it seems that the user himself must write similar code before using the library

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
rstm-sf commented 4 years ago

It looks like this works for Ubuntu too

$ dotnet test
Test run for UTF-unknown/tests/bin/Debug/netcoreapp3.0/UtfUnknown.Tests.dll(.NETCoreApp,Version=v3.0)
Microsoft (R) Test Execution Command Line Tool Version 16.3.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...

A total of 1 test files matched the specified pattern.

Test Run Successful.
Total tests: 91
     Passed: 91
 Total time: 2.4248 Seconds
304NotModified commented 4 years ago

We could also run the tests on Travis CI (mono/net core) on Mac or Linux.

What do you think?

304NotModified commented 4 years ago

(maybe azure devops also supports Linux)

rstm-sf commented 4 years ago

What do you think?

To be honest, I'm not good at it :) Perhaps it’s worth it, since it seems to me that the encodings are platform-dependent

I just thought that if definitions are available, are implementations on different OS available? Because it seemed to me that Codepage calls win32api and I got confused in the documentation

And the main question I’m thinking about now is whether we should add additional encodings, or should this be left to the discretion of the user? Which approach is better?

rstm-sf commented 4 years ago

(maybe azure devops also supports Linux)

By the way, I recently left a link to bookmarks for review

304NotModified commented 4 years ago

And the main question I’m thinking about now is whether we should add additional encodings, or should this be left to the discretion of the user? Which approach is better?

I prefer the first :)

rstm-sf commented 4 years ago

Add batch-file test for unsupported encoding

Also, it would be necessary to highlight tests for encodings associated with this dictionary https://github.com/CharsetDetector/UTF-unknown/blob/303f0243024bd24fc1733dbcc47468ddf0eebf1e/src/DetectionDetail.cs#L18-L24

But this is another task

rstm-sf commented 4 years ago
 (Optional) to get rid of `try / catch` in ctor DetectionDetail

This cannot be done because #100

rstm-sf commented 4 years ago

@304NotModified, what do you think, is it worth pointing out in README that the result may turn out different when detected based on bytes? https://github.com/CharsetDetector/UTF-unknown/issues/38#issuecomment-506960736

In my opinion, we need to determine one algorithm for all options, but this will change the result for some

rstm-sf commented 4 years ago

It seems that everything is ready, the following questions remained:

because #100

@304NotModified, what do you think, is it worth pointing out in README that the result may turn out different when detected based on bytes? #38 (comment)

rstm-sf commented 4 years ago

Build FAILED.

Ok.

rstm-sf commented 4 years ago

It seems that everything is ready, the following questions remained:

@304NotModified, what do you think, is it worth pointing out in README that the result may turn out different when detected based on bytes? #38 (comment)

It’s probably better to consider it separately.

304NotModified commented 4 years ago

Looks good to me, so merged :)