amake / flutter_charset_detector

Flutter plugin that detects the charset (encoding) of text bytes
https://pub.dev/packages/flutter_charset_detector

Cannot parse a lot of japanese articles (CHARSET: SHIFT_JIS) #7

Closed And96 closed 8 months ago

And96 commented 8 months ago

Example: https://news.kakaku.com/prdnews/cd=pc/ctcd=0171/id=138700/

The page displays correctly in Chrome, Firefox, and Postman.

Decoding works fine for other sites and languages.

When I parse the HTTP response body from some Japanese websites, I cannot decode it to a string.

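Code:

    ...
    // Awaiting get() yields a Response, not a Future<Response>.
    final Response response = await get(Uri.parse(url));
    ...
    // Detect the charset of the raw body bytes and decode them to a string.
    var encodedResponse = await CharsetDetector.autoDecode(response.bodyBytes);
    stringResponse = encodedResponse.string;
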
And96 commented 8 months ago

[image]

And96 commented 8 months ago

[image]

And96 commented 8 months ago

Expected result: [image]

Tested on both Windows and Android

On Windows I got: flutter: Caught error: MissingPluginException(No implementation found for method autoDecode on channel flutter_charset_detector)

On Android I got: I/flutter ( 8890): Caught error: PlatformException(DetectionFailed, The charset could not be detected, null, null)

And96 commented 8 months ago

Using the other library "charset_converter", it works fine. (On both Win+Android)

stringResponse = await CharsetConverter.decode('Shift_JIS', response.bodyBytes);

amake commented 8 months ago

> On Windows I got: flutter: Caught error: MissingPluginException(No implementation found for method autoDecode on channel flutter_charset_detector)

This plugin doesn't support Windows.

> I/flutter ( 8890): Caught error: PlatformException(DetectionFailed, The charset could not be detected, null, null)

The underlying Android detector is implemented differently and simply may not be able to detect the encoding in this particular case.

> Using the other library "charset_converter", it works fine. (On both Win+Android)
>
> stringResponse = await CharsetConverter.decode('Shift_JIS', response.bodyBytes);

If you already know that it's Shift-JIS then you don't need to "detect" anything, so there's no reason to use flutter_charset_detector in this case.

amake commented 8 months ago

If your use case is decoding web content, and you think you can trust the encoding returned by the Content-Type header, then you don't need to "detect" anything.
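
As a rough illustration of trusting the header, here is a minimal sketch in plain Dart that pulls the charset parameter out of a Content-Type value. The function name charsetFromContentType is hypothetical and not part of this plugin or of package:http, and the parsing is deliberately simple (it does not handle every quoting rule a full header parser would).

```dart
/// Hypothetical helper: returns the charset parameter of a Content-Type
/// header value, or null when the header is missing or names no charset.
String? charsetFromContentType(String? contentType) {
  if (contentType == null) return null;
  // Parameters follow the media type, separated by semicolons.
  for (final part in contentType.split(';').skip(1)) {
    final eq = part.indexOf('=');
    if (eq < 0) continue;
    final name = part.substring(0, eq).trim().toLowerCase();
    if (name != 'charset') continue;
    // Drop optional surrounding quotes, e.g. charset="Shift_JIS".
    final value = part.substring(eq + 1).trim().replaceAll('"', '');
    return value.isEmpty ? null : value;
  }
  return null;
}
```

If this returns a name, it could be passed to something like CharsetConverter.decode(name, response.bodyBytes) as in the snippet quoted earlier; if it returns null, falling back to CharsetDetector.autoDecode would be one reasonable choice.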

The time when you would "detect" something is:

amake commented 8 months ago

> The underlying Android detector is implemented differently and simply may not be able to detect the encoding in this particular case.

I tested this by doing the following:

  1. The underlying Android detector is juniversalchardet, implemented in Java. I wrapped this with a Groovy script I named juniversalchardet:

    #!/usr/bin/env groovy
    
    @Grab('com.github.albfernandez:juniversalchardet:2.4.0')
    import org.mozilla.universalchardet.UniversalDetector
    
    def charsetName = UniversalDetector.detectCharset(System.in)
    println(charsetName)

  2. I tested it on the site you specified like so:

    % curl -s 'https://news.kakaku.com/prdnews/cd=pc/ctcd=0171/id=138700/' | ./juniversalchardet
    SHIFT_JIS

As you can see, it did detect the encoding correctly.

Please double check that you are actually supplying the correct bytes to the autoDecode method, and if so, then please supply a snippet of code that reproduces the problem.

amake commented 8 months ago

Also it would help to know the Android version you're using.

And96 commented 8 months ago

I tried Android 11, 13, and 14, but it made no difference.

I decode web content from multiple websites, so the encoding is not always Shift-JIS, but I only had problems with Shift-JIS.

I tested again just now with the same websites and the same link, and it works fine. I didn't change a single line of code. I think the website changed something and now returns the content with the right encoding.

Consider this solved!

amake commented 8 months ago

Thanks for following up.

I was going to suggest that perhaps you aren't checking the HTTP response code, and are passing a garbage body to the detector. Even in that case it's odd that detection would fail, unless the server is returning a document with inconsistent encodings (which is also possible).
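
To make the status-code check concrete, here is a minimal sketch of a guard that could run before detection. The name bodyForDetection is hypothetical, and treating only 2xx responses as decodable is an assumption about what the caller wants, not something this plugin enforces.

```dart
import 'dart:typed_data';

/// Hypothetical guard: returns the body bytes only when the response looks
/// usable, and throws a StateError otherwise.
Uint8List bodyForDetection(int statusCode, Uint8List bodyBytes) {
  // Assumption: anything outside 2xx is an error page, not the article.
  if (statusCode < 200 || statusCode >= 300) {
    throw StateError('Unexpected HTTP status $statusCode; body not decoded');
  }
  // An empty body gives the detector nothing to work with.
  if (bodyBytes.isEmpty) {
    throw StateError('Empty response body');
  }
  return bodyBytes;
}
```

With the code from the original report, this would be called as CharsetDetector.autoDecode(bodyForDetection(response.statusCode, response.bodyBytes)), so a failed request surfaces as its own error rather than as a detection failure.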

If you ever manage to capture a response body that causes a detection error, please report it.