albfernandez / juniversalchardet

Originally exported from code.google.com/p/juniversalchardet

Always detecting US-ASCII for UTF-8 encoded files #35

Open neerajjain92 opened 4 years ago

neerajjain92 commented 4 years ago

I tried

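UniversalDetector detector = new UniversalDetector();
byte[] buf = new byte[4096];
int nread;
// try-with-resources closes the stream even if reading fails
try (FileInputStream fis = new FileInputStream(file)) {
    // Feed chunks until end of file or until the detector has decided
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
}
// Tell the detector that no more data is coming
detector.dataEnd();

// Read the detected charset name
String encoding = detector.getDetectedCharset();
System.out.println(encoding);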

It shows US-ASCII

albfernandez commented 4 years ago

On small files where all characters are ASCII, the default (now) is to return US_ASCII as the encoding. I need to see some sample if it is not the case.

amake commented 4 years ago

UTF-8 is a superset of ASCII, so if a file doesn't have any characters outside of ASCII, I don't think there's a meaningful way to identify it as UTF-8.
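
To illustrate the point above, here is a minimal sketch using only the JDK (the class and method names are made up for this example). It compares the bytes a string produces in US-ASCII and in UTF-8: for ASCII-only text the two byte sequences are identical, so a detector looking at the bytes alone has nothing to distinguish them by.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiSubsetDemo {
  // Hypothetical helper: true when the text encodes to the same bytes
  // in US-ASCII and in UTF-8.
  static boolean sameBytesInAsciiAndUtf8(String text) {
    byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(ascii, utf8);
  }

  public static void main(String[] args) {
    // ASCII-only text: both encodings yield identical bytes.
    System.out.println(sameBytesInAsciiAndUtf8("plain text")); // prints true
    // "m\u00e1s" contains one non-ASCII character (a with acute accent),
    // which UTF-8 encodes as two bytes and US-ASCII cannot represent.
    System.out.println(sameBytesInAsciiAndUtf8("m\u00e1s")); // prints false
  }
}
```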

yangsichen commented 4 years ago

It doesn't work when the input is too short. How can I solve it?
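
One possible workaround, offered as a sketch and not as anything the library or its maintainer prescribes in this thread: since ASCII bytes decode identically as UTF-8, the caller can treat a US-ASCII result (or no result at all) as UTF-8. The helper below is hypothetical and uses only the JDK. It takes the name returned by getDetectedCharset() and assumes that a null name means nothing was detected.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DetectedCharsetFallback {
  // Hypothetical helper: maps a detected charset name to a Charset,
  // falling back to UTF-8 when the name is null, unknown to this JVM,
  // or resolves to US-ASCII.
  static Charset orUtf8(String detectedName) {
    if (detectedName == null) {
      return StandardCharsets.UTF_8; // assumption: null means "not detected"
    }
    try {
      Charset detected = Charset.forName(detectedName);
      // ASCII-only content decodes the same way as UTF-8.
      return detected.equals(StandardCharsets.US_ASCII)
          ? StandardCharsets.UTF_8
          : detected;
    } catch (IllegalArgumentException e) {
      // Covers both illegal and unsupported charset names.
      return StandardCharsets.UTF_8;
    }
  }
}
```

With the snippet from the first comment, this would be called as DetectedCharsetFallback.orUtf8(detector.getDetectedCharset()). It does not help when the detector reports a wrong non-ASCII encoding.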

DarkTyger commented 7 months ago

albfernandez wrote: "I need to see some sample if it is not the case."

The following unit test has a string that will be detected as TIS-620, where UTF-8 would be preferred:

import org.junit.jupiter.api.Test;
import org.mozilla.universalchardet.UniversalDetector;

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;

public class EncodingTest {
  @Test
  public void test_Encoding_UTF8_UTF8() {
    final var bytes = testBytes();

    final var detector = new UniversalDetector( null );
    detector.handleData( bytes, 0, bytes.length );
    detector.dataEnd();

    final var expectedCharset = StandardCharsets.UTF_8;
    final var detectedCharset = detector.getDetectedCharset();

    assertNotNull( detectedCharset );

    final var actualCharset = Charset.forName( detectedCharset );

    assertEquals( expectedCharset, actualCharset );
  }

  private static byte[] testBytes() {
    return
      "One humid afternoon during the harrowing heatwave of 2060, Renato Salvatierra, a man with blood sausage fingers and a footfall that silenced rooms, received a box at his police station. Taped to the box was a ransom note; within were his wife's eyes. By year's end, a supermax prison overflowed with felons, owing to Salvatierra's efforts to find his beloved. Soon after, he flipped profession into an entry-level land management position that, his wife insisted, would be, in her words, *infinitamente más relajante*---infinitely more relaxing."
      .getBytes( StandardCharsets.UTF_8 ); // explicit charset; getBytes() alone uses the platform default
  }
}

Reports:

org.opentest4j.AssertionFailedError: 
Expected :UTF-8
Actual   :TIS-620

A similar input was detected as US-ASCII as well, even though it contained a diacritic.
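
For callers who prefer UTF-8 whenever the bytes are valid UTF-8, a strict decode can be run before consulting the detector. This is a sketch of that general technique using only the JDK, not a feature of juniversalchardet; the class and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
  // Hypothetical helper: true when the whole array decodes as UTF-8
  // without any malformed or unmappable sequence.
  static boolean isValidUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String text = "infinitamente m\u00e1s relajante";
    // The same text is valid as UTF-8 bytes but not as ISO-8859-1 bytes.
    System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));
    System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1)));
  }
}
```

If isValidUtf8 returns true, the caller can use UTF-8 and skip detection. Otherwise it can fall back to the detector. Note that a buffer cut off in the middle of a multi-byte character is reported as invalid, so the check is best applied to complete input.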