CharsetDetector / UTF-unknown

Character set detector built in C# - .NET 5+, .NET Core 2+, .NET Standard 1+ & .NET 4+

Object reference not set to an instance of an object. #125

Closed MIMAXUZ closed 3 years ago

MIMAXUZ commented 3 years ago

I have several files, and I can read each of them in its own encoding. But there is a problem reading one particular file. I read each file by first determining which code page its contents use.

I have the following code:

var enocder = CharsetDetector.DetectFromFile(path);
//int encodeResult = enocder.Detected != null ? enocder.Detected.Encoding.CodePage : 28591;
int encodeResult = enocder.Detected.Encoding.CodePage;

Error:

System.NullReferenceException: 'Object reference not set to an instance of an object.'

UtfUnknown.DetectionResult.Detected.get returned null.

No other file has this problem. When I open the file in Notepad, the encoding is shown as ANSI. The file is not empty and contains mostly text in the Cyrillic alphabet. I tried reading it as 1251 and as UTF-8, but the characters turn into ????. How can the problem be solved? Thank you!

rstm-sf commented 3 years ago

Hello!

It is possible that the library could not detect what encoding the file has.

When I open the file via notepad, Encoding shows ANSI.

Do you mean Notepad++? This library uses a slightly different algorithm; see https://github.com/CharsetDetector/UTF-unknown/issues/80

i2van commented 3 years ago

NullReferenceException is also thrown if the file is empty (file size is 0):

// Detect from File (NET standard 1.3+ or .NET 4+)
DetectionResult result = CharsetDetector.DetectFromFile("path/to/file.txt"); // or pass FileInfo

Maybe it also fails for other methods which accept strings/streams.

304NotModified commented 3 years ago

Please share a full stack trace, thanks!

i2van commented 3 years ago

I verified detection on an empty file, stream, and byte array, and it works as expected:

[Test]
public void CharsetDetector_EmptyStreamDetection_DetectedShouldBeNull()
{
    const string emptyFile = "empty.txt";

    File.Create(emptyFile).Dispose();

    Assert.IsNull(CharsetDetector.DetectFromFile(emptyFile).Detected);
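    // Dispose the stream so the file handle is released after the check.
    using (var stream = File.Open(emptyFile, FileMode.Open))
    {
        Assert.IsNull(CharsetDetector.DetectFromStream(stream).Detected);
    }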
    Assert.IsNull(CharsetDetector.DetectFromBytes(Array.Empty<byte>()).Detected);
}

@MIMAXUZ It means that encoding detection failed for your file. In this case charsetDetectorResult.Detected is null, so accessing charsetDetectorResult.Detected.Encoding throws.

304NotModified commented 3 years ago

Thanks for confirming, @i2van.

Indeed, Detected can be null if the detection failed.

The code at the start:

int encodeResult = enocder.Detected.Encoding.CodePage;

could indeed throw an exception.

Recommended usage:

int? encodeResult = enocder.Detected?.Encoding.CodePage;
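
If a concrete code page is needed rather than a nullable one, the null-conditional access can be combined with a fallback value. The following is only a minimal sketch: `EncodingFallback` and `ResolveCodePage` are hypothetical names that are not part of UTF-unknown, and the fallback of 28591 (ISO-8859-1) is taken from the commented-out line in the original post rather than being a recommendation. It also guards `Encoding` itself, on the assumption that it may be null when the detected charset is not available on the current platform.

```csharp
using UtfUnknown;

public static class EncodingFallback
{
    // Hypothetical helper, not part of the UTF-unknown API.
    // The default fallback (28591, ISO-8859-1) comes from the
    // commented-out line in the original post.
    public static int ResolveCodePage(string path, int fallbackCodePage = 28591)
    {
        DetectionResult result = CharsetDetector.DetectFromFile(path);

        // Detected is null when detection fails (for example, on an empty file).
        // Encoding is null-checked as well, as a precaution.
        return result.Detected?.Encoding?.CodePage ?? fallbackCodePage;
    }
}
```

With this sketch, `EncodingFallback.ResolveCodePage(path)` returns the fallback code page instead of throwing when nothing is detected. Whether a silent fallback is appropriate depends on the data; for Cyrillic text, decoding with the wrong code page will still produce unreadable characters.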