Closed belav closed 1 year ago
Hello, could you clarify what result you want to get? And why?
Let me preface this with the fact that my knowledge of file encodings is fairly limited. I also originally implemented the encoding-handling code in CSharpier around two years ago and have forgotten exactly what problem led me to use UTF-unknown.
I'm using UTF-unknown to detect file encodings so I can read in the file contents properly. Take a file with the following content:
public enum MeetingLocation
{
    Café,
    Restaurant
}
If I have the file saved as UTF8, then UTF-unknown gives me the following detections:
detection.EncodingName   detection.Encoding                            detection.Confidence
windows-1250             System.Text.SBCSCodePageEncoding              0.7516818
utf-8                    System.Text.UTF8Encoding+UTF8EncodingSealed   0.505
windows-1252             System.Text.SBCSCodePageEncoding              0.3846154
If I read in the file contents using System.Text.SBCSCodePageEncoding (the top detection, windows-1250), then I get the following content, which is invalid C#, and I am unable to parse it into a SyntaxTree:
public enum MeetingLocation
{
    CafĂ©,
    Restaurant
}
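The effect described above can be reproduced without UTF-unknown. The following is a minimal sketch, not code from CSharpier or UTF-unknown: it encodes a string as UTF-8 and then decodes the same bytes with the wrong code page. The class and method names are made up for illustration, and the sketch assumes a runtime where CodePagesEncodingProvider is available (.NET Core / .NET 5+, or the System.Text.Encoding.CodePages package).

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: encodes text as UTF-8, then decodes those bytes
    // with a different (wrong) encoding to show what a misdetection produces.
    public static string DecodeUtf8BytesAs(string text, string wrongEncodingName)
    {
        // Legacy code pages such as windows-1250 are only available on
        // .NET Core / .NET 5+ after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.GetEncoding(wrongEncodingName).GetString(utf8Bytes);
    }
}
```

Calling MojibakeDemo.DecodeUtf8BytesAs("Caf\u00E9", "windows-1250") returns "CafĂ©": the letter é is stored in UTF-8 as the two bytes C3 A9, and windows-1250 maps each of those bytes to its own character.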
I wasn't aware of the multiple detections until today, and was just using detectionResult.Detected.Encoding to read the file. I suspected that UTF-unknown might be failing to detect this file as UTF8, and wanted to confirm whether that was the case before looking into other possible solutions, such as trying to read the file with each of the other encodings when more than one is detected.
I did try replacing the content of tests/Data/utf-8/1.txt with the enum code and running the tests, which resulted in the following error:
Charset detection failed for C:\projects\UTF-unknown\Tests\Data\utf-8\1.txt. Expected: utf-8, detected: windows-1250 (75.16818% confidence)
Expected string length 5 but was 12. Strings differ at index 0.
Expected: "utf-8", ignoring case
But was: "windows-1250"
If I add the enum before or after the existing text in 1.txt, then the test passes.
The library uses a heuristic approach to detect encodings. The less input data it has, the more likely it is to make an error. You can read more about it here: https://www-archive.mozilla.org/projects/intl/universalcharsetdetection
Ah okay, that makes sense. I can make use of the multiple detections the library gives me and test each one. Thanks!
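One way to "test each one" could look like the sketch below. It is an illustration under stated assumptions, not the code CSharpier ended up with: the class name, method name, and the isAcceptable callback are all hypothetical. The candidate encodings would come from the detection results (the detection.Encoding column above), ordered however the caller prefers, and the callback stands in for whatever check decides that the decoded text is usable, for example that it parses without errors.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingFallback
{
    // Hypothetical helper: decodes the bytes with each candidate encoding in
    // order and returns the first result the caller-supplied check accepts.
    // Returns null when no candidate produces acceptable text.
    public static string TryDecode(
        byte[] bytes,
        IEnumerable<Encoding> candidates,
        Func<string, bool> isAcceptable)
    {
        foreach (Encoding candidate in candidates)
        {
            string text = candidate.GetString(bytes);
            if (isAcceptable(text))
            {
                return text;
            }
        }

        return null;
    }
}
```

A caller might pass the encodings from every detection rather than only the most confident one, so that a wrong first guess on a short file falls through to the next candidate instead of producing unparseable text.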
With the following file contents
When the file is saved as UTF8, I get the following detections for encoding.
When the file is saved as UTF8-BOM, then I just get a single detection