azist / azos

A to Z Sky Operating System / Microservice Chassis Framework
MIT License

Detect BOM/Encoding in `FileSource : StreamSource` implementation so encoding can be properly read. Adjust the Laconfig reader accordingly #845

Closed itadapter closed 1 year ago

itadapter commented 1 year ago

The current StreamSource implementation does not handle a BOM in the file preamble when it reads the stream, so the API is not usable for reading files as-is. It must auto-detect the encoding, similarly to System.IO.StreamReader.

See BOM detection below.

Related to #731
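
For reference, the `System.IO.StreamReader` behavior this issue asks to match can be sketched as follows. This is only an illustration of the framework's built-in detection, not the Azos `StreamSource` code; the class and method names (`StreamReaderBomDemo`, `ReadAllText`) are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only (not part of Azos).
public static class StreamReaderBomDemo
{
    // Reads all text from a stream and lets StreamReader pick the encoding from the BOM.
    // 'fallback' is used only when the stream does not start with a recognized BOM.
    public static string ReadAllText(Stream stream, Encoding fallback, out Encoding detected)
    {
        using (var reader = new StreamReader(
            stream,
            fallback,
            detectEncodingFromByteOrderMarks: true,
            bufferSize: 1024,
            leaveOpen: true)) // the caller keeps ownership of the stream
        {
            var text = reader.ReadToEnd();

            // CurrentEncoding reflects the detected encoding only after the first read
            detected = reader.CurrentEncoding;
            return text;
        }
    }
}
```

With a stream that starts with `FF FE` followed by UTF-16 LE data, the returned text excludes the BOM and `detected` is the UTF-16 LE encoding. `StreamReader` recognizes the UTF-8, UTF-16 (LE/BE) and UTF-32 (LE/BE) marks.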

itadapter commented 1 year ago

https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding.preamble?view=net-6.0
https://en.wikipedia.org/wiki/Byte_order_mark

The Unicode byte order mark (BOM) is serialized as follows (in hexadecimal):

| Encoding | Representation (hexadecimal) | Representation (decimal) | Bytes as CP1252 characters |
| -- | -- | -- | -- |
| UTF-8 | EF BB BF | 239 187 191 | ï»¿ |
| UTF-16 (BE) | FE FF | 254 255 | þÿ |
| UTF-16 (LE) | FF FE | 255 254 | ÿþ |
| UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | ^@^@þÿ (^@ is the null character) |
| UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ^@^@ (^@ is the null character) |
| UTF-7 | 2B 2F 76 | 43 47 118 | +/v |
| UTF-1 | F7 64 4C | 247 100 76 | ÷dL |
| UTF-EBCDIC | DD 73 66 73 | 221 115 102 115 | Ýsfs |
| SCSU | 0E FE FF | 14 254 255 | ^Nþÿ (^N is the "shift out" character) |
| BOCU-1 | FB EE 28 | 251 238 40 | ûî( |
| GB18030 | 84 31 95 33 | 132 49 149 51 | „1•3 |

// Requires: using System.IO; using System.Text;

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when no known BOM is found.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read up to 4 bytes of the BOM; a short file yields fewer bytes
    var bom = new byte[4];
    int read = 0;
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        int n;
        while (read < bom.Length && (n = file.Read(bom, read, bom.Length - read)) > 0) read += n;
    }

    // Analyze the BOM, comparing only the bytes that were actually read.
    // Otherwise a 2-byte file holding just FF FE would be mistaken for UTF-32LE.
    if (read >= 3 && bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7; // obsolete since .NET 5 (SYSLIB0001)
    if (read >= 3 && bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (read >= 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; // UTF-32LE, must precede UTF-16LE
    if (read >= 2 && bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; // UTF-16LE
    if (read >= 2 && bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; // UTF-16BE
    if (read >= 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); // UTF-32BE

    // We actually have no idea what the encoding is if we reach this point, so
    // you may wish to return null instead of defaulting to ASCII
    return Encoding.ASCII;
}
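
Since the issue concerns a stream-based source rather than a file name, the same check could be applied directly to a `Stream`. The sketch below is one possible shape under stated assumptions, not the Azos implementation: `BomSniffer` and `Detect` are hypothetical names, the stream is assumed to be seekable, and only the five encodings that `StreamReader` also recognizes are covered. It takes the byte patterns from `Encoding.GetPreamble()` and leaves the stream positioned just past the BOM, so a reader that follows does not see it as content.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, for illustration only (not part of Azos).
public static class BomSniffer
{
    // Longest preambles first, so UTF-32 LE (FF FE 00 00) is tried before UTF-16 LE (FF FE)
    private static readonly Encoding[] Candidates =
    {
        Encoding.UTF32,                 // FF FE 00 00
        new UTF32Encoding(true, true),  // 00 00 FE FF
        Encoding.UTF8,                  // EF BB BF
        Encoding.Unicode,               // FF FE
        Encoding.BigEndianUnicode       // FE FF
    };

    // Returns the encoding indicated by the BOM and positions the stream after it.
    // Returns 'fallback' and restores the original position when no BOM matches.
    public static Encoding Detect(Stream stream, Encoding fallback)
    {
        if (!stream.CanSeek) throw new NotSupportedException("A seekable stream is assumed");

        long start = stream.Position;
        var head = new byte[4];
        int read = 0;
        while (read < head.Length)
        {
            int n = stream.Read(head, read, head.Length - read);
            if (n == 0) break; // end of stream
            read += n;
        }

        foreach (var candidate in Candidates)
        {
            byte[] bom = candidate.GetPreamble();
            if (StartsWith(head, read, bom))
            {
                stream.Position = start + bom.Length; // skip the BOM
                return candidate;
            }
        }

        stream.Position = start; // no BOM: nothing to skip
        return fallback;
    }

    // True when the first 'count' valid bytes of 'data' begin with 'prefix'
    private static bool StartsWith(byte[] data, int count, byte[] prefix)
    {
        if (count < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i]) return false;
        return true;
    }
}
```

A caller would then construct its reader with the returned encoding. Streams that cannot seek would need a buffering approach instead, which this sketch does not attempt.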