NetTopologySuite / NetTopologySuite.IO.ShapeFile

The ShapeFile IO module for NTS.

Possible bug in class DbaseFileHeader's encoding detection? #66

Closed RenZachary closed 1 month ago

RenZachary commented 3 years ago

The column headers and attribute values of a shapefile I am reading contain Chinese characters. QGIS displays them correctly. When reading the column headers with either of the following two methods, the headers come out garbled, although the values are read correctly:

var sdr = new NetTopologySuite.IO.ShapeFile.Extended.ShapeDataReader(shpPath);
foreach (var f in sdr.ReadByMBRFilter(sdr.ShapefileBounds))
{
    // The header (attribute names) of f.Attributes here is not correct.
}

or


using (var rd = new ShapefileDataReader(readFile, factory, readEncoding))
{
    readHeader = rd.DbaseHeader;

    string[] fieldNames = new string[readHeader.NumFields];
    features = new List<Feature>(readHeader.NumRecords);

    for (int i = 0; i < fieldNames.Length; i++)
    {
        // rd.GetName(i + 1) here is not correct.
        fieldNames[i] = rd.GetName(i + 1);
    }
    // ...
}

Even testing every available encoding does not help:

// Test all available encodings:
foreach (var encodingInfo in Encoding.GetEncodings())
{
    using (var rd = new ShapefileDataReader(readFile, factory, encodingInfo.GetEncoding()))
    {
        readHeader = rd.DbaseHeader;

        string[] fieldNames = new string[readHeader.NumFields];
        features = new List<Feature>(readHeader.NumRecords);

        for (int i = 0; i < fieldNames.Length; i++)
        {
            // rd.GetName(i + 1) is still not correct, whichever encoding is passed.
            fieldNames[i] = rd.GetName(i + 1);
        }
        // ...
    }
}

No encoding made a difference, because ReadHeader detects the encoding itself and ignores the one passed to the constructor (see DetectEncodingFromMark below). So I rewrote the class DbaseFileHeader, changing the method public void ReadHeader(BinaryReader reader, string filename) to:

public void ReadHeader(BinaryReader reader, string filename)
{
    // type of reader.
    _fileType = reader.ReadByte();
    if (_fileType != 0x03)
        throw new NotSupportedException("Unsupported DBF reader Type " + _fileType);

    // parse the update date information.
    int year = reader.ReadByte();
    int month = reader.ReadByte();
    int day = reader.ReadByte();
    _updateDate = new DateTime(year + 1900, month, day);

    // read the number of records.
    _numRecords = reader.ReadInt32();

    // read the length of the header structure.
    _headerLength = reader.ReadInt16();

    // read the length of a record
    _recordLength = reader.ReadInt16();

    // skip the reserved bytes in the header.
    byte[] data = reader.ReadBytes(20);
    // The LDID (language driver ID) is byte 29 of the file; 12 bytes have
    // already been read, so it sits at index 29 - 12 = 17 of this array.
    byte lcid = data[29 - 12];

    // Original line, replaced so that the user-supplied encoding always wins:
    //_encoding = DetectEncodingFromMark(lcid, filename);
    _encoding = this.Encoding;

    //Replace reader with one with correct encoding..
    reader = new BinaryReader(reader.BaseStream, _encoding);
    // calculate the number of Fields in the header
    _numFields = (_headerLength - FileDescriptorSize - 1) / FileDescriptorSize;

    // read all of the header records
    _fieldDescriptions = new DbaseFieldDescriptor[_numFields];
    for (int i = 0; i < _numFields; i++)
    {
        _fieldDescriptions[i] = new DbaseFieldDescriptor();

        // read the field name              
        byte[] buffer = reader.ReadBytes(11);
        // NOTE: only this _encoding.GetString method is available in Silverlight
        String name = _encoding.GetString(buffer, 0, buffer.Length);
        int nullPoint = name.IndexOf((char)0);
        if (nullPoint != -1)
            name = name.Substring(0, nullPoint);
        _fieldDescriptions[i].Name = name;

        // read the field type
        _fieldDescriptions[i].DbaseType = (char)reader.ReadByte();

        // read the field data address, offset from the start of the record.
        _fieldDescriptions[i].DataAddress = reader.ReadInt32();

        // read the field length in bytes
        int tempLength = reader.ReadByte();
        if (tempLength < 0) tempLength = tempLength + 256;
        _fieldDescriptions[i].Length = tempLength;

        // read the field decimal count in bytes
        _fieldDescriptions[i].DecimalCount = reader.ReadByte();

        // read the reserved bytes.
        //reader.skipBytes(14);
        reader.ReadBytes(14);
    }

    // Last byte is a marker for the end of the field definitions.
    // Trond Benum: This fails for some presumably valid test shapefiles, so I have commented it out.
    byte lastByte = reader.ReadBytes(1)[0];
    // if (lastByte != 0x0d)
    //   throw new ShapefileException("DBase Header is not terminated");

    // Assure we are at the end of the header!
    if (reader.BaseStream.Position != _headerLength)
        reader.BaseStream.Seek(_headerLength, SeekOrigin.Begin);
}

In fact, I only changed the line _encoding = DetectEncodingFromMark(lcid, filename); to _encoding = this.Encoding;. With that change I can read the header correctly using the encoding with code page 936 (GBK):

var dbf = shpPath.Substring(0, shpPath.LastIndexOf(".shp")) + ".dbf";
FileStream stream = new FileStream(dbf, FileMode.Open, FileAccess.Read, FileShare.Read);
var fileReader = new BinaryReader(stream, Encoding.GetEncoding(936));
var header = new DbaseFileHeaderEx(Encoding.GetEncoding(936));
// read the header
header.ReadHeader(fileReader, dbf);
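
As a workaround that avoids patching the library, note that DetectEncodingFromMark (shown further below) consults a sidecar .cpg file whenever the LDID byte does not resolve via the LdidToEncoding table. Dropping a one-line .cpg next to the .dbf may therefore be enough; a minimal sketch, assuming the LDID in your file is one the table does not map:

using System.IO;

// Hypothetical workaround: name the code page in a sidecar .cpg file so
// DetectEncodingFromMark resolves GBK without any library changes.
File.WriteAllText(Path.ChangeExtension(dbf, "cpg"), "GBK");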
DGuidi commented 3 years ago

Maybe related in some way to #39 and #52 (which were fixed in PR #53)? Can you add the sample data to test this issue?

RenZachary commented 3 years ago

> Maybe related in some way to #39 and #52 (which were fixed in PR #53)? Can you add the sample data to test this issue?

You can use QGIS to create a shapefile layer. When creating it, select GBK as the encoding and use a column header and attribute values that contain Chinese characters; the problem can then be reproduced. fieldName: 中文列名 ("Chinese column name"); fieldValue: 中文属性值 ("Chinese attribute value", a value with Chinese characters).

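For a self-contained repro without QGIS, here is a minimal, hypothetical test-data generator. It writes a one-field, zero-record .dbf whose field name is GBK-encoded, following the same layout that ReadHeader above parses; a sketch under those assumptions, not library code:

using System;
using System.IO;
using System.Text;

static void WriteGbkDbf(string path)
{
    var gbk = Encoding.GetEncoding(936); // GBK, code page 936
    using (var w = new BinaryWriter(File.Create(path)))
    {
        w.Write((byte)0x03);                       // file type: dBASE III
        w.Write((byte)(DateTime.Now.Year - 1900)); // last update: year
        w.Write((byte)DateTime.Now.Month);         // month
        w.Write((byte)DateTime.Now.Day);           // day
        w.Write(0);                                // number of records
        w.Write((short)(32 + 32 + 1));             // header length: prefix + one descriptor + terminator
        w.Write((short)(1 + 10));                  // record length: deletion flag + field
        w.Write(new byte[20]);                     // reserved; LDID (file offset 29) left 0x00

        var name = new byte[11];                   // field descriptor: 11-byte name
        gbk.GetBytes("中文列名").CopyTo(name, 0);   // "Chinese column name" in GBK
        w.Write(name);
        w.Write((byte)'C');                        // field type: character
        w.Write(0);                                // data address
        w.Write((byte)10);                         // field length
        w.Write((byte)0);                          // decimal count
        w.Write(new byte[14]);                     // reserved
        w.Write((byte)0x0D);                       // header terminator
    }
}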

RenZachary commented 3 years ago

I found a logic problem in DetectEncodingFromMark:

private Encoding DetectEncodingFromMark(byte lcid, string cpgFileName)
{
    Encoding enc;
    if (LdidToEncoding.TryGetValue(lcid, out enc))
        return enc;
    enc = Encoding.UTF8;
    if (String.IsNullOrEmpty(cpgFileName))
        return enc;
    cpgFileName = Path.ChangeExtension(cpgFileName, "cpg");
    // PROBLEM: when no ".cpg" (or ".cst") file exists, the fallback is UTF-8,
    // and the encoding the user specified in the constructor is never considered.
    if (!File.Exists(cpgFileName))
        cpgFileName = Path.ChangeExtension(cpgFileName, "cst");
    if (!File.Exists(cpgFileName))
        return enc;
    string encodingText = File.ReadAllText(cpgFileName).Trim();
    try { return Encoding.GetEncoding(encodingText); }
    catch { }
    return enc;
}
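
A sketch of one possible fix: fall back to the constructor-supplied encoding instead of a hard-coded UTF-8. Here _userEncoding is a hypothetical field holding the encoding passed to the constructor; the real field name in the library may differ:

private Encoding DetectEncodingFromMark(byte lcid, string cpgFileName)
{
    Encoding enc;
    if (LdidToEncoding.TryGetValue(lcid, out enc))
        return enc;

    // Prefer the user-supplied encoding over a hard-coded UTF-8 fallback.
    enc = _userEncoding ?? Encoding.UTF8;
    if (String.IsNullOrEmpty(cpgFileName))
        return enc;

    cpgFileName = Path.ChangeExtension(cpgFileName, "cpg");
    if (!File.Exists(cpgFileName))
        cpgFileName = Path.ChangeExtension(cpgFileName, "cst");
    if (!File.Exists(cpgFileName))
        return enc;

    string encodingText = File.ReadAllText(cpgFileName).Trim();
    try { return Encoding.GetEncoding(encodingText); }
    catch { /* ignore and use the fallback */ }
    return enc;
}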
mumianhua2008 commented 2 years ago

Has the issue of garbled Chinese field names been fixed yet?

DGuidi commented 2 years ago

> Has the issue of garbled Chinese field names been fixed yet?

?

DGuidi commented 2 years ago

> You can use QGIS to create a shapefile layer.

Please add the test data, and specify the version of the library that you are using.

RenZachary commented 2 years ago

I modified some of the code to drop the dependency on .NET Core in favor of .NET Framework 4.6.1, then rebuilt the package; that solved it.

DGuidi commented 2 years ago

Can you post in English? Thanks.

RenZachary commented 2 years ago

> Can you post in English? Thanks.

To translate my previous comment: I modified some of the code, removed the dependency on .NET Core in favor of .NET Framework 4.6.1, and rebuilt the project; that solved the problem.
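
A hedged side note not stated in the thread: this fix is consistent with code page 936 (GBK) being unavailable on .NET Core by default. There, Encoding.GetEncoding(936) throws unless the System.Text.Encoding.CodePages package is referenced and its provider registered first:

using System.Text;

// On .NET Core / .NET 5+, legacy code pages such as 936 (GBK) require the
// System.Text.Encoding.CodePages NuGet package plus explicit registration:
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

var gbk = Encoding.GetEncoding(936); // now succeeds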

DGuidi commented 2 years ago

As you may notice (see the homepage), the development of this project has been discontinued. However, I will do a check IF you publish some test data (not just the procedure to generate the wrong data) AND the sample code to reproduce the error (not just a description of the code that should be written). Thanks.

KubaSzostak commented 1 month ago

We have requested more information regarding this issue but have not received a response. Therefore, we are closing this issue.