MarkPflug / Sylvan.Data.Excel

The fastest .NET library for reading Excel data files.
MIT License
248 stars 33 forks source link

ExcelFormat incorrectly converting values formatted as h:mm:ss #196

Open JaseRandall opened 3 days ago

JaseRandall commented 3 days ago

I have found an issue and a potential solution:

I have an excel worksheet with a value of 0, formatted as Custom "[h]:mm:ss" displaying "0:00:00". When I use the code below, the value is returned as "0001-01-01", which is obviously not the desired value.

using (var reader = ExcelDataReader.Create(filename)) //, new ExcelDataReaderOptions { Schema = excelSchema }))
{
    do
    {
        while (reader.Read())
        {
            for (int i = 0; i < reader.RowFieldCount; i++)

            var a = reader.GetExcelDataType(i); // returns a data type of Numeric
            var value = reader.GetValue(i); // returns a value of 0001-01-01
        }
    } while (reader.NextResult());
}

Within ExcelDataReader, you have:

string FormatVal(int xfIdx, double val)
{
    var fmtIdx = xfIdx >= this.xfMap.Length ? -1 : this.xfMap[xfIdx];
    if (fmtIdx == -1)
    {
        return val.ToString();
    }

    if (formats.TryGetValue(fmtIdx, out var fmt)) //fmt is coming out of the dictionary as {[46, {[h]:mm:ss (Time)}]}
    {
        return fmt.FormatValue(val, this.dateMode); // this code is incorrectly converting 0:00:00 to 01-01-001
    }
    else
    {
        return val.ToString();
    }
}

which leads to:

    internal string FormatValue(double value, ExcelDataReader.DateMode mode)
    {
        var kind = this.Kind;
        switch (kind)
        {
            case FormatKind.Number:
                return value.ToString("G");
            case FormatKind.Date:
            case FormatKind.Time:
                if (ExcelDataReader.TryGetDate(this, value, mode,  out var dt)) // Here's the problem, we've gone into FormatKind.Time but what's really happening is that the value is being converted into a date.
                {
                    if (dt.TimeOfDay == TimeSpan.Zero)
                    {
                        return IsoDate.ToDateStringIso(dt); // here the time value is being converted into a DateString when we really want a TimeString
                    }
                    else
                    {
                        return IsoDate.ToStringIso(dt);
                    }
                }
                else
                {
                    // for values rendered as time (not including date) that are in the
                    // range 0-1 (which renders in Excel as 1900-01-00),
                    // allow these to be reported as just the time component.
                    if (value < 1d && value >= 0d && Kind == FormatKind.Time)
                    {
                        // omit rendering the date when the value is in the range 0-1
                        // this would render in Excel as the date 
                        var fmt = "HH:mm:ss.FFFFFF";
                        dt = DateTime.MinValue.AddDays(value);
                        return dt.ToString(fmt);
                    }
                }
                // We arrive here for negative values which render in Excel as "########" (not meaningful)
                // or 1900-01-00 date, which isn't a valid date.
                // or 1900-02-29, which is a non-existent date.
                // The value can still be accessed via GetDouble.
                return string.Empty;
        }
        return value.ToString();
    }

So, for case FormatKind.Time, what we really need to do is:

case FormatKind.Time:
    if (ExcelDataReader.TryGetDate(this, value, mode, out var dt1))
    {
        if (Format.Contains("[h]"))
        {
            // Custom handling for [h]:mm:ss format
            TimeSpan timeSpanSinceEpoch = dt1 - DateTime.MinValue;
            int totalHours = (int)timeSpanSinceEpoch.TotalHours;
            return $"{totalHours}:{timeSpanSinceEpoch.Minutes:D2}:{timeSpanSinceEpoch.Seconds:D2}";

        }
        else
        {
            // General handling for other formats
            return dt1.ToString(ParsingFormat);
        }
    }
    return string.Empty;

As can be seen, we will also need to add a string ParsingFormat { get; set; } property to the ExcelFormat class and overload the constructors like:

    internal ExcelFormat(string spec)
    {
        this.Format = spec;
        this.Kind = DetermineKind(spec);
        ParsingFormat = spec;
    }

    internal ExcelFormat(string spec, FormatKind kind, string? format = null)
    {
        this.Format = spec;
        this.Kind = kind;
        ParsingFormat = format ?? (ParsingFormat = string.Empty);
    }

we would also need to modify the following code in the static constructor of ExcelFormat:

        fmts[18] = new ExcelFormat("h:mm AM/PM", FormatKind.Time, "h:mm tt");
        fmts[19] = new ExcelFormat("h:mm:ss AM/PM", FormatKind.Time, "h:mm:ss tt");
        fmts[20] = new ExcelFormat("h:mm", FormatKind.Time, "h:mm");
        fmts[21] = new ExcelFormat("h:mm:ss", FormatKind.Time, "h:mm:ss");
        fmts[22] = new ExcelFormat("m/d/yy h:mm:ss", FormatKind.Date, "m/d/yy h:mm:ss");
        fmts[45] = new ExcelFormat("mm:ss", FormatKind.Time, "mm:ss");
        fmts[46] = new ExcelFormat("[h]:mm:ss", FormatKind.Time, "[h\\:mm\\:ss]");
        fmts[47] = new ExcelFormat("mm:ss.0", FormatKind.Time, "mm\\:ss\\.f");

So, the question becomes, is there a workaround that you can suggest without modifying the ExcelFormat code or would the proposed solution fit and therefore should I work out and write some tests and submit a pull request?

JaseRandall commented 3 days ago

Ok, been doing more testing and realised I could probably just use:

var value = reader.GetValue(i);
var dateTime = DateTime.Parse(value.ToString());
var timeSpan = new TimeSpan(dateTime.Ticks);

However, I have now found a different issue with this code: (Actually, probably the original issue that led me down a rabbit hole.)

static internal bool TryGetDate(ExcelFormat fmt, double value, DateMode mode, out DateTime dt)
{
    dt = DateTime.MinValue;
    DateTime epoch = Epoch1904;
    // Excel doesn't render negative values as dates.
    if (value < 0.0)
        return false;
    if (mode == DateMode.Mode1900)
    {
        epoch = Epoch1900;
        if (value < 61d)
        {
            if (value < 1)
            {
                if (fmt.Kind == FormatKind.Time)
                {
                    dt = DateTime.MinValue.AddDays(value);
                    return true;
                }

                // 0 is rendered as 1900-1-0, which is nonsense.
                // negative values render as "###"
                // so we won't support accessing such values.
                return false;
            }
            if (value >= 60d)
            {
                // 1900 wasn't a leapyear, but Excel thinks it was
                // values in this range are in-expressible as .NET dates
                // Excel renders it as 1900-2-29 (not a real day)
                return false;
            }
            value += 1;
        }
    }
    dt = epoch.AddDays(value);
    return true;
}

When reading a value out of Excel such as 2:58:48 with cell formating of [h]:mm:ss, it gets converted to a DateTime value of 1/01/0001 2:58:48 AM but when converting a value of 720:00:00 with a cell format of [h]:mm:ss, it gets converted to 30/01/1900 12:00:00 AM.

It's because it passes through epoch.AddDays(value); without looking at the fmt to see that it's a [h]:mm:ss (Time) format and therefore should be converted via this code: dt = DateTime.MinValue.AddDays(value); which would correctly convert it to: 31/01/0001 12:00:00 AM which is correctly converted into a TimeSpan of 720 hours.

MarkPflug commented 3 days ago

If you want a timespan, then you should ask for one, and use GetTimeSpan(int ordinal) instead of GetValue(int ordinal). GetValue uses the provided schema to determine what type of value to return, and with the default schema everything is treated as a string. I think you're right that there could be an improvement with GetValue: it should return a timespan formatted value when the format of the numeric is identified as a time value. However, I think you should just call GetTimeSpan, or provide a schema if you know the "shape" of the data ahead of time.

MarkPflug commented 1 day ago

I think it makes sense to fix this behavior. When accessing a value formatted as a time of day, it should be returning a string that represents the timespan. I've implemented this fix to be included in 0.4.16. Please try the prerelease build and verify that it works the way you expected it to: https://www.nuget.org/packages/Sylvan.Data.Excel/0.4.26-b0001

JaseRandall commented 21 hours ago

Thanks! but it's not yet quite how I would like it.

image Correctly returning as 00:17:43

image Incorrectly returning as 1900-01-02T00:17:43

If I didn't know the schema in advance, I would prefer to receive values that reflect the display value rather than a DateTime value that would appear if the data had of been formatted as a DateTime format.

For some context, my use case is to read an excel file into a DataTable then display in a DataGrid. I understand there might be a question of why not just open it in Excel to view but Excel has quirks that result in it modifying data so my preference is a custom data viewer but that relies on values being read in a manner that is an accurate representation of the values.

(The quirks in Excel include the automatic conversion of large numbers to scientific notation, truncates number type values with more than 15 digits to 15 digits and in some cases changes dd/mm/yyyy format values to mm/dd/yyyy values with no ability to change back via display formatting.)