MarkPflug / Sylvan.Data.Excel

The fastest .NET library for reading Excel data files.
MIT License
243 stars 30 forks source link

ExcelDataReader.GetValues() uses FieldCount rather than RowFieldCount #174

Closed JaseRandall closed 5 months ago

JaseRandall commented 5 months ago

I have a .xlsx file that is a jagged array with column headings appearing on a row that is not row 1 (could be anywhere from row 5 to 50).

I see from Issue #114 that I can use reader.Initialize() to set the current row as the row with the column headers. Also, #157 has "Instead of: ed.FieldCount, use ed.RowFieldCount instead." which I use to create allocate the array fed into GetValues().

However, while looping to find the header row, I'm not able to view all the values on each row because GetValues() uses FieldCount rather than RowFieldCount.

It's easier to see using code and code comments:


// Here's an extract of my code based on the documentation
using (var reader = ExcelDataReader.Create(filename))
{
    do
    {
        while (reader.Read())
        {
            if (reader.RowNumber < 9)
                continue;

            // values shown by the debugger:
            // reader.FieldCount == 3
            // reader.RowFieldCout == 7
            object[] arrayOfValues = new object[reader.RowFieldCount];
            reader.GetValues(arrayOfValues);
            // arrayOfValues has 7 fields however, GetValues() only fills the first 3 fields with data
            // This is because GetValues() uses FieldCount rather than RowFieldCount (see below)
        }
    } while (reader.NextResult());
}

// Extract from Sylvan.Data.Excel.ExcelDataReader
public abstract partial class ExcelDataReader : DbDataReader, IDisposable, IDbColumnSchemaGenerator
{
    public sealed override int GetValues(object[] values)
    {
        ValidateAccess();

// should the next line use RowFieldCount rather than FieldCount?
// considering if someone wants to only use FieldCount, they can initialise the array size using FieldCount
// and then we don't need an additional method to use rowFieldCount

        var c = Math.Min(values.Length, this.FieldCount);

        for (int i = 0; i < c; i++)
        {
            values[i] = this.GetValue(i);
        }
        return c;
    }

    // Alternatively, should I submit a pull request containing a method for GetRowValues?
    public int GetRowValues(object[] values)
    {
        ValidateAccess();
        var c = Math.Min(values.Length, this.rowFieldCount);

        for (int i = 0; i < c; i++)
        {
            values[i] = this.GetValue(i);
        }
        return c;
    }
}

So, I guess my question is, should ExcelDataReader.GetValues use FieldRowCount to determine how many times to loop filling data? Or should there be a specific GetRowValues method?

MarkPflug commented 5 months ago

I think you have a variety of options here.

You can quite easily create your own GetRowValues extension method for ExcelDataReader:

static class ExcelDataReaderExtensions
{
    static int GetRowValues(this ExcelDataReader csv, object[]? values)
    {
        var c = csv.RowFieldCount;
        if (values == null) return c; // allow querying count

        // throw because there isn't enough room, or alternately you could fill the space available?
        if (values.Length < c) throw new ArgumentOutOfRangeException(); 

        for (int i = 0; i < c; i++)
        {
            values[i] = csv.GetValue(i);
        }
        return c;
    }
}

Or, the Sylvan.Data nuget package provides a GetVariableField extension method, which provides basically the same functionality. Here is a complete example using Sylvan.Data.Csv.CsvDataReader, but should work just as well with ExcelDataReader:

using Sylvan.Data.Csv;
using Sylvan.Data;

var data =

    """
    this is a junk row
    a,b,c
    1,2,3
    1,2,3,4
    1,2,3
    1,2
    """;

var csv = CsvDataReader.Create(new StringReader(data));
csv.Read(); // skip the first junk row

// initialize using the headers on row 2
csv.Initialize();

// create a data reader that reads jagged rows via RowFieldCount
// any *extra* columns will be treated as type "string"
var jagged = csv.AsVariableField(r => r.RowFieldCount, typeof(string));

object[] values = new object[100];
while (jagged.Read())
{
    var count = jagged.GetValues(values);
    Console.WriteLine("Count: " + count);
    for (int i = 0; i < count; i++)
    {
        Console.WriteLine(values[i]);
    }
    Console.WriteLine("---");
}

Which outputs:

Count: 3
1
2
3
---
Count: 4
1
2
3
4
---
Count: 3
1
2
3
---
Count: 2
1
2
---

In this case, you still don't know how many columns are in a row without inspecting RowFieldCount, so the values object array might not be big enough.

Either way, I don't a modification to the library is needed to handle this scenario.