MarkPflug / Sylvan.Data.Excel

The fastest .NET library for reading Excel data files.
MIT License

Formats Issue #18

Closed: KMastalerz closed this issue 2 years ago

KMastalerz commented 2 years ago
  1. I have an issue when using the .Load() function to create a DataTable: some of my tables in Excel contain an int value for the first 10,000 rows, and later on it's mixed between int and string. This causes an exception, as it cannot load a string value into a column whose type was inferred from the first rows. Is there a way to either look at the entire range of rows or to force every data type to string? Some cells are even empty. <- This is just an example; the file is a mess. A column might appear as a date and later have blank cells or even strings, etc., or hold doubles alongside amounts that appear as int.
  2. Additionally, in the code below, RowCount actually returns the column count.
                using (ExcelDataReader excelDataReader = ExcelDataReader.Create(_filePath))
                {
                    int WsCount = excelDataReader.WorksheetCount;
                    int ColumnCount = excelDataReader.FieldCount;
                    int RowCount = excelDataReader.RowCount;
                    DataTable dataTable = new DataTable();
                    dataTable.Load(excelDataReader);
                }
  3. Additionally, some reports I have start the table on the 3rd row (headers from A3 to K3), with the 1st row being an automated message (only cell A1) and the second empty. The .Load() function builds the table from the first column only, although .FieldCount is set correctly. It would be helpful to be able to pass an int declaring from which row the extract should start, if that's possible. I'll test later whether iterating works in this case.

I'll wait for NuGet with 0.1.7, as I do not know how to compile the code to get the .dll; I'm too fresh :)

But other than that, the claim that it's the fastest reader seems to be warranted. Great work :)

MarkPflug commented 2 years ago
  1. The default schema for the ExcelDataReader exposes everything as "string?" (nullable string). It makes no effort to identify column types. You can explicitly specify the schema, though, if you know it ahead of time. See this documentation. Regardless of the schema specified, you can use any of the accessor methods (GetInt32, GetBoolean, GetDouble, etc.) to read a value. Those methods will internally coerce the values to the requested type, or throw if the coercion isn't possible (see the sketch after this list).

  2. What is the file type being loaded here? RowCount cannot always be determined, since it depends on a piece of header metadata that isn't always present. If you can provide an example file that exhibits the behavior, that would be the easiest to debug. When RowCount can't be determined it should return -1; however, I think it might be incorrectly returning 0 right now, which I'll fix.

  3. This one is a bit of a problem, and I'm trying to decide how best to handle it in the API. This library is currently best suited to Excel files that are rectangular and don't have extra lines to skip. FieldCount will return the number of columns that it found in the first row. So, if your file has a header in cell A1, it might report FieldCount as 1. You can also inspect RowFieldCount, which returns the number of columns in the current row, which is useful when accessing "jagged" data.
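
For item 1, a minimal sketch of reading mixed-type cells without specifying a schema (assuming a hypothetical data.xlsx whose first column mixes numbers and text):

using Sylvan.Data.Excel;

using var edr = ExcelDataReader.Create("data.xlsx");
while (edr.Read())
{
    // with the default schema every cell can be read as a string
    var text = edr.GetString(0);

    // GetInt32 coerces when it can and throws when it can't,
    // so guard messy columns yourself (hypothetical guard)
    if (int.TryParse(text, out var number))
    {
        // use number
    }
}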

KMastalerz commented 2 years ago
  1. I decided to fill in the data with a loop and show the row and column index of the data that gets an error. This showed me that someone writes things like "-Firstname, Lastname", which is read as "#NAME". I think this is the root cause of this error; I'll be able to delete those lines to test that in a few hours.
  2. I'll prepare a file for you; I'll have to replace sensitive information in it. The file is quite large, more than 20 MB, due to the amount of data. Any preference on how to link it?
  3. This will be a problem in my project, as most of the files have an irregular format, so I'll go back to ExcelDataReader, but I'd like to keep testing your solution, as it's proving to be very effective.
MarkPflug commented 2 years ago

1) #NAME is an Excel error code, indicating that there is a formula error. If you type that into Excel it will interpret it as a formula and automatically prepend an = in front of it. Excel also displays it as #NAME.

2) I think you can just drag it into this issue (there is a message at the bottom of this text input area).

3) Yep, Excel is notorious for having bad data. Unfortunately, I can't solve all of the problems people have in their files. There are certain patterns I'd like to be able to support though, so if you have examples I'd like to see them.

KMastalerz commented 2 years ago

I know that #NAME is an Excel error, although I'm not sure how to handle it. I'm also not sure if this is a real issue: it appeared after I saved the .csv file as .xlsx for testing, so I'll leave it for tomorrow, as my company laptop is crashing a bit today. I will create a test file to mimic the structures of a few workbooks, one on each sheet. I already had one previously but wasn't able to load it here. I can upload it to my git as public tomorrow, though, and drop a link here.

KMastalerz commented 2 years ago

Table At Lower Level with some information to be accessed above.xls Table At Lower Level.xlsx

KMastalerz commented 2 years ago

1 - I have to access cell A2, then move to the table with rates. 2 - Originally there is no Sheet3, only Sheet1 [the source for our data extract]; I left it in for testing purposes, though. Some other reports have a similar issue but start at different rows. Some have details in the columns above, some do not. 3 (no file) - Some handler would be helpful when there is a "#NAME" etc., to take the literal value rather than the formula-calculated one. I'm not sure if this is even possible in the case of this library.

KMastalerz commented 2 years ago

I additionally tested the files with the newest prerelease version:

Exception on .Load() : System.NullReferenceException: 'Object reference not set to an instance of an object.'

Code:

                using (Syl.ExcelDataReader excelDataReader = Syl.ExcelDataReader.Create(_filePath))
                {
                    do
                    {
                        DataTable = new DataTable();
                        DataTable.Load(excelDataReader);

                    } while (excelDataReader.NextResult());
                }

Exception on .NextResult() : System.InvalidOperationException: 'Operation is not valid due to the current state of the object.'

Code:

                using (Syl.ExcelDataReader excelDataReader = Syl.ExcelDataReader.Create(_filePath))
                {
                    do
                    {
                        //just loop

                    } while (excelDataReader.NextResult());
                }
MarkPflug commented 2 years ago

Thanks for the files, I'll try to take a look and get back to you soon.

KMastalerz commented 2 years ago

By reviewing the XML, I can see some of the problems now: some of the files we use can have rearranged sheets, and additionally they can have deleted sheets. I had copied such a file and replaced the data (Table At Lower Level.xlsx), so I think the looping issue lies below:

name="Sheet3" sheetId="3" r:id="rId1"/><sheet name="Sheet1" sheetId="1"

MarkPflug commented 2 years ago

@KMastalerz I've pushed v.0.1.7-b0006 that I think should fix these issues. There was a bug with handling sheets that had been rearranged. I also added the ability to reinitialize the worksheet with a schema after reading some rows, which allows stepping over the header rows. This gist shows how you can load your spreadsheet into a DataTable. https://gist.github.com/MarkPflug/6ef1fd9f6fa0579d61ce66657d662451

Let me know if you get this working and I'll push these bits as an official 0.1.7 release.
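
In short, the pattern looks roughly like this (a minimal sketch rather than the gist itself, assuming a report whose header row starts with "Name" and hypothetical column names):

using System.Data;
using Sylvan.Data;
using Sylvan.Data.Excel;

var opts = new ExcelDataReaderOptions { Schema = ExcelSchema.NoHeaders };
using var edr = ExcelDataReader.Create("report.xlsx", opts);

// step over the leading junk rows until the header row is the current row
while (edr.Read())
{
    if (edr.GetString(0) == "Name") // hypothetical first header
        break;
}

// reinitialize with the real schema; useHeaders: true pairs it with the header row
var schema = Schema.Parse("Name,Rate,Date"); // hypothetical columns
edr.InitializeSchema(schema.GetColumnSchema(), useHeaders: true);

var table = new DataTable();
table.Load(edr);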

MarkPflug commented 2 years ago

Also, I meant to mention that you can set the option GetErrorAsNull to cause the reader to treat all cells with errors as if they were null instead.

var opts = new ExcelDataReaderOptions {
  GetErrorAsNull = true,
};
var edr = ExcelDataReader.Create("data.xlsx", opts);

This defaults to false, in which case an exception will be thrown if you try to access the value in a cell that has a formula error.
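
With the option enabled, an errored cell simply reads as null; for example (a minimal sketch):

while (edr.Read())
{
    // a cell containing #NAME comes back as DBNull instead of throwing
    var value = edr.IsDBNull(0) ? null : edr.GetString(0);
}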

KMastalerz commented 2 years ago

OK, the file delivered seems to be working, but I have an exception on another one: {"Index was outside the bounds of the array."}

This is because the file in question has headers with a comma (,) in them.

Here is my code:

        private void SylvanReader(string _filePath, string? _sheetName = null, FilesStructures.Structure? _structure = null)
        {
            try
            {

                string headerSchema = "";

                //build headers
                if (_structure != null)
                {
                    foreach(string header in _structure.HeaderList)
                    {
                        headerSchema += header + ", ";
                    }
                    //remove last ", "
                    headerSchema = headerSchema.Substring(0, headerSchema.Length - 2);
                }

                var options = new Syl.ExcelDataReaderOptions
                {
                    Schema = Syl.ExcelSchema.NoHeaders
                };

                //collect data
                using (Syl.ExcelDataReader excelDataReader = Syl.ExcelDataReader.Create(_filePath, options))
                {
                    //loop to locate sheet
                    while (excelDataReader.WorksheetName != _sheetName)
                    {
                        excelDataReader.NextResult();

                        if (excelDataReader.WorksheetName == null)
                        {
                            throw new Exception("didnt find the sheet");
                        }
                    }
                    //loop to find headers
                    if (_structure != null)
                    {
                        while (excelDataReader.Read())
                        {
                            var column = excelDataReader.GetString(0);

                            if (column == _structure.HeaderList[0])
                            {
                                break;
                            }
                        }

                        // parse the schema, and use it to reinitialize the schema for the sheet.
                        var schema = SylData.Schema.Parse(headerSchema);
                        excelDataReader.InitializeSchema(schema.GetColumnSchema(), useHeaders: true);

                        DataTable = new DataTable();
                        DataTable.Load(excelDataReader);
                    }
                }
            }
            catch(Exception ex)
            {
                throw new Exception(ex.Message);
            }
        }

Originally I used a loop to get all the columns, not just the ones actually needed, and the column with the comma isn't needed. So I tried the code below; unfortunately it doesn't map the columns I need, it just takes from 0 to the number of elements in the schema list. It did work on the file with the comma in a column name, as that column wasn't selected, but I got wrong results. I could work around this, though, as it's far beyond the table column index that I need, and delete the columns that aren't needed. But I thought you should know.

Quick summary: _structure.ColumnList = List<T>, where the class contains a string with the header from the file, and this list contains only the columns I actually need. _structure.HeaderList = List<string> contains all the headers from the file.

                //build headers
                if (_structure != null)
                {
                    foreach(var columnSetup in _structure.ColumnList)
                    {
                        headerSchema += columnSetup.columnName + ", ";
                    }
                    //remove last ", "
                    headerSchema = headerSchema.Substring(0, headerSchema.Length - 2);
                }

For now I can work around this, I think. But there is stuff worth looking at; possibly mapping the column name to the result would be enough, just to skip columns with a comma in the name.

File in question: Note: I cannot attach an .xlsb file, so I zipped it (gist doesn't allow Excel at all) and reduced the row count: Test File For Mark.zip

But it's worth mentioning: I can work around this, and the speed of Sylvan totally justifies doing that.

Nevertheless, I've got another exception that I cannot create a workaround for.

Exception: System.NotImplementedException: 'The method or operation is not implemented.'

Exception occurs here: using (Syl.ExcelDataReader excelDataReader = Syl.ExcelDataReader.Create(_filePath, options))

It was caused by the presence of this sheet: the sheet is originally hidden; once I deleted it, the rest worked fine. But even standalone it shows the same error. Legal.zip

MarkPflug commented 2 years ago

I just pushed 0.1.7-b0007 to NuGet. It fixes the NotImplementedException and adds a new option, ReadHiddenWorksheets, that allows reading or skipping hidden sheets; it defaults to false, so hidden sheets are skipped.
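
For example (a minimal sketch):

var opts = new ExcelDataReaderOptions
{
    // opt in to reading hidden sheets; they are skipped by default
    ReadHiddenWorksheets = true,
};
using var edr = ExcelDataReader.Create("data.xlsx", opts);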

The index out of bounds is coming out of Schema.Parse? You should be able to just rename the columns so there are no commas in the schema specification string; they don't have to match the names in the worksheet when you call InitializeSchema.
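
Something like this should work (a sketch with hypothetical names; only the column order has to line up with the worksheet, not the text):

// the worksheet header might say "Firstname, Lastname", but the schema
// specification can use any comma-free name for that position
var schema = Schema.Parse("SupplierNumber,ContactName,Amount");
excelDataReader.InitializeSchema(schema.GetColumnSchema(), useHeaders: true);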

KMastalerz commented 2 years ago

I can confirm that both the NotImplementedException fix and ReadHiddenWorksheets are working correctly.

One issue that I see, which isn't something that bothers me much, but I think you should know: WorksheetCount displays the number of all sheets, no matter whether ReadHiddenWorksheets is set to true or false. Although I'm not sure if you want to change that, as the sheet count is technically correct, just not what we extracted.

And lastly, System.IndexOutOfRangeException: 'Index was outside the bounds of the array.' comes from excelDataReader.InitializeSchema(schema.GetColumnSchema(), useHeaders: true); but only on the file where a comma is in a column name. Once I removed/changed that column, it worked fine. The file is attached as: Test File For Mark.zip

The code I use:

 private void SylvanReader(string _filePath, string _sheetName, FilesStructures.Structure _structure)
        {
            try
            {

                string headerSchema = "";

                //get unique column list to test for duplicates
                List<string> columnList = new List<string>();

                //build headers; doesn't work when date? is a string due to user errors etc., cannot be used on both CSV and Excel
                //_structure.HeaderList = List<string>, a previously built list saved in JSON to reflect all headers in the needed sheet.
                foreach (var header in _structure.HeaderList)
                {
                    if (!columnList.Exists(column => column == header))
                    {
                        headerSchema += header + ", ";
                        columnList.Add(header);
                    }
                }
                //remove last 2 chars
                headerSchema = headerSchema.Substring(0, headerSchema.Length - 2);

                var options = new SylExcel.ExcelDataReaderOptions
                {
                    Schema = SylExcel.ExcelSchema.NoHeaders,
                    GetErrorAsNull = true
                };

                //collect data
                using (SylExcel.ExcelDataReader excelDataReader = SylExcel.ExcelDataReader.Create(_filePath, options))
                {
                    //loop to locate sheet
                    while (excelDataReader.WorksheetName != _sheetName)
                    {
                        excelDataReader.NextResult();

                        if (excelDataReader.WorksheetName == null)
                        {
                            throw new Exception("didnt find the sheet");
                        }
                    }
                    //loop to find headers
                    if (_structure != null)
                    {
                        while (excelDataReader.Read())
                        {
                            var column = excelDataReader.GetString(0);

                            if (column == _structure.HeaderList[0])
                            {
                                break;
                            }
                        }

                        // parse the schema, and use it to reinitialize the schema for the sheet.
                        var schema = SylData.Schema.Parse(headerSchema);
                        excelDataReader.InitializeSchema(schema.GetColumnSchema(), useHeaders: true);

                        DataTable = new DataTable();
                        DataTable.Load(excelDataReader);
                    }
                }
            }
            catch(Exception ex)
            {
                throw new Exception(ex.Message);
            }
        }

Additionally, I confirm that once I changed the column name and removed the commas in the JSON, it worked fine. For me it's ready for release. Feel free to let me know if you need someone to perform tests at "krzysztof.mastalerz@outlook.com"; this library is great, and I'm happy to help out!

MarkPflug commented 2 years ago

Thanks for your help with this! The scenarios you presented were really helpful for fixing up some real-world use cases I hadn't considered. I'll take a look at the headers-with-commas issue and see what can be done there. I'm going to close this issue and open a different one to track that specific problem.

KMastalerz commented 2 years ago

I'll keep looking at this and testing; this library will be life-saving for tons of people.

MarkPflug commented 2 years ago

@KMastalerz Can you provide the string value of headerSchema when the index out of range exception is thrown?

KMastalerz commented 2 years ago

Hello Mark, I left my company PC as it needed an upgrade, and I have no access to those files right now. But it's attached as "Test File For Mark.zip" in the longest answer. I think it's the AN column, sheet "Full list".

KMastalerz commented 2 years ago

This list, _structure.HeaderList = List<string>, originally contained all the column headers mapped to their letters from Excel. I changed it, as you pointed out it's not necessary to name them the same; I thought that was mandatory. After changing it, it works fine. As Schema.Parse uses , as the delimiter, I don't see much point in an upgrade; this issue was my misunderstanding. I see more value added if we could extract data only from the parsed headers. Let's say I parse columns B, C, D, F and the sheet has columns A, B, C, D, E, F, G; right now if I parse the B, C, D, F headers, I still get the values from A, B, C, D mapped to them.

MarkPflug commented 2 years ago

I haven't been able to repro the index out of range yet, but hopefully with this info I can.

As far as selecting specific columns out of the file, you can use the Select extension method from the Sylvan.Data library:

DbDataReader reader = ExcelDataReader.Create("File.xlsx");
reader = reader.Select("B", "C", "D", "F");

It can also accept column ordinals instead of names.
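
For example, selecting by position instead (a sketch assuming the ordinal overload described above):

// keep the 2nd, 3rd, 4th and 6th columns (0-based ordinals)
reader = reader.Select(1, 2, 3, 5);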

KMastalerz commented 2 years ago

I didn't know that; that's helpful. I have to be more careful and check your documentation :) I'm currently working on code to call an async extract and bulk copy on 5 files; as your library doesn't take much time or resources, it's perfect for that.

But I cannot test it yet, as I'm working only on code from the GitHub repository. On Monday I'm hoping for great results.

MarkPflug commented 2 years ago

have to be more careful and check your documentation

Hah, good luck with that, my docs are ... lacking for sure.

KMastalerz commented 2 years ago

Hello Mark,

How can I select columns after extracting the data?

I am extracting with Sylvan.Data.Excel, but .Select() belongs to Sylvan.Data.

I'm not sure how to do this in the code below:

   using (SylExcel.ExcelDataReader excelDataReader = SylExcel.ExcelDataReader.Create(_filePath, options))
                {
                    //loop to locate sheet
                    while (excelDataReader.WorksheetName != _sheetName)
                    {
                        excelDataReader.NextResult();

                        if (excelDataReader.WorksheetName == null)
                        {
                            throw new Exception("didnt find the sheet");
                        }
                    }
                    //loop to find headers
                    for(int i = 0; i < _structure.StartRow; i++)
                    {
                        excelDataReader.Read();
                    }

                    //limit columns here? how?

                    // parse the schema, and use it to reinitialize the schema for the sheet.
                    var schema = SylData.Schema.Parse(headerSchema);
                    excelDataReader.InitializeSchema(schema.GetColumnSchema(), useHeaders: true);

                    DataTable = new DataTable();
                    DataTable.Load(excelDataReader);
                }
MarkPflug commented 2 years ago

You should put it after InitializeSchema(...).

var data = excelDataReader.Select("columnName1", "ColumnName2", etc);
DataTable = new DataTable();
DataTable.Load(data); // data table should only get the selected columns.
KMastalerz commented 2 years ago

I have one idea, although I'm not sure if it's possible, and I do not see an ideas tab in this repo. But I think both the Excel and CSV readers would also benefit from eventually casting the data into a List<T>; this would allow quicker work on heavier datasets :)

An additional query: can you loop the CSV reader the same way as the Excel reader (without loading into a DataTable)? :

        for(int i = 0; i < edr.FieldCount; i++)
        {
            var value = edr.GetString(i);
        }
MarkPflug commented 2 years ago

The Sylvan.Data package contains a general-purpose data binder. It can be used with any DbDataReader implementation, which includes both my Excel and CSV readers, but also most other data providers in .NET. You can use the GetRecords<T> extension method to easily bind to a strongly-typed object.

using Sylvan.Data;
using Sylvan.Data.Csv;

const string data = @"Id,Name,Date,IsActive,Amount
1,Alpha,2020-01-01,true,12.75
2,Beta,2021-01-01,false,33.29
";

var reader = CsvDataReader.Create(new StringReader(data));
var records = reader.GetRecords<Item>();

foreach(var record in records)
{
    Console.WriteLine($"{record.Name}: {record.Amount}");
}

class Item
{
    public int Id { get; set; }
    public string Name { get; set; }
    public DateTime Date { get; set; }
    public bool IsActive { get; set; }
    public decimal Amount { get; set; }
}

You can interact with the CsvDataReader and the ExcelDataReader in pretty much the exact same way, yes.
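
The Excel side of the same pattern might look like this (a sketch, assuming a Data.xlsx whose first sheet has the same columns):

using Sylvan.Data;
using Sylvan.Data.Excel;

using var reader = ExcelDataReader.Create("Data.xlsx");
foreach (var record in reader.GetRecords<Item>())
{
    Console.WriteLine($"{record.Name}: {record.Amount}");
}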

KMastalerz commented 2 years ago

Hmmm, the 1st problem I have is that reader.GetRecords<Item>(); seems to need a specified class; second, it reads as IEnumerable<T> where T has to be specified, and I am unable to cast it to List<T>. Of course I could write specific extensions for each Excel file, but I'd rather have a generic one. The other problem is that some CSV files have messed-up delimiters, which causes a shift in columns on some rows. So instead of DateTime? I have a string, which caused a crash with IEnumerable<T>. Is there a way to read these as null instead? I'm not sure if I'm missing something.

I tried to create my own converters; they work, but those files have 120,000+ rows, and it's painfully slow due to reflection usage.

This is what I had tried:

       /// <summary>
        /// get data from .csv, .xls, .xlsx, .xlsm, .xlsb. Call is then redirected accordingly
        /// </summary>
        public List<T> GetData<T>()
        {
            //check setup
            if (string.IsNullOrWhiteSpace(Setup.FilePath)) throw new ArgumentNullException("please setup excel connection run");
            if (Setup.HeaderList == null) throw new ArgumentNullException("please setup excel connection run");
            if (Setup.HeaderList.Count == 0) throw new ArgumentNullException("please setup excel connection run");
            if (Setup.ColumnMapping == null) throw new ArgumentNullException("please setup excel connection run");
            if (Setup.ColumnMapping.Count == 0) throw new ArgumentNullException("please setup excel connection run");

            List<T> _temp;

            try
            {
                string extension = Path.GetExtension(Setup.FilePath);

                switch (extension)
                {
                    case ".csv":
                        _temp = SylvanReaderCSV<T>(Setup.FilePath, Setup.HeaderList);
                        break;
                    default:
                        _temp = SylvanReader<T>(Setup.FilePath, Setup.SheetName, Setup.HeaderList, Setup.StartRow);
                        break;
                }
                //assign numbers
                RowCount = _temp.Count;
                //collect file information
                GetMetadata(Setup.FilePath);
                //data table is too large for further operations
                return _temp;

            }
            catch (Exception ex)
            {
                throw new Exception(ex.Message);
            }
        }
        /// <summary>
        /// sylvan reader for .xls, .xlsx, .xlsm, .xlsb
        /// </summary>
        private List<T> SylvanReader<T>(string _filePath, string? _sheetName, List<string> _headerList, int _startRow)
        {
            try
            {

                string headerSchema = "";
                //get unique column list to test for duplicates
                List<string> columnList = new List<string>();

                //build headers _headerList is List<string> containing headers
                foreach (var header in _headerList)
                {
                    //to avoid duplicates
                    if (!columnList.Exists(column => column == header))
                    {
                        headerSchema += header + ",";
                        columnList.Add(header);
                    }
                }
                //remove last 2 chars
                headerSchema = headerSchema.Substring(0, headerSchema.Length - 1);

                var options = new SylExcel.ExcelDataReaderOptions { Schema = SylExcel.ExcelSchema.NoHeaders, GetErrorAsNull = true };
                //collect data
                using (SylExcel.ExcelDataReader excelDataReader = SylExcel.ExcelDataReader.Create(_filePath, options))
                {
                    //loop to locate sheet
                    while (excelDataReader.WorksheetName != _sheetName)
                    {
                        excelDataReader.NextResult();

                        if (excelDataReader.WorksheetName == null)
                        {
                            throw new Exception("didnt find the sheet");
                        }
                    }
                    //loop to find headers
                    for(int i = 0; i < _startRow; i++)
                    {
                        excelDataReader.Read();
                    }

                    // parse the schema, and use it to reinitialize the schema for the sheet.
                    var schema = Schema.Parse(headerSchema);
                    excelDataReader.InitializeSchema(schema.GetColumnSchema(), useHeaders: true);

                    List<T> temp = (List<T>)excelDataReader.GetRecords<ExcelInvoice>();
                    return temp;

                    //switch will be needed
                    //DataTable = new DataTable();
                    //DataTable.Load(excelDataReader); // data table should only get the selected columns.
                }
            }
            catch(Exception ex)
            {
                throw new Exception(ex.Message);
            }
        }
        /// <summary>
        /// sylvan reader for .csv
        /// </summary>
        private List<T> SylvanReaderCSV<T>(string _filePath, List<string> _headerList)
        {
            try
            {
                string headerSchema = "";
                //get unique column list to test for duplicates
                List<string> columnList = new List<string>();

                //build headers
                foreach (var header in _headerList)
                {
                    if (!columnList.Exists(column => column == header))
                    {
                        headerSchema += header + ",";
                        columnList.Add(header);
                    }
                }

                //remove last 2 chars
                headerSchema = headerSchema.Substring(0, headerSchema.Length - 1);

                var schema = new SylCSV.CsvSchema(Schema.Parse(headerSchema));
                var options = new SylCSV.CsvDataReaderOptions //check if get error as null possible
                { 
                    Schema = schema,
                };
                using (SylCSV.CsvDataReader csvDataReader = SylCSV.CsvDataReader.Create(_filePath, options))
                {
                    //switch will be needed
                    //DataTable = new DataTable();
                    //DataTable.Load(csvDataReader); // data table should only get the selected columns.
                    List<T> temp = (List<T>)csvDataReader.GetRecords<ExcelInvoice>();
                    return temp;
                }
            }
            catch (Exception ex)
            {
                throw new Exception(ex.Message);
            }
        }
MarkPflug commented 2 years ago

reader.GetRecords(); seems to need specified class

Yes, that's how C# generics work: you need to provide a concrete type. The DbDataReader interface allows accessing the data dynamically, by name. You can pick whichever is more appropriate, but GetRecords will require you to define a class.

it's reading as IEnumerable where T has to be specified and I am unable to cast it as List

You can use the ToList LINQ extension method to turn the IEnumerable into a list. IEnumerable allows accessing the records one by one. ToList will require loading the entire dataset into a single array. Depending on what you're doing you might not actually need the list.
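
For example (reusing the Item class from the earlier example):

using System.Linq;

var records = reader.GetRecords<Item>(); // lazy: streams one record at a time
var list = records.ToList();             // materializes the whole set in memory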

The other problem is some csv files have messed up delimiters, this causes shift in columns on some rows.

I can't help you with that. If the file is corrupt it is going to require someone to fix it, and it's best to go back to the original source of the file and fix it there.

KMastalerz commented 2 years ago

OK, last question: does reader.GetRecords<T>() have to reflect the data types specified in the passed schema and parameters? Or can the class here have a greater count of params?

MarkPflug commented 2 years ago

The data binder behind GetRecords will attempt to coerce the values, so the schema doesn't need to match. However, if the coercion fails, an exception will be thrown. The binder is expected to bind every property on the target class, but the data source can have extra, unbound columns.

Both of these can be configured though, as GetRecords is just a helper extension method that is shorthand for using the binder directly: https://github.com/MarkPflug/Sylvan/blob/8e49c53e787ace672ae96686166e435a359d006d/source/Sylvan.Data/DataExtensions.cs#L34-L43

The DataBinder.Create method accepts an options object that allows configuring the behavior of the binder

var binderOpts = new DataBinderOptions
{
    InferColumnTypeFromMember = true,
    BindingMode = DataBindingMode.All
};
DataBinder.Create<Data>(dataReader, binderOpts);

InferColumnTypeFromMember tells the binder to look at the type of the target property on the class and use the corresponding accessor on the DbDataReader, regardless of what type the data reader thinks the column is. This option only works for DbDataReader implementations that can do coercion internally, which both my CSV and Excel readers support, but most DbDataReaders don't.

BindingMode tells the binder whether all the class properties need to be assigned, or all the data columns need to be consumed. All means all properties need to be assigned and all columns need to be consumed; Any is the least restrictive and allows any number of properties and columns to bind; it probably even allows 0.
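
Using the binder directly looks roughly like the loop that GetRecords wraps (a sketch, reusing the Data class and options above):

var binder = DataBinder.Create<Data>(dataReader, binderOpts);
while (dataReader.Read())
{
    var item = new Data();
    binder.Bind(dataReader, item); // copies the current row into the object
    // use item here
}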

MarkPflug commented 2 years ago

Select is expecting an array of string (string[]), you are providing just a string. You need to call DataTable.Load with the selected data.

headersAll="IP Invoice ID,CustomerNumber,Scanning Date,SupplierNumber,Supplier Name,Last Approver,InvoiceNumber,Order Number,Invoice Date,DueDate,OriginalInvoceCurrency,OriginalInvoceAmount,Sum in Accounting Currency,Transfer_Date,Voucher Number,PaymentDate,Current_Recipient,BaswareStatus,Attn"

string[] headersNeeded= new string[] { "SupplierNumber","CustomerNumber","InvoiceNumber","DueDate","OriginalInvoceAmount","OriginalInvoceCurrency","PaymentDate","BaswareStatus"};

                var schema = new SylCsv.CsvSchema(Schema.Parse(headersAll));
                var options = new SylCsv.CsvDataReaderOptions //check if get error as null possible
                {
                    Schema = schema,
                };
                using (SylCsv.CsvDataReader csvDataReader = SylCsv.CsvDataReader.Create(Setup.FilePath, options)) //options
                {
                    //check
                    DataTable = new DataTable();
                     var selectedReader = csvDataReader.Select(headersNeeded);
                    DataTable.Load(selectedReader); // data table should only get the selected columns.                  
                }
KMastalerz commented 2 years ago

Yup, I noticed that .Select needs an array, not a string, and that Schema.Parse() needs a string (without spaces after the commas), not an array. It seems to work; I'm trying to optimize it now :) Thanks.

Does CsvDataReader also have the option to read errors as null?

KMastalerz commented 2 years ago

OK, I did something and it seems to work.

I'm sort of a dumbass, so would you check whether this makes sense, or whether it's bad practice?

As a concrete class is needed, because where T : new() was added to the GetRecords<T> extension, I decided to create another extension for my purpose.

New types would have to be added to this extension when used, but I'm not sure if this is OK, as I have zero experience working with strong dev teams and was always on my own with VBA 🗡️.

I EDITED THIS SOLUTION BELOW 😃

        public static List<T>? GetList<T>(this DbDataReader _reader)
        {
            var binderOpts = new DataBinderOptions
            {
                InferColumnTypeFromMember = false,
                BindingMode = DataBindingMode.Any
            };
            DataBinder.Create<ExcelInvoice>(_reader, binderOpts);
            List<T>? pass;

            try
            {
                T value = (T)Activator.CreateInstance(typeof(T));

                /// for each extracted excel with sylvan, this will have to be passed
                switch (value)
                {
                    case ExcelInvoice excelInvoice:
                        List<ExcelInvoice> list = _reader.GetRecords<ExcelInvoice>().ToList();
                        pass = list.Cast<T>().ToList();
                        break;
                    default:
                        throw new Exception(String.Format("Case to return class: {0}, has to be added to extension.", typeof(T).Name));

                }

                return pass;
            }
            catch (Exception ex)
            {
                throw new Exception(String.Format("Error during data extraction from class {0}. Exited with error: {1}", typeof(T).Name, ex.Message));
            }

        }

And it's called from:

       /// <summary>
        /// get data from .csv, .xls, .xlsx, .xlsm, .xlsb. Call is then redirected accordingly
        /// </summary>
        public List<T>? GetData<T>()
        {
            //check setup
            if (string.IsNullOrWhiteSpace(Setup.FilePath)) throw new ArgumentNullException("please setup excel connection run");
            if (Setup.HeaderList == null) throw new ArgumentNullException("please setup excel connection run");
            if (Setup.HeaderList.Count == 0) throw new ArgumentNullException("please setup excel connection run");
            if (Setup.ColumnMapping == null) throw new ArgumentNullException("please setup excel connection run");
            if (Setup.ColumnMapping.Count == 0) throw new ArgumentNullException("please setup excel connection run");
            //if (Setup.TypeMapping == null) throw new ArgumentNullException("please setup excel connection run");
            //if (Setup.TypeMapping.Count == 0) throw new ArgumentNullException("please setup excel connection run");
            //Setup.HeaderList = (List<string>)Setup.HeaderList.Distinct();//remove potential duplicates

            try
            {
                string extension = Path.GetExtension(Setup.FilePath);
                List<T>? _list = new List<T>();

                switch (extension)
                {
                    case ".csv":
                        _list = SylvanReaderCSV<T>();
                        break;
                    default:
                        _list = SylvanReader<T>();
                        break;
                }
                //assign numbers
                RowCount = _list.Count();
                //collect file information
                GetMetadata(Setup.FilePath);
                //return list (even if null)
                return _list;
            }
            catch (Exception ex)
            {
                throw new Exception(ex.Message);
            }
        }
        /// <summary>
        /// sylvan reader for .xls, .xlsx, .xlsm, .xlsb
        /// </summary>
        private List<T>? SylvanReader<T>()
        {
            if (string.IsNullOrWhiteSpace(Setup.SheetName)) throw new ArgumentNullException("please setup excel connection run");

            try
            {

                string headerSchema = "";

                //build headers _headerList is List<string> containing headers
                foreach (var header in Setup.HeaderList)
                {
                    if (Setup.ColumnMapping.ContainsKey(header))
                    {
                        headerSchema += Setup.ColumnMapping[header];
                    }
                    else headerSchema += header + ", ";
                }
                //remove last 2 chars
                headerSchema = headerSchema.Substring(0, headerSchema.Length - 1);

                string[] parameters = Setup.ColumnMapping.Keys.ToArray();
                var options = new SylExcel.ExcelDataReaderOptions { Schema = SylExcel.ExcelSchema.NoHeaders, GetErrorAsNull = true };
                //collect data
                using (SylExcel.ExcelDataReader excelDataReader = SylExcel.ExcelDataReader.Create(Setup.FilePath, options))
                {
                    //loop to locate sheet
                    while (excelDataReader.WorksheetName != Setup.SheetName)
                    {
                        excelDataReader.NextResult();

                        if (excelDataReader.WorksheetName == null)
                        {
                            throw new Exception("didnt find the sheet");
                        }
                    }
                    //loop to find headers
                    for (int i = 0; i < Setup.StartRow; i++)
                    {
                        excelDataReader.Read();
                    }

                    // parse the schema, and use it to reinitialize the schema for the sheet.
                    var schema = Schema.Parse(headerSchema);
                    excelDataReader.InitializeSchema(schema.GetColumnSchema(), useHeaders: true);
                    var DataReader = excelDataReader.Select(parameters);
                    return DataReader.GetList<T>();
                }
            }
            catch(Exception ex)
            {
                throw new Exception(ex.Message);
            }
        }
        /// <summary>
        /// sylvan reader for .csv
        /// </summary>
        private List<T>? SylvanReaderCSV<T>()
        {
            try
            {
                List<string> headersList = new List<string>();
                List<string> neededList = new List<string>();
                //build headers
                foreach (var header in Setup.HeaderList)
                {
                    if (Setup.ColumnMapping.ContainsKey(header))
                    {
                        headersList.Add(Setup.ColumnMapping[header]);
                        neededList.Add(Setup.ColumnMapping[header]);
                    }
                    else headersList.Add(header);
                }

                string headersAll = string.Join(",", headersList.ToArray());
                string[] headersNeeded = neededList.ToArray();
                var schema = new SylCsv.CsvSchema(Schema.Parse(headersAll));
                var options = new SylCsv.CsvDataReaderOptions{  Schema = schema,};
                using (SylCsv.CsvDataReader csvDataReader = SylCsv.CsvDataReader.Create(Setup.FilePath, options)) //options
                {
                    var DataReader = csvDataReader.Select(headersNeeded);
                    return DataReader.GetList<T>();
                }

            }
            catch (Exception ex)
            {
                throw new Exception(ex.Message);
            }
        }

For a while I was thinking of somehow changing the Sylvan.dll to:

        public static IEnumerable<T> GetRecords<T>(this DbDataReader reader)
        {
            var binder = DataBinder.Create<T>(reader);
            while (reader.Read())
            {
                var item = (T)Activator.CreateInstance(typeof(T));
                binder.Bind(reader, item);
                yield return item;
            }
        }

but I have zero idea whether this would work. Could you add some extension that does that, but named GetRecordsUnsafe :D? So I could test it?

EDIT:

OK, that was dumb on my part; I just had to implement this as an extension...

        public static IEnumerable<T> GetList<T>(this DbDataReader reader)
        {
            var binder = DataBinder.Create<T>(reader);
            while (reader.Read())
            {
                var item = (T)Activator.CreateInstance(typeof(T));
                binder.Bind(reader, item);
                yield return item;
            }
        }

and change the two lines, in both the Excel and CSV readers, from:

                    var DataReader = csvDataReader.Select(headersNeeded);
                    return DataReader.GetList<T>();

to:

                    var DataReader = csvDataReader.Select(headersNeeded);
                    var binderOpts = new DataBinderOptions
                    {
                        InferColumnTypeFromMember = false,
                        BindingMode = DataBindingMode.Any
                    };
                    DataBinder.Create<ExcelInvoice>(DataReader, binderOpts);
                    return DataReader.GetList<T>().ToList();

This gave me back control of DataBinderOptions outside of the extension. And hey, it works like a charm :) No need to remember about the switch :)

PS: I know it's a misleading name, as it's not returning a list :)

Although it would be great to pass any value that doesn't match the forced data type as null, or skip it. I have an idea for a workaround using my previous extension and an additional class with the same names but everything as string :). The null assignment could be a workaround for CSV files; in companies, a lot of them are corrupted because there is no restriction on populating a field in the db, etc., with the same characters as the delimiter. And plenty of tools often allow only a CSV extract, making them a pain to work with.

Again, thanks for all the help :)

MarkPflug commented 2 years ago

Looks like you got it figured out then?

The only comment I'd make about your GetList method is that it appears all you've done is replace my :new() constraint with Activator.CreateInstance. These should be roughly equivalent in behavior; however, the :new() generic constraint is enforced by the compiler, while Activator.CreateInstance is only enforced at runtime: it throws an exception if T doesn't have a parameterless constructor. When given the choice between a compile-time error and a runtime error, I'd definitely prefer the compile-time error; it proves that all invocations of your extension method are correct and will work. Was the :new() constraint causing problems for you?
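
For comparison, the constrained form would look like this (a sketch of your GetList using the constraint instead of Activator):

// the compiler rejects any T without a public parameterless constructor
public static List<T> GetList<T>(this DbDataReader reader) where T : new()
{
    var binder = DataBinder.Create<T>(reader);
    var list = new List<T>();
    while (reader.Read())
    {
        var item = new T();            // no reflection call needed here
        binder.Bind(reader, item);
        list.Add(item);
    }
    return list;
}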

KMastalerz commented 2 years ago

Hello,

I'm stuck at passing null or skipping invalid data with CSV file.

Something in the line of:

var opts = new ExcelDataReaderOptions {
  GetErrorAsNull = true,
};

To skip a row whenever a shift happened, due to an additional tab in the populated CSV.

As for :new() vs Activator.CreateInstance: the extension I added was typed that way because I have written a generic extract function. This way I perform the extract in fewer lines of code. I have a class built for each functionality that represents the structure to be added to SQL Server. Unfortunately, I'm unable to get data from table to table or use the original table :(. As to why I need List<T>: the tool needs to iterate and reformat the data, then I have to go back, create a DataTable based on this list, and use BulkCopy.

It's useful for me to have that, but the lack of it is not generally a show stopper. For me it's just so the code will be easier to read in the future, considering all the classes are already known at compile time.

Excel: (https://gist.github.com/KMastalerz/8e96b4f70e29add7794cdd514ef6e919) something I also did with Dapper on SQL: https://gist.github.com/KMastalerz/c2cab8d2aa013a37f8c91a4d2cb7a910

After this, all I have to do is:

            //call function to get data
            //setup excel connection.
            //it's mandatory for this function.
            excelExtract.Setup.StartRow = UploadItem.FileStructure.StartRow;
            excelExtract.Setup.FilePath = UploadItem.ReportPath;
            excelExtract.Setup.RowHeader = "FileRow";
            excelExtract.Setup.SheetName = UploadItem.FileStructure.SheetName;
            excelExtract.Setup.HeaderList = UploadItem.FileStructure.HeaderList;
            excelExtract.Setup.ColumnMapping = UploadItem.FileStructure.ColumnList.ToDictionary(rec => rec.columnName, rec => rec.sqlColumn);
            excelExtract.Setup.TypeMapping = UploadItem.FileStructure.ColumnList.ToDictionary(rec => rec.sqlColumn, rec => rec.convertTo);

            //Additional line added to try extract
            //this is caused by incorrect data in csv files
            //that can have additional /t in field value
            //causing incorrect mapping of data.
            List<ExcelInvoice>? excelInvoices = null;

            try
            {
                excelInvoices = excelExtract.GetData<ExcelInvoice>();
            }
            catch (Exception ex)
            {
                throw new Exception(String.Format("Excel file {0}, could not be downloaded.\n{1}", UploadItem.FileTrueName, ex.Message));
            }

As you may be aware, I'm fairly new, so it's possibly not the greatest code, but it helps me.

MarkPflug commented 2 years ago

I'm stuck at passing null or skipping invalid data with CSV file.

The CSV reader doesn't support that directly. You can use the .Where extension method (from the Sylvan.Data package) to filter out the rows that don't meet your specific criteria:

using Sylvan.Data;
using Sylvan.Data.Csv;
using System.Data;
using System.Data.Common;

var schema = Schema.Parse("A:int,B:int,C:int,D:int");

const string BadCSV = "A,B,C,D\n1,2,3,4\n1,2,3,4,5\n1,2,3,4\nX,1,2,3\n";
var opts = new CsvDataReaderOptions { Schema = new CsvSchema(schema) };
var csvReader = CsvDataReader.Create(new StringReader(BadCSV), opts);

var badRowWriter = new StringWriter();

var filtered = csvReader.Where(r => FilterBadRows(r));

var data = new DataTable();
data.Load(filtered);

Console.WriteLine("BAD: ");
Console.WriteLine(badRowWriter.ToString());

Console.WriteLine("Done");

// this function filters out the rows that don't match the schema.
// a general-purpose method could be written that validates the data against the
// given schema; this example just hard-codes a couple of checks.
bool FilterBadRows(DbDataReader reader)
{
    bool good = true;

    // filter out the second row where there are 5 columns.
    if(csvReader.RowFieldCount != 4)
    {
        good = false;
    }
    var s = csvReader.GetString(0);
    // filter out the last row with "X" in the first column
    if(int.TryParse(s, out _) == false)
    {
        good = false;
    }
    if(!good)
    {
        // write bad rows to a separate output
        badRowWriter.Write(csvReader.GetRawRecordSpan());
    }

    return good;
}
KMastalerz commented 2 years ago

Hmm,

Do you know if DbDataReader supports LINQ? I was thinking I could write an extension method to replace columns with null when the data type is incorrect. Unfortunately, I cannot discard the entire row; I'd like the option to mark it as an incomplete line, due to the shift, so that users could check those cases in the source. There are not that many of them, but they crash the entire code.

Also, some of the rows don't have all the columns populated; wouldn't that also remove those lines? That's why I'm asking for the possibility above, to write something along the lines of:

csvReader = csvReader.ForEach(r => ReplaceBadDateWithNull(r, indexcol)); and csvReader = csvReader.ForEach(r => ReplaceBadDecimalWithNull(r, indexcol)); and csvReader = csvReader.ForEach(r => ReplaceBadDoubleWithNull(r, indexcol));

or does it only support .Where(), to filter them out?

KMastalerz commented 2 years ago

Oh, one thing I meant to mention: Sylvan.Data doesn't recognize DbDataReader; do you use System.Data.Common?