MarkPflug / Sylvan

A collection of .NET libraries, including the fastest general-purpose CSV parser for .NET.
MIT License
375 stars 38 forks source link

CsvDataReader loses track of quoted field when the first line is already read #258

Closed joelverhagen closed 4 days ago

joelverhagen commented 4 days ago

Hey Mark! Found this on NuGet Insights.

Problem: When the underlying text reader has ReadLine called before parsing, a quoted field gets corrupted. I call ReadLine first to validate headers against the expected (loose schema validation).

This works for most fields, but when there is a lot of quoting, it seems to cause a problem.

Minimal repro:

using Sylvan.Data.Csv;

// does not work
using var textReader = new StringReader(
    """
    ScanId,ScanTimestamp,LowerId,Identity,Id,Version,CatalogCommitTimestamp,Created,ResultType,SequenceNumber,Path,FileName,FileExtension,TopLevelFolder,FileLength,EdgeCases,AssemblyName,AssemblyVersion,Culture,PublicKeyToken,HashAlgorithm,HasPublicKey,PublicKeyLength,PublicKeySHA1,CustomAttributes,CustomAttributesFailedDecode,CustomAttributesTotalCount,CustomAttributesTotalDataLength
    d2971696-e6de-4f9f-a22e-abfa043cd436,2024-09-13T18:11:03.5793073Z,codat.lending,codat.lending/6.0.1,Codat.Lending,6.0.1,2024-09-13T13:36:38.025233Z,2024-09-13T13:33:27.59Z,ValidAssembly,2,lib/net8.0/Codat.Lending.dll,Codat.Lending.dll,.dll,lib,1313792,None,Codat.Lending,6.0.1.0,,,Sha1,false,,,"{""AssemblyCompany"":[{""0"":""Codat""}],""AssemblyConfiguration"":[{""0"":""Release""}],""AssemblyCopyright"":[{""0"":""Copyright (c) Codat 2024""}],""AssemblyDescription"":[{""0"":""Lending API: Our Lending API helps you make smarter credit decisions on small businesses by enabling you to pull your customers\u0027 latest data from accounting, banking, and commerce software they are already using. It also includes features to help providers verify the accuracy of data and process it more efficiently.\n\nThe Lending API is built on top of the latest accounting, commerce, and banking data, providing you with the most important data points you need to get a full picture of SMB creditworthiness and make a comprehensive assessment of your customers.\n\n[Explore product](https://docs.codat.io/lending/overview) | [See OpenAPI spec](https://github.com/codatio/oas)\n\n\n## Endpoints\n\n| Endpoints | Description |\n| :- |:- |\n| Companies | Create and manage your SMB users\u0027 companies. |\n| Connections | Create new and manage existing data connections for a company. |\n| Bank statements | Retrieve banking data from linked bank accounts. |\n| Sales | Retrieve standardized sales data from a linked commerce software. |\n| Financial statements | Financial data and reports from a linked accounting software. |\n| Liabilities | Debt and other liabilities. |\n| Accounts payable | Data from a linked accounting software representing money the business owes money to its suppliers. |\n| Accounts receivable | Data from a linked accounting software representing money owed to the business for sold goods or services. |\n| Transactions | Data from a linked accounting software representing transactions. |\n| Company info | View company information fetched from the source platform. |\n| Data integrity | Match mutable accounting data with immutable banking data to increase confidence in financial data. |\n| Excel reports | Download reports in Excel format. |\n| Manage data | Control how data is retrieved from an integration. |\n| File upload | Endpoints to manage uploaded files. |\n| Loan writeback | Implement the [loan writeback](https://docs.codat.io/lending/guides/loan-writeback/introduction) procedure in your lending process to maintain an accurate position of a loan during the entire lending cycle. |\n""}],""AssemblyFileVersion"":[{""0"":""6.0.1.0""}],""AssemblyInformationalVersion"":[{""0"":""6.0.1\u002B6e8902a011e9497bb875cb70f78fefc2f692b1f3""}],""AssemblyMetadata"":[{""0"":""RepositoryUrl"",""1"":""https://github.com/codatio/client-sdk-csharp.git""}],""AssemblyProduct"":[{""0"":""Codat.Lending""}],""AssemblyTitle"":[{""0"":""Codat.Lending""}],""CompilationRelaxations"":[{""0"":8}],""Debuggable"":[{""0"":2}],""Extension"":[{}],""RuntimeCompatibility"":[{""WrapNonExceptionThrows"":true}],""TargetFramework"":[{""0"":"".NETCoreApp,Version=v8.0"",""FrameworkDisplayName"":"".NET 8.0""}]}",[],14,2437
    """);

/*
// works
using var textReader = new StringReader(
    """
    ScanId,ScanTimestamp,LowerId,Identity,Id,Version,CatalogCommitTimestamp,Created,ResultType,SequenceNumber,Path,FileName,FileExtension,TopLevelFolder,FileLength,EdgeCases,AssemblyName,AssemblyVersion,Culture,PublicKeyToken,HashAlgorithm,HasPublicKey,PublicKeyLength,PublicKeySHA1,CustomAttributes,CustomAttributesFailedDecode,CustomAttributesTotalCount,CustomAttributesTotalDataLength
    d2971696-e6de-4f9f-a22e-abfa043cd436,2024-09-13T18:11:03.5793073Z,codat.lending,codat.lending/6.0.1,Codat.Lending,6.0.1,2024-09-13T13:36:38.025233Z,2024-09-13T13:33:27.59Z,ValidAssembly,2,lib/net8.0/Codat.Lending.dll,Codat.Lending.dll,.dll,lib,1313792,None,Codat.Lending,6.0.1.0,,,Sha1,false,,,"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",[],14,2437
    """);
*/

textReader.ReadLine();

using var csv = CsvDataReader.Create(textReader, new CsvDataReaderOptions { HasHeaders = false });

while (await csv.ReadAsync())
{
    Console.WriteLine("Line " + csv.RowNumber);
    for (var i = 0; i < csv.FieldCount; i++)
    {
        Console.WriteLine("  " + ((string)csv[i]));
    }
}

Expected output:

Line 1
  d2971696-e6de-4f9f-a22e-abfa043cd436
  2024-09-13T18:11:03.5793073Z
  codat.lending
  codat.lending/6.0.1
  Codat.Lending
  6.0.1
  2024-09-13T13:36:38.025233Z
  2024-09-13T13:33:27.59Z
  ValidAssembly
  2
  lib/net8.0/Codat.Lending.dll
  Codat.Lending.dll
  .dll
  lib
  1313792
  None
  Codat.Lending
  6.0.1.0

  Sha1
  false

  {"AssemblyCompany":[{"0":"Codat"}],"AssemblyConfiguration":[{"0":"Release"}],"AssemblyCopyright":[{"0":"Copyright (c) Codat 2024"}],"AssemblyDescription":[{"0":"Lending API: Our Lending API helps you make smarter credit decisions on small businesses by enabling you to pull your customers\u0027 latest data from accounting, banking, and commerce software they are already using. It also includes features to help providers verify the accuracy of data and process it more efficiently.\n\nThe Lending API is built on top of the latest accounting, commerce, and banking data, providing you with the most important data points you need to get a full picture of SMB creditworthiness and make a comprehensive assessment of your customers.\n\n[Explore product](https://docs.codat.io/lending/overview) | [See OpenAPI spec](https://github.com/codatio/oas)\n\n\n## Endpoints\n\n| Endpoints | Description |\n| :- |:- |\n| Companies | Create and manage your SMB users\u0027 companies. |\n| Connections | Create new and manage existing data connections for a company. |\n| Bank statements | Retrieve banking data from linked bank accounts. |\n| Sales | Retrieve standardized sales data from a linked commerce software. |\n| Financial statements | Financial data and reports from a linked accounting software. |\n| Liabilities | Debt and other liabilities. |\n| Accounts payable | Data from a linked accounting software representing money the business owes money to its suppliers. |\n| Accounts receivable | Data from a linked accounting software representing money owed to the business for sold goods or services. |\n| Transactions | Data from a linked accounting software representing transactions. |\n| Company info | View company information fetched from the source platform. |\n| Data integrity | Match mutable accounting data with immutable banking data to increase confidence in financial data. |\n| Excel reports | Download reports in Excel format. |\n| Manage data | Control how data is retrieved from an integration. |\n| File upload | Endpoints to manage uploaded files. |\n| Loan writeback | Implement the [loan writeback](https://docs.codat.io/lending/guides/loan-writeback/introduction) procedure in your lending process to maintain an accurate position of a loan during the entire lending cycle. |\n"}],"AssemblyFileVersion":[{"0":"6.0.1.0"}],"AssemblyInformationalVersion":[{"0":"6.0.1\u002B6e8902a011e9497bb875cb70f78fefc2f692b1f3"}],"AssemblyMetadata":[{"0":"RepositoryUrl","1":"https://github.com/codatio/client-sdk-csharp.git"}],"AssemblyProduct":[{"0":"Codat.Lending"}],"AssemblyTitle":[{"0":"Codat.Lending"}],"CompilationRelaxations":[{"0":8}],"Debuggable":[{"0":2}],"Extension":[{}],"RuntimeCompatibility":[{"WrapNonExceptionThrows":true}],"TargetFramework":[{"0":".NETCoreApp,Version=v8.0","FrameworkDisplayName":".NET 8.0"}]}
  []
  14
  2437

Actual output:

Line 1
  d2971696-e6de-4f9f-a22e-abfa043cd436,2024-09-13T18:11:03.5793073Z,codat.lending,codat.lending/6.0.1,Codat.Lending,6.0.1,2024-09-13T13:36:38.025233Z,2024-09-13T13:33:27.59Z,ValidAssembly,2,lib/net8.0/Codat.Lending.dll,Codat.Lending.dll,.dll,lib,1313792,None,Codat.Lending,6.0.1.0,,,Sha1,false,,,"{""AssemblyCompany"":[{""0"":""Codat""}],""AssemblyConfiguration"":[{""0"":""Release""}],""AssemblyCopyright"":[{""0"":""Copyright (c) Codat 2024""}],""AssemblyDescription"":[{""0"":""Lending API: Our Lending API helps you make smarter credit decisions on small businesses by enabling you to pull your customers\u0027 latest data from accounting, banking, and commerce software they are already using. It also includes features to help providers verify the accuracy of data and process it more efficiently.\n\nThe Lending API is built on top of the latest accounting, commerce, and banking data, providing you with the most important data points you need to get a full picture of SMB creditworthiness and make a comprehensive assessment of your customers.\n\n[Explore product](https://docs.codat.io/lending/overview)
   [See OpenAPI spec](https://github.com/codatio/oas)\n\n\n## Endpoints\n\n
   Endpoints
   Description
  \n
   :-
  :-
  \n
   Companies
   Create and manage your SMB users\u0027 companies.
  \n
   Connections
   Create new and manage existing data connections for a company.
  \n
   Bank statements
   Retrieve banking data from linked bank accounts.
  \n
   Sales
   Retrieve standardized sales data from a linked commerce software.
  \n
   Financial statements
   Financial data and reports from a linked accounting software.
  \n
   Liabilities
   Debt and other liabilities.
  \n
   Accounts payable
   Data from a linked accounting software representing money the business owes money to its suppliers.
  \n
   Accounts receivable
   Data from a linked accounting software representing money owed to the business for sold goods or services.
  \n
   Transactions
   Data from a linked accounting software representing transactions.
  \n
   Company info
   View company information fetched from the source platform.
  \n
   Data integrity
   Match mutable accounting data with immutable banking data to increase confidence in financial data.
  \n
   Excel reports
   Download reports in Excel format.
  \n
   Manage data
   Control how data is retrieved from an integration.
  \n
   File upload
   Endpoints to manage uploaded files.
  \n
   Loan writeback
   Implement the [loan writeback](https://docs.codat.io/lending/guides/loan-writeback/introduction) procedure in your lending process to maintain an accurate position of a loan during the entire lending cycle.
  \n""}],""AssemblyFileVersion"":[{""0"":""6.0.1.0""}],""AssemblyInformationalVersion"":[{""0"":""6.0.1\u002B6e8902a011e9497bb875cb70f78fefc2f692b1f3""}],""AssemblyMetadata"":[{""0"":""RepositoryUrl"",""1"":""https://github.com/codatio/client-sdk-csharp.git""}],""AssemblyProduct"":[{""0"":""Codat.Lending""}],""AssemblyTitle"":[{""0"":""Codat.Lending""}],""CompilationRelaxations"":[{""0"":8}],""Debuggable"":[{""0"":2}],""Extension"":[{}],""RuntimeCompatibility"":[{""WrapNonExceptionThrows"":true}],""TargetFramework"":[{""0"":"".NETCoreApp,Version=v8.0"",""FrameworkDisplayName"":"".NET 8.0""}]}",[],14,2437
MarkPflug commented 4 days ago

It looks like a bug in my delimiter detecting code. If you specify the delimiter explicitly Delimiter = ',' it works correctly. It looks like with this input it is detecting the vertical bar |. My algorithm is rather simple, in that it counts the frequency of candidate delimiters in the first line of the file without considering quoting. Do you actually need the delimiter detector feature, or can you always assume comma?

MarkPflug commented 4 days ago

FWIW, you can access headers before calling Read(). I read the first line internally do determine the headers, so you can use GetName(int ordinal), or GetColumnSchema() if you need to see what columns are in the file before processing.

joelverhagen commented 4 days ago

It looks like a bug in my delimiter detecting code. If you specify the delimiter explicitly Delimiter = ',' it works correctly. It looks like with this input it is detecting the vertical bar |. My algorithm is rather simple, in that it counts the frequency of candidate delimiters in the first line of the file without considering quoting. Do you actually need the delimiter detector feature, or can you always assume comma?

Wow, I didn't realize delimiter detection was happening! I would much rather be explicit anyway and set Delimiter = ','. I'll do that. Thanks for the tip.

FWIW, you can access headers before calling Read(). I read the first line internally do determine the headers, so you can use GetName(int ordinal), or GetColumnSchema() if you need to see what columns are in the file before processing.

My assertion works on the full header line (first line == expected line) so I'd need to join on comma or (more technically correct) serialize the string array to a CSV row to handle escaping. I was starting to go down that route but I think I prefer being lazy and using the explicit delimiter.

I am surprised the default behavior is delimiter auto-detection, instead of opting into that behavior by setting Delimiter = null or something. No big deal though. I hope others don't run into a wrong detection.

joelverhagen commented 4 days ago

Thanks for your fast response!

MarkPflug commented 4 days ago

Wow, I didn't realize delimiter detection was happening!

My documentation might suck, but that's one thing I actually call out. :P

My assertion works on the full header line

You can get the entire header line (record) by calling GetRawRecordSpan() after calling Create and before calling Read for the first time. The span will include the line ending character(s) though, so calling ReadLine on the TextReader might be easier.

Anyway, sounds like you got it figured out.

joelverhagen commented 4 days ago

Yup, I'm good to go. Thanks so much!