JoshClose / CsvHelper

Library to help reading and writing CSV files
http://joshclose.github.io/CsvHelper/

Handling bad CSV records in CsvHelper #803

Closed zak-nye closed 7 years ago

zak-nye commented 7 years ago

Also posted on StackOverflow

I would like to be able to iterate through all records in a CSV file and add all the good records to one collection and handle all the "bad" ones separately. I don't seem to be able to do this and I think I must be missing something.

If I attempt to catch the BadDataException, then subsequent reads will fail, meaning I cannot carry on and read the rest of the file -

while (true)
{
    try
    {
        if (!reader.Read())
            break;

        var record = reader.GetRecord<Record>();
        goodList.Add(record);
    }
    catch (BadDataException ex)
    {
        // Exception is caught but I won't be able to read further rows in file
        // (all further reader.Read() result in same exception thrown)
        Console.WriteLine(ex.Message);
    }
}

The other option discussed is setting the BadDataFound callback action to handle it -

reader.Configuration.BadDataFound = x =>
{
    Console.WriteLine($"Bad data: <{x.RawRecord}>");
};

However, although the callback is called, the bad record still ends up in my "good list".

Is there some way I can query the reader to see if the record is good before adding it to my list?

For this example my Record definition is -

class Record
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public int Age { get; set; }
}

And the data (first row bad, second row good) -

"Jo"hn","Doe",43
"Jane","Doe",21

Interestingly handling a missing field with MissingFieldException seems to function exactly as I would like - the exception is thrown but subsequent rows are still read ok.
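
For reference, the missing-field version I mean looks roughly like this (same Record class and goodList as above):

while (reader.Read())
{
    try
    {
        goodList.Add(reader.GetRecord<Record>());
    }
    catch (MissingFieldException ex) // CsvHelper's MissingFieldException
    {
        // Unlike BadDataException, the next reader.Read() still succeeds here.
        Console.WriteLine(ex.Message);
    }
}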

JoshClose commented 7 years ago

You could do something like this.

void Main()
{
    using (var stream = new MemoryStream())
    using (var writer = new StreamWriter(stream))
    using (var reader = new StreamReader(stream))
    using (var csv = new CsvReader(reader))
    {
        writer.WriteLine("FirstName,LastName");
        writer.WriteLine("\"Jon\"hn\"\",\"Doe\"");
        writer.WriteLine("\"Jane\",\"Doe\"");
        writer.Flush();
        stream.Position = 0;

        var good = new List<Test>();
        var bad = new List<string>();
        var isRecordBad = false;
        csv.Configuration.BadDataFound = context =>
        {
            isRecordBad = true;
            bad.Add(context.RawRecord);
        };
        while (csv.Read())
        {
            var record = csv.GetRecord<Test>();
            if (!isRecordBad)
            {
                good.Add(record);
            }

            isRecordBad = false;
        }

        good.Dump();
        bad.Dump();
    }
}

public class Test
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

bachph commented 5 years ago

Isn't it enough to check for IsFieldBad?

csv.Configuration.BadDataFound = context =>
{
    bad.Add(context.RawRecord);
};
while (csv.Read() && !csv.Context.IsFieldBad)
{
    good.Add(csv.GetRecord<Test>());
}

Then you do not have to capture the isRecordBad flag and the code looks a bit cleaner.

matiaspecile commented 4 years ago

I can start a new question if preferred, but it seemed appropriate here:

What is the proper way to deal with this if the client's implementation uses GetRecords<T>() instead of a while loop with GetRecord<T>()?

JoshClose commented 4 years ago

@bachph Wouldn't that stop reading when a bad field is hit?

JoshClose commented 4 years ago

@matiaspecile I don't think you can. GetRecords will return both good and bad.
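
GetRecords won't separate them for you; if you want to filter while enumerating, you'd have to apply the flag approach above yourself. Just a sketch, using the older settable Configuration from the earlier example and relying on GetRecords yielding records lazily:

var good = new List<Test>();
var bad = new List<string>();
var isRecordBad = false;
csv.Configuration.BadDataFound = context =>
{
    isRecordBad = true;
    bad.Add(context.RawRecord);
};

// BadDataFound fires while each row is being parsed, so the flag is already
// set by the time the corresponding record is yielded.
foreach (var record in csv.GetRecords<Test>())
{
    if (!isRecordBad)
    {
        good.Add(record);
    }
    isRecordBad = false;
}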

matiaspecile commented 4 years ago

@JoshClose thanks for the reply.

Per the docs, I was able to suppress the exception by nulling BadDataFound. If I assign a method to this property at all, it throws, asking if I meant to use GetRecords instead of GetRecord.

I'm populating MemberMaps dynamically, so I don't think I can use GetRecord. Is this accurate, or is there something I'm missing?

If, in the worst of cases, I can't handle bad rows, how dangerous is it for me to just null BadDataFound? From your reply, I gather that I will receive the bad rows, so is it possible the records parsed from these rows will have been parsed incorrectly?

thanks again for your insight,

JoshClose commented 4 years ago

You should always be able to use GetRecord. GetRecords just loops that.

Do you have bad rows that you need to worry about? It's for malformed files. Hopefully you shouldn't need to use it.

matiaspecile commented 4 years ago

Unfortunately, yes. It's the first instance amongst thousands of files parsed successfully, so at least the rate is very low.

I will try again with GetRecord looped. Thanks very much for the clarity on this.

CoderHead commented 4 years ago

This is a really great answer, but I'm finding that in version 12.3.2, when a bad record is encountered, the package throws an exception instead of hitting the BadDataFound delegate. To be clear, the bad record I'm using to test is the last row and is completely malformed (comma instead of pipe, two columns instead of three), and the rest of the file is parsed fine. This is a business requirement for CSV files that are manually generated and therefore prone to human error.

SamplePipeDelimitedInput.txt

bonzai95 commented 4 years ago

See what kind of exception is thrown. You might need to give it your own delegate function, like this...

csv.Configuration.MissingFieldFound = (headerNames, index, context) =>
{
    isRecordBad = true;
    bad.Add(context.RawRecord);
};

csv.Configuration.BadDataFound = context =>
{
    isRecordBad = true;
    bad.Add(context.RawRecord);
};

The docs say BadDataFound is specifically

"A field has bad data if it contains a quote and the field is not quoted (escaped)"

saadk01 commented 3 years ago

For anyone looking, to make Josh's example work in version 21, the following worked for me without any change in output:

using (var reader = new StreamReader(csvFileStream))
{
    var malformedRow = false;
    var conf = new CsvConfiguration(CultureInfo.InvariantCulture)
    {
        // If you want to change the delimiter.
        Delimiter = "\t",
        BadDataFound = context =>
        {
            malformedRow = true;
            // Do what you need to do with the malformed row. For example:
            errorRecsCollection.Add(context.Parser.RawRecord);
        }
    };

    using (var csvReader = new CsvReader(reader, conf))
    {
        // No change required here.
    }
}

snympi commented 3 years ago

@saadk01 thanks. This was the only reference I could find explaining how to use BadDataFound because the usual CsvReader.Configuration.BadDataFound event is now read-only and cannot be updated.

nafberger commented 2 years ago

@JoshClose seems to me that the issue it not resolved at all. GetRecords so far works only with super clean records no errors allowed, which is unreasonable. so the only way is to use GetRecord in a try catch. far from optimal