JoshClose / CsvHelper

Library to help reading and writing CSV files
http://joshclose.github.io/CsvHelper/

Rollback part of commit 7dbb3fe710466ec2f573f81e19c7130ec1c7a365 #2273

Open jzabroski opened 1 week ago

jzabroski commented 1 week ago

Describe the bug

I don't know why the code now throws an exception where it used to return null. The commit where this changed just says it was done to remove nullability, but that doesn't explain why, given how useful returning null is. There is no clear use case for the new behavior, and it causes an exception to be thrown on an empty file, which is highly undesirable.

In any event, the lack of a migration guide from v30 to the current version is also making upgrading difficult and questionable. We're on v31, but given the other bug I filed, I'm not planning to upgrade further.

To Reproduce

https://github.com/JoshClose/CsvHelper/commit/7dbb3fe710466ec2f573f81e19c7130ec1c7a365#diff-e1cbba681f102065fada3e91fecbfbc458195a01974189a3483f75e72137afdaL733-L736

Fails with the following:

Empty.csv in CsvHelperBug.csproj

    first,last,dob

SampleMap.cs

    public class SampleMap : ClassMap<SampleLine>
    {
        [SuppressMessage("ReSharper", "VirtualMemberCallInConstructor")]
        public SampleMap()
        {
            Map(x => x.FirstName).Name("first");
            Map(x => x.LastName).Name("last");
            Map(x => x.DateOfBirth).Name("dob");
        }
    }

SampleLine.cs

    public class SampleLine
    {
        public string FirstName { get; set; }

        public string LastName { get; set; }

        public DateTime? DateOfBirth { get; set; }
    }

CsvHelperBug/Test.cs

    var resourceName = "CsvHelperBug.Empty.csv"; // Modify name based on what csproj your repro is called
    using (var streamReader = new StreamReader(GetType().Assembly.GetManifestResourceStream(resourceName)))
    {
        using (var csvReader = new CsvReader(streamReader, new CsvConfiguration(CultureInfo.CurrentCulture) { HasHeaderRecord = true }))
        {
            csvReader.Context.RegisterClassMap<SampleMap>();
            var results = new List<SampleLine>();
            while (await csvReader.ReadAsync())
            {
                // GetRecord now throws here on the header-only file instead of returning null (see linked commit)
                var importLine = csvReader.GetRecord<SampleLine>();
                if (importLine == null)
                {
                    // No record found
                    continue;
                }
                results.Add(importLine);
            }
            return results;
        }
    }

Additionally, the following fails with "No header record was found" (wtf?):

    var resourceName = "CsvHelperBug.Empty.csv";
    using (var streamReader = new StreamReader(GetType().Assembly.GetManifestResourceStream(resourceName)))
    {
        using (var csvReader = new CsvReader(streamReader, new CsvConfiguration(CultureInfo.CurrentCulture) { HasHeaderRecord = true }))
        {
            csvReader.Context.RegisterClassMap<SampleMap>();
            csvReader.ReadHeader(); // throws CsvHelper.ReaderException: No header record was found.
        }
    }

Expected behavior

Roll this back! It doesn't make any sense in the perfectly valid scenario where there are no rows.

Screenshots: n/a

Additional context: n/a

JoshClose commented 1 week ago

The idea is that you should never have a null record. The issue here is that ReadAsync should be returning false.
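
With that, the repro loop above would simply come back empty for Empty.csv instead of throwing. A rough sketch of the expected behavior (not shipped code):

    var results = new List<SampleLine>();
    while (await csvReader.ReadAsync())                   // should report that there are no data rows...
    {
        results.Add(csvReader.GetRecord<SampleLine>());   // ...instead of GetRecord throwing here
    }
    // results.Count == 0, no exception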

Does this sound reasonable?

As far as the ReadHeader thing goes, that code hasn't changed in years. You should be calling Read before ReadHeader, but there could probably be a better exception message.
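
For the second repro snippet, that means reading a row before parsing it as the header, roughly (sketch):

    if (csvReader.Read())        // advance to the header row first
    {
        csvReader.ReadHeader();  // now there is a current row to treat as the header
    }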

jzabroski commented 1 week ago

That would work for me. Can you push a fix today?

JoshClose commented 1 week ago

Working on it right now.

JoshClose commented 1 week ago

I have a dilemma.

The way things work is row by row. That means you call Read every time you want a new row, and you can then act on the given row. If you have a header record, you're supposed to call ReadHeader, which reads the current row as headers. Then you call Read again to get to the first record. This allows the user to handle each row however they like. When using GetRecords, everything is handled for you.

The issue here is that GetRecord is calling ReadHeader for you if you haven't called it yet. That was a convenience for people, but it presents a problem here, because there might not be another row. Returning null is a bad way to handle that. With nullable enabled, it becomes a pain to handle null records when there is almost no chance of getting one.

I think the correct solution here is to have GetRecord stop calling ReadHeader and Read for you. This may be a little annoying, but it makes things consistent.

    csv.Read();
    csv.ReadHeader();
    while (csv.Read())
    {
        var record = csv.GetRecord<Foo>();
    }

The other option is to have Read call ReadHeader and read a second time, returning false when there isn't another record. I don't like this idea because it doesn't allow the user to do any other inspection on the header record and automatically goes past it. If you're using Read instead of GetRecords, you've opted into a more manual way of reading.
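
For comparison, that second option would collapse the caller side to something like this (just a sketch, not implemented):

    while (csv.Read())   // would internally consume the header row when HasHeaderRecord is true
    {
        var record = csv.GetRecord<Foo>();
    }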

What do you think?

jzabroski commented 1 week ago

I think if CsvConfiguration specifies there is a header record, reading should handle the header for you.

ReadHeader's documentation should add a remark that it should only be used for highly customized or niche scenarios, and that it is primarily intended for internal use when CsvConfiguration.HasHeaderRecord is true.

Otherwise, what is the point of the CsvConfiguration.HasHeaderRecord property?

JoshClose commented 1 week ago

> what is the point of the CsvConfiguration.HasHeaderRecord property?

Only used for GetRecords.
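
For example (a sketch of the automatic path the setting applies to):

    var config = new CsvConfiguration(CultureInfo.CurrentCulture) { HasHeaderRecord = true };
    using (var csv = new CsvReader(streamReader, config))
    {
        csv.Context.RegisterClassMap<SampleMap>();
        var records = csv.GetRecords<SampleLine>().ToList();   // header handled automatically
    }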

jzabroski commented 1 week ago

> what is the point of the CsvConfiguration.HasHeaderRecord property?
>
> Only used for GetRecords.

Then ReadHeader should have a remark added that it requires CsvConfiguration.HasHeaderRecord to be true, but allows the consumer to read the header directly.

And yes, your Read fix should work.

In my use case, we read and validate one row at a time with FluentValidation, so that we can parse really large CSV files in batches, validate the records, save only the good rows to a db, and report the bad rows. Otherwise I'm not certain we would need to read one row at a time.

A lot of financial data comes as zipped CSV files, so we deflate them, which is memory intensive. To avoid running out of memory, and to keep db insert/update/delete operations scalable, we process in batches of 1000.
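
Roughly, the pipeline looks like the sketch below. This is hypothetical code, not our real system: SampleLineValidator's rules, ReportBadRowAsync, and SaveBatchAsync are illustrative placeholders, and it assumes the explicit Read/ReadHeader pattern discussed above (needs CsvHelper, CsvHelper.Configuration, FluentValidation, System.IO.Compression, and System.Globalization usings).

    // Illustrative FluentValidation rules for the SampleLine type from the repro.
    public class SampleLineValidator : AbstractValidator<SampleLine>
    {
        public SampleLineValidator()
        {
            RuleFor(x => x.FirstName).NotEmpty();
            RuleFor(x => x.LastName).NotEmpty();
        }
    }

    public async Task ImportAsync(string zipPath)
    {
        using var zip = ZipFile.OpenRead(zipPath);
        using var entryStream = zip.Entries[0].Open();   // assume one csv per zip
        using var reader = new StreamReader(entryStream);
        using var csv = new CsvReader(reader, new CsvConfiguration(CultureInfo.CurrentCulture) { HasHeaderRecord = true });
        csv.Context.RegisterClassMap<SampleMap>();

        var validator = new SampleLineValidator();
        var batch = new List<SampleLine>(1000);

        await csv.ReadAsync();   // advance to the header row
        csv.ReadHeader();
        while (await csv.ReadAsync())
        {
            var row = csv.GetRecord<SampleLine>();
            var result = validator.Validate(row);
            if (!result.IsValid)
            {
                await ReportBadRowAsync(row, result.Errors);   // hypothetical reporting hook
                continue;
            }

            batch.Add(row);
            if (batch.Count == 1000)
            {
                await SaveBatchAsync(batch);   // hypothetical db save
                batch.Clear();
            }
        }

        if (batch.Count > 0)
        {
            await SaveBatchAsync(batch);   // flush the final partial batch
        }
    }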