Issue when parsing a file with mixed EOL

maros commented 1 month ago

I have a slightly broken file with mixed EOLs (\n, \r\n, and bare \r; according to https://metacpan.org/pod/Text::CSV_XS#eol these should be handled automatically). Parsing it causes parts of the file to be skipped silently, and no error is reported.

use 5.038;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({
    sep_char => ';',
});

# Write test data
open my $fhw, '>', 'csvtest.csv' or die "csvtest.csv: $!\n";
print $fhw "Antelope;snort\r\n";
print $fhw "Badger;growl\n"; # only newline
print $fhw "Bat;screech\r\n";
print $fhw "Bear;roar\r"; # only carriage return - no newline
print $fhw "Bee;buzz\r\n";
print $fhw "Camel;grunt\r\n";
print $fhw "Crow;caw\n"; # only newline
print $fhw "Deer;bellow\r\n";
print $fhw "Dolphin;click\r\n";
$fhw->close;

# Read test data
open my $fhr, '<', 'csvtest.csv' or die "csvtest.csv: $!\n";
while (my $row = $csv->getline($fhr)) {
    say sprintf '%s makes %s',($row->[0]||'-'),($row->[1]||'-');
}

# Remove test data
unlink 'csvtest.csv';

The output of the given example is:

Antelope makes snort
Badger makes growl
Bat makes screech
Bear makes roar
Bee makes buzz
- makes -
Camel makes grunt
- makes -
- makes -
- makes -

There is an empty row between 'Bee' and 'Camel' that shouldn't be there, and all rows after 'Camel' are skipped.

Tested with Text::CSV_XS 1.56 and perl 5.40.0

Tux commented 1 week ago

This one is relatively easy to explain, and it may require additional documentation. I value your feedback on this.

What the docs say is that \r, \r\n, and \n are all valid, which still holds, but what happens inside is that the parser "remembers" the found EOL for efficiency. The workaround in your case is to make the parser "forget" what it found:

use 5.038;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({
    sep_char  => ";",
    auto_diag => 1,
    });

my $tfn = "issue-59.csv";

# Write test data
open my $fh, ">", $tfn or die "$tfn: $!\n";
print   $fh "Antelope;snort\r\n";
print   $fh "Badger;growl\n"; # only newline
print   $fh "Bat;screech\r\n";
print   $fh "Bear;roar\r"; # only carriage return - no newline
print   $fh "Bee;buzz\r\n";
print   $fh "Camel;grunt\r\n";
print   $fh "Crow;caw\n"; # only newline
print   $fh "Deer;bellow\r\n";
print   $fh "Dolphin;click\r\n";
close   $fh;

# Read test data
open $fh, "<", $tfn or die "$tfn: $!\n";
while (my $row = $csv->getline ($fh)) {
    printf "%s makes %s\n", map { $_ || "-" } $row->[0], $row->[1];
    $csv->eol (undef); # Forget what EOL it found
    }

# Remove test data
unlink $tfn;

→

Antelope makes snort
Badger makes growl
Bat makes screech
Bear makes roar
Bee makes buzz
Camel makes grunt
Crow makes caw
Deer makes bellow
Dolphin makes click

Options might be:

- document the current behavior better, possibly with the above example
- add a new option eol_nocache, defaulting to false, that possibly slows down parsing when enabled
- make no-cache the default

I don't think that last option is viable, as it is backwards incompatible and might break existing scripts. I don't know what the speed impact of that new option would be.
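
For readers who cannot touch the read loop, a different workaround than resetting eol after every record is to normalize the line endings before the parser sees them. The following is only a minimal sketch under stated assumptions: read_csv_normalized is a hypothetical helper, not part of Text::CSV_XS, and it assumes the file fits in memory and that no quoted field contains an embedded \r that must be preserved.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Hypothetical helper (NOT part of Text::CSV_XS): slurp the file, rewrite
# every \r\n and every lone \r to \n, then parse from an in-memory handle.
sub read_csv_normalized {
    my ($file, %attr) = @_;

    # :raw so no I/O layer translates line endings behind our back
    open my $in, "<:raw", $file or die "$file: $!\n";
    my $data = do { local $/; <$in> } // "";
    close $in;

    # \r optionally followed by \n becomes a single \n.
    # Caveat: this also rewrites \r inside quoted fields.
    $data =~ s/\r\n?/\n/g;

    my $csv = Text::CSV_XS->new ({ auto_diag => 1, %attr });
    open my $fh, "<", \$data or die "in-memory open: $!\n";
    my @rows;
    while (my $row = $csv->getline ($fh)) {
        push @rows, $row;
    }
    close $fh;
    return \@rows;
}
```

It could be called as read_csv_normalized ("issue-59.csv", sep_char => ";"), which returns a reference to a list of rows. Because the parser then only ever sees \n, there is no remembered EOL to reset. The trade-off is the extra pass over the data and the loss of the original line endings inside quoted fields, so the eol (undef) approach remains preferable when those matter.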

Tux commented 1 week ago

And sorry for the late reply. Been busy.

Tux / Text-CSV_XS

Issue when parsing a file with mixed EOL #59