Tux / Text-CSV_XS

perl5 module for composition and decomposition of comma-separated values

Issue when parsing a file with mixed EOL #59

Open maros opened 1 month ago

maros commented 1 month ago

I have a slightly broken file with mixed EOL (\n, \r\n, and bare \r; according to https://metacpan.org/pod/Text::CSV_XS#eol this should be handled automatically). Parsing it silently skips parts of the file, and no error is reported.

use 5.038;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({
    sep_char => ';',
});

# Write test data
open my $fhw, '>', 'csvtest.csv' or die "csvtest.csv: $!\n";
print $fhw "Antelope;snort\r\n";
print $fhw "Badger;growl\n"; # only newline
print $fhw "Bat;screech\r\n";
print $fhw "Bear;roar\r"; # only carriage return - no newline
print $fhw "Bee;buzz\r\n";
print $fhw "Camel;grunt\r\n";
print $fhw "Crow;caw\n"; # only newline
print $fhw "Deer;bellow\r\n";
print $fhw "Dolphin;click\r\n";
$fhw->close;

# Read test data
open my $fhr, '<', 'csvtest.csv' or die "csvtest.csv: $!\n";
while (my $row = $csv->getline($fhr)) {
    say sprintf '%s makes %s',($row->[0]||'-'),($row->[1]||'-');
}

# Remove test data
unlink 'csvtest.csv';

The output of the given example is:

Antelope makes snort
Badger makes growl
Bat makes screech
Bear makes roar
Bee makes buzz
- makes -
Camel makes grunt
- makes -
- makes -
- makes -

There is an empty row between 'Bee' and 'Camel' that shouldn't be there, and all rows after 'Camel' are skipped.

Tested with Text::CSV_XS 1.56 and perl 5.40.0
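
Before looking at parser behaviour, it can help to confirm which line endings a file really contains. The following is a minimal sketch using core Perl only; `count_eols` is a hypothetical helper name, not part of Text::CSV_XS, and the sample string is a shortened version of the test data above.

```perl
use strict;
use warnings;

# count_eols is a hypothetical helper, not part of Text::CSV_XS.
# It counts each line-ending style found in a string.
sub count_eols {
    my ($data) = @_;
    my %count = (crlf => 0, lf => 0, cr => 0);
    # Alternation order matters: \r\n is tried first so it is not
    # counted as a lone \r followed by a lone \n.
    while ($data =~ /(\r\n|\n|\r)/g) {
        my $key = $1 eq "\r\n" ? "crlf" : $1 eq "\n" ? "lf" : "cr";
        $count{$key}++;
    }
    return \%count;
}

my $sample = "Antelope;snort\r\nBadger;growl\nBear;roar\rBee;buzz\r\n";
my $c = count_eols ($sample);
printf "CRLF: %d, LF: %d, CR: %d\n", @{$c}{qw( crlf lf cr )};
```

For a real file, the data would need to be read in raw mode (for example with `binmode`) so that no I/O layer translates the line endings before they are counted.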

Tux commented 1 week ago

This one is relatively easy to explain, and maybe requires additional documentation. I value your feedback on this.

The docs say that \r, \r\n, and \n are all valid, which still holds, but internally the parser "remembers" the EOL it found, for efficiency. The workaround in your case is to make the parser "forget" what it found:

use 5.038;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({
    sep_char  => ";",
    auto_diag => 1,
    });

my $tfn = "issue-59.csv";

# Write test data
open my $fh, ">", $tfn or die "$tfn: $!\n";
print   $fh "Antelope;snort\r\n";
print   $fh "Badger;growl\n"; # only newline
print   $fh "Bat;screech\r\n";
print   $fh "Bear;roar\r"; # only carriage return - no newline
print   $fh "Bee;buzz\r\n";
print   $fh "Camel;grunt\r\n";
print   $fh "Crow;caw\n"; # only newline
print   $fh "Deer;bellow\r\n";
print   $fh "Dolphin;click\r\n";
close   $fh;

# Read test data
open $fh, "<", $tfn or die "$tfn: $!\n";
while (my $row = $csv->getline ($fh)) {
    printf "%s makes %s\n", map { $_ || "-" } $row->[0], $row->[1];
    $csv->eol (undef); # Forget what EOL it found
    }

# Remove test data
unlink $tfn;

Antelope makes snort
Badger makes growl
Bat makes screech
Bear makes roar
Bee makes buzz
Camel makes grunt
Crow makes caw
Deer makes bellow
Dolphin makes click
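
A different route, not proposed in this thread, would be to normalize the line endings before the parser sees the data, so the parser only ever meets one EOL style. The sketch below uses core Perl only; `normalize_eol` is a hypothetical helper name, and it assumes the file is small enough to hold in memory and has no line breaks embedded in quoted fields (those would be rewritten too).

```perl
use strict;
use warnings;

# normalize_eol is a hypothetical helper, not part of Text::CSV_XS.
# It rewrites every \r\n and every lone \r to \n.
# Caveat: line breaks embedded in quoted fields are rewritten as well.
sub normalize_eol {
    my ($data) = @_;
    $data =~ s/\r\n?/\n/g;
    return $data;
}

my $raw   = "Bear;roar\rBee;buzz\r\nCamel;grunt\r\nCrow;caw\n";
my $clean = normalize_eol ($raw);

# An in-memory file handle lets the cleaned data be read line by line,
# the same way it would be read from a file on disk.
open my $fh, "<", \$clean or die "in-memory open: $!\n";
while (my $line = <$fh>) {
    chomp $line;
    print "$line\n";
}
close $fh;
```

In a real script the `<$fh>` loop would presumably be replaced by `$csv->getline ($fh)` on the same in-memory handle. Whether this is preferable to resetting `eol` after each row depends on file size and on whether fields can contain embedded newlines.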

Options might be

I don't think that last option is viable, as it is backwards incompatible and might break existing scripts. I don't know what the impact (on speed) would be for that new option.

Tux commented 1 week ago

And sorry for the late reply. Been busy.