doy / spreadsheet-parsexlsx

parse XLSX files
http://metacpan.org/release/Spreadsheet-ParseXLSX
27 stars 35 forks source link

Missing functionality #15

Closed Tux closed 10 years ago

Tux commented 10 years ago

I wanted to replace Spreadsheet:XLSX with Spreadsheet:ParseXLSX in Spreadsheet::Read

  1. Spreadsheet::XLSX is dead
  2. Spreadsheet::ParseXLSX claims to be compatible with Spreadsheet::ParseExcel

I derive 2. from this note:

DESCRIPTION This module is an adaptor for Spreadsheet::ParseExcel that reads XLSX files.

It is however not a drop-in replacement

my $oFmt = Spreadsheet::ParseXLSX::FmtDefault->new;

is not available

doy commented 10 years ago

So I'm curious what you're actually expecting this to do. Spreadsheet::ParseXLSX->new actually returns a Spreadsheet::ParseExcel object, so I think just continuing to use Spreadsheet::ParseExcel::FmtDefault->new should work fine - it's what Spreadsheet::ParseXLSX does internally, anyway. I don't know that duplicating all of the internal classes that Spreadsheet::ParseExcel uses is necessarily a useful idea. The idea behind being a drop in adaptor is that the parser is different, but the classes that it ends up generating are still the same, so I think if you want one of those classes, you should just use it directly. If you want to submit a documentation patch for that, I'd certainly be open to that.

Tux commented 10 years ago

On Fri, 17 Jan 2014 16:05:33 -0800, Jesse Luehrs notifications@github.com wrote:

So I'm curious what you're actually expecting this to do.

Spreadsheet::Read is a wrapper module over spreadsheet parsers to feature a generic frontend.

    use Spreadsheet::Read;
    my $book  = ReadData ("test.csv", sep => ";");
    my $book  = ReadData ("test.sxc");
    my $book  = ReadData ("test.ods");
    my $book  = ReadData ("test.xls");
    my $book  = ReadData ("test.xlsx");
    my $book  = ReadData ($fh, parser => "xls");

Inside Read.pm, I use code like this:

my @parsers = ( [ csv => "Text::CSV_XS" ], [ csv => "Text::CSV_PP" ], # Version 1.05 and up [ csv => "Text::CSV" ], # Version 1.00 and up [ ods => "Spreadsheet::ReadSXC" ], [ sxc => "Spreadsheet::ReadSXC" ], [ xls => "Spreadsheet::ParseExcel" ], [ xlsx => "Spreadsheet::XLSX" ], [ prl => "Spreadsheet::Perl" ],

out of that list it picks the first available parser for a format. My goal now was to replace to dead Spreadsheet::XLSX with Spreadsheet::ParseXLSX. As the latter looks like being API compliant with Spreadsheet::ParseExcel, I change code like

    if ($io_ref) {
        $oBook = $parse_type eq "XLSX"
            ? Spreadsheet::XLSX->new ($io_ref)
            : Spreadsheet::ParseExcel->new (%parser_opts)->Parse ($io_ref);
        }
    else {
        $oBook = $parse_type eq "XLSX"
            ? Spreadsheet::XLSX->new ($txt)
            : Spreadsheet::ParseExcel->new (%parser_opts)->Parse ($txt);
        }

to

    if ($io_ref) {
        $oBook = $parse_type eq "XLSX"
            ? $can{xlsx} =~ m/::XLSX/
            ? Spreadsheet::XLSX->new ($io_ref)
            : Spreadsheet::ParseXLSX->new (%parser_opts)->parse ($io_ref)
            : Spreadsheet::ParseExcel->new (%parser_opts)->Parse ($io_ref);
        }
    else {
        $oBook = $parse_type eq "XLSX"
            ? $can{xlsx} =~ m/::XLSX/
            ? Spreadsheet::XLSX->new ($txt)
            : Spreadsheet::ParseXLSX->new (%parser_opts)->parse ($txt)
            : Spreadsheet::ParseExcel->new (%parser_opts)->Parse ($txt);
        }

assuming it just works

Spreadsheet::ParseXLSX->new actually returns a Spreadsheet::ParseExcel object, so I think just continuing to use Spreadsheet::ParseExcel::FmtDefault->new should work fine - it's what Spreadsheet::ParseXLSX does internally, anyway.

OK. that I didn't read from the docs, but it replaces the undefined error with errors for color functionality

I don't know that duplicating all of the internal classes that Spreadsheet::ParseExcel uses is necessarily a useful idea. The idea behind being a drop in adaptor is that the parser is different, but the classes that it ends up generating are still the same, so I think if you want one of those classes, you should just use it directly. If you want to submit a documentation patch for that, I'd certainly be open to that.

I will now have to chase different errors:

• formatted merged cells • unformatted merged cells • hidden cells • cells with color:

Failed test 'no warnings'

at t/62_fmt.t line 55.

There were 243 warning(s)

Previous test 0 ''

Argument "#000000" isn't numeric in numeric eq (==) at /pro/3gl/CPAN/Spreadsheet-Read/blib/lib/Spreadsheet/Read.pm line 243.

at /pro/3gl/CPAN/Spreadsheet-Read/blib/lib/Spreadsheet/Read.pm line 243.

Spreadsheet::Read::_xls_color("#000000") called at /pro/3gl/CPAN/Spreadsheet-Read/blib/lib/Spreadsheet/Read.pm line 530

Spreadsheet::Read::ReadData("files/attr.xlsx", "attr", 1) called at t/62_fmt.t line 15

The colors in Excel were defined like this:

sub _xlscolor { my ($clr, @clr) = @; defined $clr or return undef; @clr == 0 && $clr == 32767 and return undef; # Default fg color @clr == 2 && $clr == 0 and return undef; # No fill bg color @clr == 2 && $clr == 1 and ($clr, @clr) = ($clr[0]); @clr and return undef; # Don't know what to do with this "#" . lc Spreadsheet::ParseExcel->ColorIdxToRGB ($clr); } # _xls_color

which is called from the API like

            fgcolor => _xls_color ($FnT->{Color}),
            bgcolor => _xls_color (@{$FmT->{Fill}}),

where $FmT = $oWkC->{Format}; and $FnT = $FmT->{Font};

what happens, is that XLSX now returns the colors directly

Spreadsheet::Read::_xls_color (1, "#008000", "#FFFFFF")

instead of using the color table. I can change that in my functions, but it is not API-compatible. I can reduce all the errors like this

sub _xlscolor { my ($clr, @clr) = @; defined $clr or return undef; @clr == 0 && $clr =~ m/^#[0-9a-fA-F]+$/ and return $clr eq "#000000" ? undef : lc $clr;

print STDERR "# [ $clr @clr ]\n";

@clr == 0 && $clr == 32767 and return undef; # Default fg color
@clr == 2 && $clr ==     0 and return undef; # No fill bg color
@clr == 2 && $clr ==     1 and ($clr, @clr) = ($clr[0]);
@clr and return undef; # Don't know what to do with this
$clr =~ m/^#[0-9a-fA-F]+$/ and return $clr eq "#000000" ? undef : lc $clr;
"#" . lc Spreadsheet::ParseExcel->ColorIdxToRGB ($clr);
} # _xls_color

but I wonder if that is what it really means. I then still have this:

t/64_dates.t ... 1/?

Failed test 'Date format row 1 col 1'

at t/64_dates.t line 32.

got: 'd-mmm'

expected: undef

I do not want you to fix my module, but as it stands now, it is not just a drop-in replacement

$ diff -w t/dates 6c6

< my $tests = 69;

my $tests = 71; 11c11

< Spreadsheet::Read::parses ("xls") or

Spreadsheet::Read::parses ("xlsx") or 17c17

< ok ($xls = ReadData ("files/Dates.xls", attr => 1, dtfmt => "yyyy-mm-dd"), "Excel Date testcase");

ok ($xls = ReadData ("files/Dates.xlsx", attr => 1, dtfmt => "yyyy-mm-dd"), "Excel Date testcase"); 19,20c19,24 < my $ss = $xls->[1];

< my $attr = $ss->{attr};

SKIP: { ok (my $ss = $xls->[1], "sheet"); ok (my $attr = $ss->{attr}, "attr");

defined $attr->[2][1]{format} or
  skip "$xls->[0]{parser} $xls->[0]{version} does not reliably support formats", 68;

55,66c59 < < # Below can only be checked when SS::PE 0.34 is out < #use DDumper; < #foreach my $r (1..4,6..7) { < # foreach my $c (1..5) { < # my $cell = cr2cell ($c, $r); < # my $fmt = $ss->{attr}[$c][$r]{format}; < # defined $ss->{$cell} or next; < # printf STDERR "# attr %s: %-22s %s\n", < # $cell, $ss->{$cell}, defined $fmt ? "'$fmt'" : ""; < # }

< # }

}

H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.19 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

doy commented 10 years ago

So the problem with colors is that XLSX files themselves are allowed to specify colors directly rather than only being allowed to use the color table. I figured it would be more straightforward to just always return the actual color string rather than sometimes returning a color string and other times returning an index (and munging up the color table to include things that weren't actually in the color table in the file didn't seem like a good idea). It's true that this should at least be documented, though.

You mentioned on #ox that the issue with d-mmm was actually a bug in Spreadsheet::ParseExcel - were there any other issues you noticed other than those two?

Tux commented 10 years ago

On Wed, 22 Jan 2014 10:08:19 -0800, Jesse Luehrs notifications@github.com wrote:

So the problem with colors is that XLSX files themselves are allowed to specify colors directly rather than only being allowed to use the color table. I figured it would be more straightforward to just always return the actual color string rather than sometimes returning a color string and other times returning an index (and munging up the color table to include things that weren't actually in the color table in the file didn't seem like a good idea).

But that would better match what the older Excel format does. Read on. I can live with the current state of affairs with one single exception at the moment.

It's true that this should at least be documented, though.

What then should also be documented is the values for color index 32767 (default fg color). You now return "#000000" where I can detect undefined in Spreadsheet::ParseExcel

Withe this code: --8<---

Convert a single color (index) to a color

sub _xls_color { my $clr = shift; defined $clr or return undef; $clr eq "#000000" and return undef; $clr =~ m/^#[0-9a-fA-F]+$/ and return lc $clr; $clr == 0 || $clr == 32767 and return undef; # Default fg color return "#" . lc Spreadsheet::ParseExcel->ColorIdxToRGB ($clr); } # _xls_color

Convert a fill [ $pattern, $front_color, $back_color ] to a single background

sub _xlsfill { my ($p, $fg, $bg) = @; defined $p or return undef; $p == 32767 and return undef; # Default fg color $p == 0 && !defined $bg and return undef; # No fill bg color $p == 1 and return _xls_color ($fg); $bg < 8 || $bg > 63 and return undef; # see Workbook.pm#106 return _xls_color ($bg); } # _xls_fill -->8--- and later: --8<--- fgcolor => _xls_color ($FnT->{Color}), bgcolor => _xls_fill (@{$FmT->{Fill}}), -->8---

I pass all color tests for ParseExcel and ParseXLSX, but I do not like the $clr eq "#000000" line in _xls_color ()

You mentioned on #ox that the issue with d-mmm was actually a bug in Spreadsheet::ParseExcel - were there any other issues you noticed other than those two?

Yes two things: • the rounding of percentages, but I have not yet been able to localize the exact point of difference. The only value from my test that acts different is 0.5% which now gets rounded to 1% in your parser • $FmT->{Hidden} not set on hidden cells

H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/ using perl5.00307 .. 5.19 porting perl5 on HP-UX, AIX, and openSUSE http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/ http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

doy commented 10 years ago

The default color is now returned as undef, and I've added support for hidden and locked cells. The rounding issue with percents appears to be a floating point edge case issue as discussed on IRC (the results are different for perls configured with and without -ld), so I'll leave this alone unless some evidence comes up that this is actually a bug. I'll get a release out shortly.