doy / spreadsheet-parsexlsx

parse XLSX files
http://metacpan.org/release/Spreadsheet-ParseXLSX
27 stars 35 forks source link

xlsx fails to load due to non-standard stylesheet and workbook locations #79

Open laurielounge opened 6 years ago

laurielounge commented 6 years ago

Hi gang,

test_sheet_new.xlsx

The attached file is a valid spreadsheet (can open in Excel) but stylesheet.xml and worksheet.xml, rather than being in /xl/, are in / and consequently, the sheet cannot be opened.

To test, run parse on the file. Error message indicates stylesheet.xml cannot be found.

laurielounge commented 6 years ago

More info - my first thought seems to have been wrong. Error looks like:

no subfile named /stylesheet.xml at ../perl5/Spreadsheet/ParseXLSX.pm line 967

Version $Spreadsheet::ParseXLSX::VERSION = '0.27';

ParseExcel our $VERSION = '0.65';

after unzipping the xlsx file into its own directory find . -type f looks like:

./workbook.xml
./_rels/.rels
./_rels/workbook.xml.rels
./[Content_Types].xml
./stylesheet.xml
./xl/worksheets/data.xml

cat _rels/.rels

<?xml version="1.0" encoding="utf-8" standalone="yes"?>

  <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="workbook.xml" Id="rw_1"/></Relationships>

cat _rels/workbook.xml.rels

<?xml version="1.0" encoding="utf-8" standalone="yes"?>

  <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="/xl/worksheets/data.xml" Id="r_1"/>
    <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="stylesheet.xml" Id="r_2"/></Relationships>
laurielounge commented 6 years ago

Just for emphasis:

cat _rels/workbook.xml.rels

<?xml version="1.0" encoding="utf-8" standalone="yes"?>

  <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
    <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="/xl/worksheets/data.xml" Id="r_1"/>
    <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="stylesheet.xml" Id="r_2"/></Relationships>

and cat workbook.xml

<?xml version="1.0" encoding="utf-8" standalone="yes"?>

  <workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
    <sheets>
      <sheet name="Data" sheetId="1" r:id="r_1">
        <bookViews>
          <workbookView/></bookViews></sheet></sheets></workbook>

Looking at _rels/workbook.xml.rels, there is no actual Target="stylesheet.xml", at least, not in that folder.

The only other reference to stylesheet: grep -r stylesheet * comes from [Content_type].xml: <Override PartName="/stylesheet.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.styles+xml"/>

laurielounge commented 6 years ago

Further...

If I edit _rels/workbook.xml.rels, and change <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="stylesheet.xml" Id="r_2"/></Relationships>

to <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="/stylesheet.xml" Id="r_2"/></Relationships>

The file parses.I'm guessing here, but I suspect when Excel opens the file, it has a harder look for stylesheet.xml than ParseXLSX is doing. Another possibilty is that it just throws up it's hand and says open the file anyway, without the stylesheet info. Will experiment. Looking at the info in stylesheet, and the way Excel displays the file, I suspect it's the former (stylesheet seems to have been applied).

laurielounge commented 6 years ago

Further still...

If we define "valid spreadsheet file" as one successfully read by Excel, then this is a valid spreadsheet file.

Once again find . -type f reveals

./workbook.xml
./_rels/.rels
./_rels/workbook.xml.rels
./[Content_Types].xml
./stylesheet.xml
./xl/worksheets/data.xml

grep -r Target * reveals

_rels/.rels:    <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="workbook.xml" Id="rw_1"/></Relationships>
_rels/workbook.xml.rels:    <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="/xl/worksheets/data.xml" Id="r_1"/>
_rels/workbook.xml.rels:    <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="stylesheet.xml" Id="r_2"/></Relationships>

We see that, in _rels/, we have a workbook.xml.rels file. _rels/.rels defines a target of workbook.xml (no ".rels"). _rels/workbook.xml.rels defines a "stylesheet.xml". Not "/stylesheet.xml"

I'm not currently sure which target is looked for first, but one assumes it's 'stlyesheet.xml', as that's where it's erroring. One assumes that fixing the location of stylesheet (which I will do manually do next) will cause the error to become "not found workbook.xml".

laurielounge commented 6 years ago

Right. I can confirm that if I manually edit _rels/workbook.xml.rels and adjust <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="stylesheet.xml" Id="r_2"/></Relationships> to read <Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="/stylesheet.xml" Id="r_2"/></Relationships>

ParseXLSX can parse the sheet. I can print the value in $ws->get_cell(0,0) correctly.

Excel can still successfully read the file (it is still a "valid workbook")

I see a possible approaches here. Pre-check to see if all of the targets exist and are in the appropriate places, adjust if necessary. Is the xml read once and then passed from routine to routine in ParseXLSX? If so, it shouldn't be too hard to see if the stylesheet is where the xml says it is.

I'll continue playing with it.