dilshod / xlsx2csv

Convert xslx to csv, it is fast, and works for huge xlsx files
MIT License
1.67k stars 301 forks source link

Small patch for missing sheet ;) #101

Open mabramowicz opened 8 years ago

mabramowicz commented 8 years ago

Hi!

--- DDD_xlsx2csv.py 2016-06-13 17:01:01.676067849 +0200 +++ orig_xlsx2csv.py 2016-06-13 17:01:01.721067828 +0200 @@ -185,8 +185,7 @@ raise OutFileAlreadyExistsException("File " + str(outfile) + " already exists!") for s in self.workbook.sheets: sheetname = s['name']

Email was sent. Now some info:

XLSX file

Content of xl/workbook.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Content of xl/_rels/workbook.xml.rels

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Problem with oryginal script is how to get ID of sheet, for sheet name. Example of use:

./orig_xlsx2csv.py -a -d ";" GENEROWANY.xlsx /tmp/XXXXX Sheet 15 not found

Which means that 'id' 15 goes to "def _convert(self, sheetid, outfile):" where name is created like that: "sheetfile = self._filehandle("xl/worksheets/sheet%i.xml" % sheetid)"

What kind of ID is prepared? Please refer to Class Workbook, there is lines:

What I know this is like, appName = XL, then lenght of value > 2, so it will always passed to: id = int(attrs["r:id"].value[3:]) I do not know what to think about it, but this check will always go to one direction to made id = int(attrs["r:id"].value[3:])

        if self.appName == 'xl' and len(attrs["r:id"].value) > 2:
            if 'r:id' in attrs: id = int(attrs["r:id"].value[3:])
            else: id = int(attrs['sheetId'].value)
        else:
            if 'sheetId' in attrs: id = int(attrs["sheetId"].value)
            else: id = int(attrs['r:id'].value[3:])
        self.sheets.append({'name': name, 'id': id})

This part of script can be used only when DIGITS on sheetId=XX, rIdXX and Target="worksheets/sheetXX.xml are same. I mean sheetId=8 -> r:id=rId8 -> Target="worksheets/sheet8.xml" only then it works like exepcted.

When XLSX was generated by PHPExcel libs situation is different then expected. Because sheetId can has other NUMBER then r:id or sheet name.

It looks like name of xml file should be taken by rId from file "xl/_rels/workbook.xml.rels"

When XLSX file is fine (accepted by Openoffice or Excel2013): ./DDD_xlsx2csv.py -a -d ":" GENEROWANY1.xlsx /tmp/adasd rId: rId1 , XML: styles.xml rId: rId2 , XML: theme/theme1.xml rId: rId3 , XML: sharedStrings.xml rId: rId4 , XML: worksheets/sheet1.xml rId: rId5 , XML: worksheets/sheet2.xml rId: rId6 , XML: worksheets/sheet3.xml rId: rId7 , XML: worksheets/sheet4.xml rId: rId8 , XML: worksheets/sheet5.xml rId: rId9 , XML: worksheets/sheet6.xml rId: rId10 , XML: worksheets/sheet7.xml rId: rId11 , XML: worksheets/sheet8.xml rId: rId12 , XML: worksheets/sheet9.xml rId: rId13 , XML: worksheets/sheet10.xml rId: rId14 , XML: worksheets/sheet11.xml rId: rId15 , XML: worksheets/sheet12.xml rId: rId16 , XML: worksheets/sheet13.xml rId: rId17 , XML: worksheets/sheet14.xml Sheet 15 not found

Which means value 15 was passed to: def _convert(self, sheetid, outfile): Then it failed because: sheetfile = self._filehandle("xl/worksheets/sheet%i.xml" % sheetid) -> sheetfile = self._filehandle("xl/worksheets/sheet15.xml") but in my example file it is:

and there is no "xl/worksheets/sheet15.xml" in XLSX arch (zip file ;) )

So even it could works somehow it will match wrong sheet name to his content ...

Anyway, with my changes: 1) file generated by PHPExcel You can see that there id on LEFT side is the same digit in sheet%{X}.xml name ...

./DDD_xlsx2csv.py -a -d ":" GENEROWANY1.xlsx /tmp/adasd sheetid: 4 , name of xml: xl/worksheets/sheet1.xml sheetid: 5 , name of xml: xl/worksheets/sheet2.xml sheetid: 6 , name of xml: xl/worksheets/sheet3.xml sheetid: 7 , name of xml: xl/worksheets/sheet4.xml sheetid: 8 , name of xml: xl/worksheets/sheet5.xml sheetid: 9 , name of xml: xl/worksheets/sheet6.xml sheetid: 10 , name of xml: xl/worksheets/sheet7.xml sheetid: 11 , name of xml: xl/worksheets/sheet8.xml sheetid: 12 , name of xml: xl/worksheets/sheet9.xml sheetid: 13 , name of xml: xl/worksheets/sheet10.xml sheetid: 14 , name of xml: xl/worksheets/sheet11.xml sheetid: 15 , name of xml: xl/worksheets/sheet12.xml sheetid: 16 , name of xml: xl/worksheets/sheet13.xml sheetid: 17 , name of xml: xl/worksheets/sheet14.xml

2) XLSM file from Excel: ./DDD_xlsx2csv.py -a -d ":" SAK.xlsm /tmp/adasd sheetid: 1 , name of xml: xl/worksheets/sheet1.xml sheetid: 2 , name of xml: xl/worksheets/sheet2.xml sheetid: 3 , name of xml: xl/worksheets/sheet3.xml sheetid: 4 , name of xml: xl/worksheets/sheet4.xml sheetid: 5 , name of xml: xl/worksheets/sheet5.xml sheetid: 6 , name of xml: xl/worksheets/sheet6.xml sheetid: 7 , name of xml: xl/worksheets/sheet7.xml sheetid: 8 , name of xml: xl/worksheets/sheet8.xml sheetid: 9 , name of xml: xl/worksheets/sheet9.xml sheetid: 10 , name of xml: xl/worksheets/sheet10.xml sheetid: 11 , name of xml: xl/worksheets/sheet11.xml sheetid: 12 , name of xml: xl/worksheets/sheet12.xml sheetid: 13 , name of xml: xl/worksheets/sheet13.xml

3) file generated by JAVA software from some application (is not xl style) ./DDD_xlsx2csv.py -a -d ":" gsw_CMDB_1465822227.xlsx /tmp/adasd sheetid: 21 , name of xml: xl/worksheets/sheet21.xml sheetid: 51 , name of xml: xl/worksheets/sheet51.xml

4) generated file from Excel 2013 ./DDD_xlsx2csv.py -a -d ";" Book1.xlsx /tmp/Book1 sheetid: 1 , name of xml: xl/worksheets/sheet1.xml sheetid: 2 , name of xml: xl/worksheets/sheet2.xml sheetid: 3 , name of xml: xl/worksheets/sheet3.xml sheetid: 4 , name of xml: xl/worksheets/sheet4.xml sheetid: 5 , name of xml: xl/worksheets/sheet5.xml

5) file generated from phpexcel ./DDD_xlsx2csv.py -a -d ";" phpexcel.xlsx /tmp/php self.workbook.appName: xl sheetid: 4 , name of xml: xl/worksheets/sheet1.xml self.workbook.appName: xl sheetid: 5 , name of xml: xl/worksheets/sheet2.xml

Last check ... ./DDD_xlsx2csv.py -a -d ";" Book1.xlsx /tmp/php SheetName: Test1 , sheetId: 1 , relationID/r:id: rId1 , short rId: 1 SheetName: What3 , sheetId: 2 , relationID/r:id: rId2 , short rId: 2 SheetName: Noo123 , sheetId: 3 , relationID/r:id: rId3 , short rId: 3 SheetName: Rename21 , sheetId: 4 , relationID/r:id: rId4 , short rId: 4 SheetName: Names01 , sheetId: 5 , relationID/r:id: rId5 , short rId: 5 id passed to _convert: 1 , xml file: xl/worksheets/sheet1.xml id passed to _convert: 2 , xml file: xl/worksheets/sheet2.xml id passed to _convert: 3 , xml file: xl/worksheets/sheet3.xml id passed to _convert: 4 , xml file: xl/worksheets/sheet4.xml id passed to _convert: 5 , xml file: xl/worksheets/sheet5.xml [mabramowicz@oc8656482817 DIR.8]$ ./DDD_xlsx2csv.py -a -d ";" phpexcel.xlsx /tmp/php SheetName: Worksheet , sheetId: 1 , relationID/r:id: rId4 , short rId: 4 SheetName: Worksheet 1 , sheetId: 2 , relationID/r:id: rId5 , short rId: 5 id passed to _convert: 4 , xml file: xl/worksheets/sheet1.xml id passed to _convert: 5 , xml file: xl/worksheets/sheet2.xml

It looks like it works like I want.

Prepare dictionary with xml name and rId value, then use it...

I know, I am not pro in python, but it works ...

Maybe you can check yours point of view???

mabramowicz commented 8 years ago

Hm ...

No replay, that is bad ...