Fix sheet names - Githubissues

slyrus commented 4 years ago

So it turns out that sheet names are more complicated than the previous code assumed. The sheetId in workbook/sheets/sheet is a string that encodes a number. Previous code used that number as the index into the sheets. Really, one has to get the r:id attribute from the sheet then use the xl/_rels/workbook.xml.rels which maps the relationship Id to a Target sheet. Ugh. Anyway, this fixes my sheet name problems. Would be good to test this on a variety of XLSX files to make sure that the namespaces (etc...) all play nice together.

gwangjinkim commented 4 years ago

Thank you so much! True - it has to be better tested - hopefully people use it like you. And sorry, I was and am still on vacation and couldn't time yet to look carefully. But I trust your diligent work! xlsx specifications are quite complex. I don't know how Python or R exactly did it. Maybe one has to dig into the packages there and try to translate the stuff to common lisp. R's "openxlsx" package is quite good in dealing with xlsx files. Or a Python package one would need.

slyrus commented 4 years ago

thanks for merging! Yes, the specs are complex but the infrastructure (reading the XML streams, tools for working with XML documents, etc...) are all here so it feels like it shouldn't be all that much work to have something on par with readxl or openxlsx (not sure about the python excel libraries).

I hope you're OK with my style choice of using xpath queries fxml (and fxml.stp, etc...) as I find that, compared to, e.g., klacks 1) I can write the code more quickly, 2) the end product code is more readable and 3) the code is more maintainable. There may be some efficiency hits to this approach, but I'd rather be able to read a 1 MB excel file well than poorly (but quickly) parse a 1 GB excel file. It might be nice to open an issue or wiki page with features we'd all like to see. Off the top of my head, reading a specific range and better support for merged cells (which readxl doesn't support well, last I checked).

gwangjinkim commented 4 years ago

Thank YOU for contribution and improving the package! I will document that in package description what you changed.

For me, it is interesting to see how you use xpath queries fxml.

However, my choice of klacks was exactly to be able to parse 1 GB or more excel files, because in Bioinformatics, we have not rarely several GBs big files. But I was anyway thinking of adding a performance-prioritizing read process. Maybe we should add to read-xlsx a performance prioritizing parameter which switches between your and the klacks-dependent solution.

And yes, you are right, we should communicate the features we need. (do you mean one issue for all desired features or for each desired features one issue?). And true reading a specific range and a support for merged cells would be indeed cool. Do you know how one could realize that?

If you plan to tinker more around with this package, tell me. I will make you as official co-author of this package since you put in a lot of time.

slyrus commented 4 years ago

Ugh... multi GB files should not be excel files :) but that's just my opinion. But, yes, I can see how a streaming interface might be useful but it's just that there are so many things like the shared strings, formats, multiple sheets, etc... that I'd like to see handled properly first in a "read it all into memory at once" model. But, yeah, I can see why you might want a decent streaming interface. Not sure klacks is it though :)

I have some ideas on the merged cells thing but haven't gotten around to it yet.

In the meantime, I've made another package excel-range which allows for things like:

(excel-range:excel-data *my-excel-sheet* "A4:D34")

check it out if you have time.

a1b10 / cl-xlsx

Fix sheet names #8