dhoerl / DHlibxls

Framework to read Excel xls spreadsheets
271 stars 86 forks source link

continued project cannot read xls files #19

Open bengalo opened 9 years ago

bengalo commented 9 years ago

You are probably aware that in this project you are using your libxls and the follow-up project by Jan is using xlslib which cannot read xls files at all, right? While there is no option to read/parse files in Jan's project (yet?), I'd advise to update the README once again to reflect that. Might save someone (like me) an hour or two ;)

dhoerl commented 9 years ago

Sorry for the confusion - I tried to answer too fast. For sure I will fix this problem - give me a few days.

DLGill commented 9 years ago

I can generate xlxs files instead of xls files if JXLS will read those.
So which project should I use to read either XLS or XLXS files on Machintosh OSx? Is it the DHxlsreader, or JXLS?

nPowerLogo.gif

IwLogo.gif Kind Regards,

David Gill President

IntegrityWare, Inc.(http://www.IntegrityWare.com) nPower Software (http://www.nPowerSoftware.com/) DGILL@nPowerSoftware.com mailto:DGILL@nPowerSoftware.com Phone: 858-592-8866 Fax: 858-592-8844

On 1/10/2015 9:48 AM, bengalo wrote:

You are probably aware that in this project you are using your libxls and the follow-up project by Jan is using xlslib which cannot read xls files at all, right? While there is no option to read/parse files in Jan's project (yet?), I'd advise to update the README once again to reflect that. Might save someone (like me) an hour or two ;)

— Reply to this email directly or view it on GitHub https://github.com/dhoerl/DHlibxls/issues/19.

dhoerl commented 9 years ago

The code can only read .xls binary files - the xlsx are XML format and totally different. Very sorry for the confusion earlier - I had a DHxlslib project earlier, and stopped supporting it due to Jan taking it over (renaming it and adding more features). DHlibxls to read on Mac/ios, and JXLS to write them.

DLGill commented 9 years ago

Hi David,

So, was I correct to use the DHxlsReader, or should I be using the JXLS? I took the DHxlsReader out of my project over the weekend, but I hadn't quite got the JXLS hooked up yet.

By the way, I found that everything else in the DHxlsReader is very high quality. It is well written code, and works very well (except the one problem I found).

nPowerLogo.gif

IwLogo.gif Kind Regards,

David Gill President

IntegrityWare, Inc.(http://www.IntegrityWare.com) nPower Software (http://www.nPowerSoftware.com/) DGILL@nPowerSoftware.com mailto:DGILL@nPowerSoftware.com Phone: 858-592-8866 Fax: 858-592-8844

On 1/10/2015 3:09 PM, David Hoerl wrote:

Sorry for the confusion - I tried to answer too fast. For sure I will fix this problem - give me a few days.

— Reply to this email directly or view it on GitHub https://github.com/dhoerl/DHlibxls/issues/19#issuecomment-69476024.

dhoerl commented 9 years ago

To add to the confusion is that I named the project 'DHlibxls', but the actual framework is 'DHxlsReader'. You only need JXLS if you want to WRITE Excel files. BTW, did you send me the problematic .xls file yet?

DLGill commented 9 years ago

Hi David,

Here is the Excel file. So I will put DHxlsReader back into my project. By the way, I will continue to try to debug the problem and send you a fix if I find it. I haven't gotten to that point yet (but I think I was pretty close). It must have something to do with being an empty cell inside of a table. The 3B cell in the first empty cell, and the one that it errors out on.

By the way, I really appreciate your code. You have done a very nice job with it.

nPowerLogo.gif

IwLogo.gif Kind Regards,

David Gill President

IntegrityWare, Inc.(http://www.IntegrityWare.com) nPower Software (http://www.nPowerSoftware.com/) DGILL@nPowerSoftware.com mailto:DGILL@nPowerSoftware.com Phone: 858-592-8866 Fax: 858-592-8844

On 1/12/2015 12:09 PM, David Hoerl wrote:

To add to the confusion is that I named the project 'DHlibxls', but the actual framework is 'DHxlsReader'. You only need JXLS if you want to WRITE Excel files. BTW, did you send me the problematic .xls file yet?

— Reply to this email directly or view it on GitHub https://github.com/dhoerl/DHlibxls/issues/19#issuecomment-69636575.

DLGill commented 9 years ago

Hello David,

No worries.

Isn't the .xlxs format just the newer Excel format? I can produce either, but my version of Excel defaults to the .xlxs format. DHxlsReader kicks them out as non Excel files, but they are just newer format.

nPowerLogo.gif

IwLogo.gif Kind Regards,

David Gill President

IntegrityWare, Inc.(http://www.IntegrityWare.com) nPower Software (http://www.nPowerSoftware.com/) DGILL@nPowerSoftware.com mailto:DGILL@nPowerSoftware.com Phone: 858-592-8866 Fax: 858-592-8844

On 1/12/2015 11:47 AM, David Hoerl wrote:

The code can only read .xls binary files - the xlsx are XML format and totally different. Very sorry for the confusion earlier - I had a DHxlslib project earlier, and stopped supporting it due to Jan taking it over (renaming it and adding more features). DHlibxls to read on Mac/ios, and JXLS to write them.

— Reply to this email directly or view it on GitHub https://github.com/dhoerl/DHlibxls/issues/19#issuecomment-69632863.

DLGill commented 9 years ago

Hey Dave,

I just wondered if you have been able to track down the problem with tables.

David Gill

ghost commented 9 years ago

Unless you can convince me this is a legitimate un-protected file, which all evidence indicates to the contrary, I don't plan to waste my time on it. Numbers refuses to open it, and also saved variants with an additional number added in the column you indicated. I just tried to open it with google documents, and only a few things show up.

This file has serious issues. If you want to compensate me for the time needed to delve into the belly of libxls and all the microsoft oddities, I can do it - but I am not going to waste the few hours of time I can devote to open source on this.

One of the projects contributors has a production app that reads and imports files using the library. From time to time files show up that cannot be opened - they were created in a buggy point release of Excel or some other program. Excel itself has knowledge of all past oddities, is supported by hundreds of paid programmers, and has a compelling reason to open any file.

DLGill commented 9 years ago

Can you tell me which protection error it is generating (0x86, 0x63, or 0x2F), and where that error is returned? I will try to track it down. I don't understand how the file can be protected because I saved the file by hand (converting it from a .xlsx file) from within Excel and none of the protection flags are on (it's not read-only, there is no password set, protection is turned off, etc.). Perhaps I have a buggy version of Excel, but that seems odd too as it is a standard, off the shelf purchased version.

DLGill commented 9 years ago

It appears from looking at xls_ParseWorkBook (in xls.c) that all of the protection flags are skipped and not processed. Is this correct?

ghost commented 9 years ago

If you 'down' converted it from an xlsx file, well who knows what Excel does. Unfortunately at this time I only have an old 2002ish version of Excel that only runs on a PowerPC based Mac. My experience has been that when I see protection flags, it means Excel has done something to the data that makes it difficult for libxls to decode. libxls does nothing with the protection flags, but its an indication that the file has been 'protected' in some form by Excel. In fact, we've considered just aborting when we see those flags, to avoid long discussions such as this one.

The way libxls gets data is to decode the file first - to extract out all the tables and such - then as it iterates through all sheets and cells, it uses associations as indicated by the particular data cell to link to the data. In some cases with protection, there is no data - Excel itself takes blocks of 'protected data', and decodes it, then puts it in the 'common' accessible areas to be used. This is possibly one reason for the issue here. Another reason for the missing data may be that in down converting xlsx files, it uses some sophisticated technique that libxls is not aware of.

What I would do to track this down, and something you can do your self, is to turn debugging way up - 10 or higher, then dump the file. It will be easier if you use libxls directly - I have an Xcode Mac project that I use for this, but it could be anything even command line.

You set the debug flag to 10, then dump the file. You can then start adding printf's all over the place in the specific areas you are trying to see, and find out for the particular cell that has no data, where the library is trying to get it from. You can then dump that whole table (linked list, whatever). You can then find the code that creates that table and add printfs there. Use the OpenOffice Excel Format Specification (google it), it helps a lot. You will see that open office also passed over all the protection schemes and objects - that is all are marked as TBD.

Early encryption was very simple and easy to circumvent - its possible other .xls readers have implemented decoding (we passed on doing so years ago, as we didn't want to start down that path).

This library has been around for years, and has wide adoption. There have been several real bugs that appeared, but the frequency has dropped over time. I have actually had success using Numbers to both create .xls files and read them when others like you submit them - to see what they look like before trying to process them with libxls. That neither Numbers nor Google Sheets can effectively read this file is a huge red flag.

If you do find a bug in libxls, or something simple that needs correction so your file can be read, then I'd for sure add it to the source. I just am not in a position currently to burn a day tracking this down.