a1b10 / cl-xlsx

📜 Read XLSX files with Common Lisp

weird data rearrangement bug #4

Open slyrus opened 4 years ago

slyrus commented 4 years ago

test2.xlsx

When I try to read the attached xlsx file, I get:

(first
  (cl-xlsx:read-xlsx (merge-pathnames "test2.xlsx" *rat-tox-data-path*)))

("Sheet1" ("Hello:" " ") ("Rat" "Bye"))

I expected to get ("Hello:" "Rat") ("Bye" "Monkey").

slyrus commented 4 years ago

Hmm... maybe my bogus fix for issue #2 was causing this to break. Will close for now.

slyrus commented 4 years ago

No, I take it back. It still happens.

gwangjinkim commented 4 years ago

Hm .. I'll have a look. Probably because "Hello:" is bold ... I wrote it really only for simplest tables - without formatting.

slyrus commented 4 years ago

Ah, Ok. Yes, reading styled text (without the style information) would be nice, as would reading merged cells, but that's a whole 'nother issue.

gwangjinkim commented 4 years ago

It is indeed strange. Sorry, really. I thought my package was better than that ...

gwangjinkim commented 4 years ago

It is midnight here, so I will go to sleep. Hm, yes, reading styled text - that would mean one needs more sophisticated parsing of attributes.

The thing is, if one wants to write a real xlsx reader - equivalent to openxlsx in R or Python's Excel reader packages - that would require really understanding the specifications for xlsx, which would be many hours of work. Maybe we could get the determination of which fields to read to work.

But formatting would be a lot of work - just to understand the xlsx specifications ... Or are you deeper into xml and xlsx stuff?

I'm unfortunately quite a beginner in this area.

gwangjinkim commented 4 years ago

I reformatted - added more slides. But indeed the bug persists. It has nothing to do with formatting, it seems.

I have seen similar problems with Nano's 'xlsx' package. There, if a field came back empty (" "), it was because the text piece could not be found in the xml file which lists all unique text strings.

Hm ... yes, it must have to do with the parsing of the xml file inside the xlsx (zip file). There seems to be a shift in the position of the unique strings ...

and this shift must happen after the first word ...

indeed

(cl-xlsx::get-unique-strings "~/test2.xlsx") returns:

("Hello:" " " "Rat" "Bye" "Monkey")

The empty space in the second position caused the error.

And this empty space probably appeared because of the bold lettering (formatting) of the first cell ...

Indeed, if you remove the bold lettering, this " " in the second position of the unique strings disappears, and everything is read correctly.
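One way this shift could arise, sketched under assumptions: suppose the bold cell is stored as a single shared-string entry made of two text runs ("Hello:" in bold, then a plain " "), and the reader collects every run as its own unique string. The nested lists below are a hypothetical stand-in for the parsed xml, and `flatten-runs`, `join-runs` and `resolve-rows` are illustrative names, not functions from cl-xlsx. The attached file was not inspected, so the run layout is a guess that happens to reproduce the reported output.

```lisp
;; Hypothetical stand-in for the parsed shared-strings xml: each element is
;; one shared-string entry, given as the list of its text runs.
(defparameter *shared-string-items*
  '(("Hello:" " ")   ; assumed rich-text entry: a bold run plus a plain run
    ("Rat")
    ("Bye")
    ("Monkey")))

;; Assumed sheet contents: each cell holds an index into the string table.
(defparameter *cell-indices* '((0 1) (2 3)))

(defun flatten-runs (items)
  "Collect every text run as its own string (reproduces the shifted table)."
  (loop for runs in items append runs))

(defun join-runs (items)
  "Build one string per entry by concatenating all of its runs."
  (mapcar (lambda (runs) (apply #'concatenate 'string runs)) items))

(defun resolve-rows (rows strings)
  "Replace each shared-string index in ROWS by the string it points to."
  (mapcar (lambda (row)
            (mapcar (lambda (index) (nth index strings)) row))
          rows))
```

With these definitions, `(resolve-rows *cell-indices* (flatten-runs *shared-string-items*))` gives `(("Hello:" " ") ("Rat" "Bye"))`, the same shape as the reported result, while using `join-runs` instead gives `(("Hello: " "Rat") ("Bye" "Monkey"))`. Keeping one string per entry keeps the indices aligned, at the cost of a trailing space in the first cell under this assumed layout.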

gwangjinkim commented 4 years ago

My ideal use case was using this package for reading in csv-like data that was just put into excel for better human readability - without formats (bold lettering), and only simple tables (like those in csv files).

The problem with excel files and workbooks is that the xlsx specifications are so extensive ... Just reading and understanding the specifications and keeping an overview would easily be more than a week of work, I think. Implementing them would take weeks or months - at least with my lisp skills ...

gwangjinkim commented 1 week ago

I started to translate a Racket package (simmone/racket-simple-xlsx) into Common Lisp. There is still a long way to go, but I think it is totally doable. I will keep you updated.