Lisp-Stat / lisp-stat

Lisp-Stat main system
https://lisp-stat.github.io/lisp-stat
Microsoft Public License

Exporting data-frame to CSV #15

Closed cfpvl closed 1 year ago

cfpvl commented 2 years ago

I'm learning CL and I was trying to play with the Titanic dataset to generate predictions using CL. I'm writing a tutorial as it helps me with the learning process.

However, using this fantastic package lisp-stat, I've encountered a few problems. One of them is that when done cleaning my dataframe, I could not export it as a CSV:

(write-csv *train*
           #P"test_ELT_train.csv"
           :add-first-row t)

with the error #<TYPE-ERROR expected-type: BIT datum: 3>
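For readers unfamiliar with the condition, here is a minimal sketch in plain Common Lisp of what a TYPE-ERROR with expected type BIT and datum 3 means. The function name `describe-bit-check` is hypothetical, and the sketch only illustrates the condition's fields; it does not identify which column or which call inside write-csv signalled the error.

```lisp
;; Plain Common Lisp, no Lisp-Stat required.
;; DESCRIBE-BIT-CHECK is a hypothetical helper written for this illustration.
(defun describe-bit-check (value)
  "Return :OK if VALUE is of type BIT (0 or 1); otherwise return a list
of the TYPE-ERROR's datum and expected type."
  (handler-case (progn (check-type value bit) :ok)
    (type-error (e)
      (list (type-error-datum e) (type-error-expected-type e)))))

(describe-bit-check 1)   ; 1 is a valid BIT
(describe-bit-check 3)   ; 3 is not, so the error's datum is 3 and its expected type is BIT
```

In other words, some value 3 reached a place that only accepts 0 or 1.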

Running Ubuntu 20.04 (KDE Neon), sbcl-2:2.0.1-3.amd64, Emacs 28.1 (as a Snap package), slime 20220712.817.

Please let me know if you need any additional information and how I can be of help.

Symbolics commented 2 years ago

Do you have a link to the data set and cleaning procedures? I'll try to reproduce the problem.

cfpvl commented 2 years ago

Sure, this is a work in progress but it has all my steps and links (except for the CSV creation): https://nonvalet.com/posts/20220711_ml_with_clisp/

Symbolics commented 2 years ago

Are you using CLISP or SBCL?

Actually, if you've managed to save the data frame to a lisp file, attach that here and I'll work with it to see why the saving to CSV isn't working.

Finally, I'd recommend installing Lisp-Stat via Quicklisp rather than manually from GitHub. The only reason to install manually right now is that Quicklisp hasn't been updated.

cfpvl commented 2 years ago

Here's the data frame lisp file. Really appreciate the help :) ELT_train.lisp.zip

Symbolics commented 2 years ago

Thanks for that. There are a couple of things I haven't seen before in this data-frame. First, there are uninterned symbols for the variable names, e.g. #:SURVIVED. It's unusual to have uninterned symbols anywhere in lisp, and I'd love to know how you got them in there. I'm also seeing NIL for the type; the check when reading isn't allowing the import because that's not a valid type.
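As background on the uninterned symbols mentioned above, the following is a small sketch in plain Common Lisp (no Lisp-Stat calls). The variable names `*interned*` and `*uninterned*` are made up for illustration. It shows why a symbol such as #:SURVIVED is a different object from an interned SURVIVED even though the names match.

```lisp
;; Plain Common Lisp; *INTERNED* and *UNINTERNED* are illustrative names.
(defparameter *interned* (intern "SURVIVED" :cl-user))  ; has a home package
(defparameter *uninterned* (make-symbol "SURVIVED"))    ; has no home package

(symbol-package *interned*)     ; the CL-USER package object
(symbol-package *uninterned*)   ; NIL, which is why it prints as #:SURVIVED
                                ; under the default printer settings

;; Same name, different identity: anything that compares keys with EQ
;; or EQL treats these as two unrelated symbols.
(string= (symbol-name *interned*) (symbol-name *uninterned*))  ; true
(eq *interned* *uninterned*)                                   ; false

;; Two separately made uninterned symbols are never the same object either.
(eq (make-symbol "SURVIVED") (make-symbol "SURVIVED"))         ; false
```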

Can you post the sequence of commands you use for your analysis? Perhaps put them in a project on GitHub so I can review and help out.

Symbolics commented 2 years ago

BTW, an easier way to load the data set is to pass read-csv a URL, like this:

(defdf titantic (read-csv "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"))
#<DATA-FRAME (891 observations of 12 variables)>

Now you can start an analysis:

LS-USER> (heuristicate-types titantic)
NIL
LS-USER> (head titantic)

;;   PASSENGERID SURVIVED PCLASS NAME                                                SEX    AGE SIBSP PARCH           TICKET    FARE CABIN EMBARKED
;; 0           1        0      3 Braund, Mr. Owen Harris                             male    22     1     0 A/5 21171  7.2500    NA S
;; 1           2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0 PC 17599 71.2833   C85 C
;; 2           3        1      3 Heikkinen, Miss. Laina                              female  26     0     0 STON/O2. 3101282  7.9250    NA S
;; 3           4        1      1 Futrelle, Mrs. Jacques Heath (Lily May Peel)        female  35     1     0           113803 53.1000  C123 S
;; 4           5        0      3 Allen, Mr. William Henry                            male    35     0     0           373450  8.0500    NA S
;; 5           6        0      3 Moran, Mr. James                                    male    NA     0     0           330877  8.4583    NA Q
NIL
LS-USER> (describe titantic)
TITANTIC
  A data-frame with 891 observations of 12 variables

Variable    | Type         | Unit | Label      
--------    | ----         | ---- | -----------
PASSENGERID | INTEGER      | NIL  | NIL        
SURVIVED    | BIT          | NIL  | NIL        
PCLASS      | INTEGER      | NIL  | NIL        
NAME        | STRING       | NIL  | NIL        
SEX         | STRING       | NIL  | NIL        
AGE         | DOUBLE-FLOAT | NIL  | NIL        
SIBSP       | INTEGER      | NIL  | NIL        
PARCH       | INTEGER      | NIL  | NIL        
TICKET      | INTEGER      | NIL  | NIL        
FARE        | DOUBLE-FLOAT | NIL  | NIL        
CABIN       | SYMBOL       | NIL  | NIL        
EMBARKED    | SYMBOL       | NIL  | NIL        
; No value

cfpvl commented 2 years ago

Thank you for the tip.

Here are my steps to reproduce.

(ql:quickload :lisp-stat)
(in-package :ls-user)

;; Create raw data frame
(defdf *train* (read-csv #p"train.csv"))
(describe *train*)

;; add "Relatives" column
(add-column! *train* 'Relatives
             (map-rows *train* '(*train*:sibsp *train*:parch)
                       #'(lambda (s p) (+ s p))))

;; Add "Group" column
(add-column! *train* 'Group
             (map-rows *train* '(*train*:age)
                       #'(lambda (a) (cond
                                       ((EQL a :NA) (setf a "NK"))
                                       ((< a 10) (setf a "Child"))
                                       ((< a 18) (setf a "Young"))
                                       ((< a 60) (setf a "Adult"))
                                       ((>= a 60) (setf a "Senior"))))))

;; remove columns. Can't assign to the same variable *train*:
;;  #<SIMPLE-ERROR "~S package exists and cannot use existing package for data frame name"
(defdf *train1* (remove-columns *train*
                                '(*train*:name *train*:passengerid *train*:cabin *train*:ticket *train*:fare *train*:embarked *train*:sibsp *train*:parch *train*:age)))
(describe *train1*)

;; copy over the prepared data frame to use the same name
(undef '*train*)
(defdf *train* *train1*)
(undef '*train1*)

;; set column types
(heuristicate-types *train*)
(describe *train*)

;; FAIL -  #<TYPE-ERROR expected-type: STRING datum: NIL>
;; Missing value for unit?
;; (summary *train*)

;; set unit and label
(set-properties *train* :unit '(:survived "1/0" :pclass "1-3" :sex "M/F" :relatives ">=0" :Group "Age group"))
(set-properties *train* :label '(:survived "1 = Survived" :pclass "Classes 1 (best) to 3 (worse)" :sex "Male or female" :relatives "How many relatives (siblings, parents, children)" :Group "Child < 10, Young < 18, Adult < 60, Senior >= 60"))
(describe *train*)
(summary *train*)

;; Save as lisp
(lisp-stat:save '*train* #P"ELT_train.lisp")

;; FAIL - Save as CSV
(write-csv *train*
           #P"test_ELT_train.csv"
           :add-first-row t)
;; #<TYPE-ERROR expected-type: BIT datum: 3>

Symbolics commented 2 years ago

This code doesn't run for me past the loading.

LS-USER> (defdf titantic (read-csv "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"))
#<DATA-FRAME (891 observations of 12 variables)>

The first error I encounter is this:

(add-column! titanic 'Relatives
             (map-rows titanic '(titanic:sibsp titanic:parch)
                       #'(lambda (s p) (+ s p))))
; Evaluation aborted on #<KEY-NOT-FOUND {100314FF63}>.

The keys (variables) in a data frame are in the LS-USER package, and you refer to them by just their name, e.g. sibsp (or ls-user:sibsp if ls-user is not the current package, which should only rarely be the case). When you use titantic:sibsp, you're referring to the value of the column. This is so you can perform mathematical operations on variables (columns) without having to manually extract them via (column titantic 'sibsp) or select.

LS-USER> (describe 'sibsp)
LS-USER:SIBSP
  [symbol]
; No value
LS-USER> (describe 'titantic:sibsp)
TITANTIC:SIBSP
  [symbol]

SIBSP names a symbol macro:
  Expansion: (COLUMN #<DATA-FRAME (891 observations of 13 variables)> 'SIBSP)

Symbol-plist:
  :TYPE -> :INTEGER
; No value
LS-USER> titantic:sibsp
#(1 1 0 1 0 0 0 3 0 1 1 0 0 1 0 0 4 0 1 0 0 0 0 0 3 1 0 3 0 0 0 1 0 0 1 1 0 0 2 1 1 1 0 1 0 0 1 0 2 1 4 0 1 1 0 0 0 0 1 5 0 0 1 3 0 1 0 0 4 2 0 5 0 1 0 0 0 0 0 0 0 0 0 0 0 3 1 0 3 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 2 0 0 0 0 1 0 1 0 1 0 0 0 1 0 4 2 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 1 0 2 0 0 0 1 0 0 0 0 0 0 0 8 0 0 0 0 4 0 0 1 0 0 0 4 1 0 0 1 3 0 0 0 8 0 4 2 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 8 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 3 1 0 0 4 0 0 1 0 0 0 1 1 0 0 0 2 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 4 1 0 0 0 4 1 0 0 0 0 0 0 0 1 0 0 4 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 2 0 0 0 1 0 1 1 0 0 2 1 0 1 0 1 0 0 1 0 0 0 1 8 0 0 0 1 0 2 0 0 2 1 0 1 0 0 0 1 3 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 3 1 0 0 0 0 0 0 0 1 0 0 5 0 0 0 1 0 2 1 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 3 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 1 1 2 2 1 0 1 0 1 0 0 0 0 0 2 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 1 0 0 0 1 1 0 0 5 0 0 0 1 3 1 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 2 1 0 1 0 0 0 0 0 0 0 0 4 4 1 1 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 2 0 0 0 0 1 0 0 1 0 1 0 1 0 0 1 1 1 2 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 3 0 0 1 0 1 0 0 3 0 2 1 0 0 0 0 0 0 0 0 0 2 0 1 0 0 2 0 0 0 1 2 0 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 5 1 1 4 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 3 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 2 1 0 1 1 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 4 1 0 0 0 8 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 4 0 0 0 1 0 3 1 0 0 0 4 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 8 0 0 1 4 0 1 0 1 0 1 0 0 0 2 1 0 8 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0)
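The distinction shown in the DESCRIBE output above (a key symbol versus a symbol macro that expands to a column lookup) can be sketched in plain Common Lisp. Everything here is a hypothetical stand-in: the `toy-frame` package, `*columns*`, and `toy-column` are invented for illustration and are not Lisp-Stat's implementation. The forms are meant to be evaluated one at a time, for example at the REPL or with LOAD, so that the package exists before `toy-frame:sibsp` is read.

```lisp
;; Hypothetical stand-ins, not Lisp-Stat's implementation.
(defpackage :toy-frame
  (:use)
  (:export #:sibsp))

;; Column data keyed by a symbol interned in the current package.
(defparameter *columns* (list (cons 'sibsp #(1 1 0 1 0))))

(defun toy-column (key)
  "Look up a column by KEY; return NIL when no such key exists."
  (cdr (assoc key *columns*)))

;; TOY-FRAME:SIBSP expands to a lookup, so evaluating it yields the data.
(define-symbol-macro toy-frame:sibsp (toy-column 'sibsp))

toy-frame:sibsp                 ; the column vector
(toy-column 'sibsp)             ; the same vector, found by its key
(toy-column 'toy-frame:sibsp)   ; NIL: a different symbol, so not a key
```

The last form mirrors the situation in this thread: a quoted PACKAGE:VARIABLE symbol is not the key under which the column is stored, so a lookup by that symbol finds nothing.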

Have you gone through the tutorial or getting started guide? I ask because if you have, and things like this still aren't clear, I consider it a documentation bug. Most other languages don't have first-class symbols and a package system, so working with Lisp-Stat will be much easier if you know the Lisp basics. Some resources that might help:

- Data frame user manual
- Common Lisp resources
- Lisp-Stat community (stackoverflow, email list, reddit)

I'm keen to understand your experiences as a new user. Often, when we've been working with a language or system for a while, we forget what it was like to be a newbie, so your experiences are valuable and will help us improve or create documentation to make the onboarding process smoother. I encourage you to open documentation issues with reproducible bugs so we can address them.

cfpvl commented 2 years ago

Indeed, I am a newbie to Common Lisp and I have followed the tutorial (and this book - https://gigamonkeys.com/book/) to write my code. I really appreciate you taking the time and effort, and for providing a detailed explanation. Thank you :)

I'm having a hard time understanding why the code has not worked for you (as it is working for me)... could it be a typo between when you created the data frame and when you added the column? (titantic vs. titanic)

I'm running it again and it appears to be OK. Here's the full output:

LS-USER> (defdf x (read-csv "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"))
X
LS-USER> x
#<DATA-FRAME X (891 observations of 12 variables)>
LS-USER> (add-column! x 'Relatives
             (map-rows x '(x:sibsp x:parch)
                       #'(lambda (s p) (+ s p))))
#<DATA-FRAME X (891 observations of 13 variables)>

I'll make sure to review the documentation you've referenced.

Symbolics commented 2 years ago

I made an error in transcribing. I had the misspelling initially, and copied from that.

I do wonder how you're getting this to work, though. I can't run that code successfully:

LS-USER> (add-column! x 'Relatives
             (map-rows x '(x:sibsp x:parch)
                       #'(lambda (s p) (+ s p))))

throws me into the debugger:

Key SIBSP not found, valid keys are #(PASSENGERID SURVIVED
                                      PCLASS NAME SEX AGE SIBSP
                                      PARCH TICKET FARE CABIN
                                      EMBARKED).
   [Condition of type KEY-NOT-FOUND]

Restarts:
 0: [RETRY] Retry SLIME REPL evaluation request.
 1: [*ABORT] Return to SLIME's top level.
 2: [ABORT] abort thread (#<THREAD "repl-thread" RUNNING {100BAF9E63}>)

Backtrace:
  0: (DATA-FRAME::KEY-INDEX #<DATA-FRAME::ORDERED-KEYS PASSENGERID, SURVIVED, PCLASS, NAME, SEX, AGE, SIBSP, PARCH, TICKET, FARE, CABIN, EMBARKED> X:SIBSP)
  1: (COLUMN #<DATA-FRAME (891 observations of 12 variables)> X:SIBSP)

You can see in the backtrace, at [1], the call to (column x x:sibsp) is failing because the key can't be found. But, if it's working for you that's great.

BTW, the idiom I would use for this transformation is: (add-column! x 'relatives (e+ x:sibsp x:parch)). In this pattern you should use PACKAGE:VARIABLE because you're adding the values in the columns. I'm surprised PACKAGE:VARIABLE works for you in map-rows; it's not supposed to.
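To illustrate what elementwise addition of two columns produces, here is a sketch in plain Common Lisp. It does not call e+ or any Lisp-Stat function; `add-columns` is a hypothetical helper, and the sample values are the SIBSP and PARCH entries of the first five rows shown in the head output earlier in this thread.

```lisp
;; Plain Common Lisp stand-in for elementwise addition of two columns.
;; ADD-COLUMNS is a hypothetical helper, not part of Lisp-Stat.
(defun add-columns (a b)
  "Return a new vector whose i-th element is the sum of A[i] and B[i]."
  (map 'vector #'+ a b))

(defparameter *sibsp* #(1 1 0 1 0))  ; first five SIBSP values
(defparameter *parch* #(0 0 0 0 0))  ; first five PARCH values

(add-columns *sibsp* *parch*)        ; one sum per row
```

The point of the idiom is that whole columns are combined at once, with no per-row lambda.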

snunez1 commented 2 years ago

@cfpvl How are you going on this? Do you require any further assistance?

cfpvl commented 2 years ago

I still can't export the CSV file... and I could not figure out why the code was not working for @Symbolics. I'm a newbie to Common Lisp and I appreciate your time and patience.

snunez1 commented 2 years ago

Perhaps you should start with the code I provided in the examples. Does that work for you?

snunez1 commented 1 year ago

@cfpvl, have you got what you need? I'd like to close this issue if everything is working.

cfpvl commented 1 year ago

@snunez1 thanks for asking :) To be honest, I've been slowly learning more about Common Lisp, but I'm afraid I have not had enough time to put some real effort into this. You can go ahead and close this issue, and if I'm still getting errors when I go back to it, I'll open it again.