Closed cfpvl closed 1 year ago
Do you have a link to the data set and cleaning procedures? I'll try to reproduce the problem.
Sure, this is a work in progress but it has all my steps and links (except for the CSV creation): https://nonvalet.com/posts/20220711_ml_with_clisp/
Are you using CLISP or SBCL?
Actually, if you've managed to save the data frame to a lisp file, attach that here and I'll work with it to see why the saving to CSV isn't working.
Finally, I'd recommend installing Lisp-Stat via Quicklisp rather than manually from GitHub. The only reason you need to install manually right now is that Quicklisp hasn't been updated.
Here's the data frame lisp file. Really appreciate the help :) ELT_train.lisp.zip
Thanks for that. There are a couple of things I haven't seen before in this data-frame. First, there are uninterned symbols for the variable names, e.g. #:SURVIVED. It's unusual to have uninterned symbols anywhere in Lisp, and I'd love to know how you got them in there. I'm also seeing NIL for the type; the check when reading isn't allowing the import because that's not a valid type.
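For readers who have not met uninterned symbols before, here is a minimal sketch in plain Common Lisp (no Lisp-Stat needed) of how they differ from ordinary interned symbols. The variable names `*interned*` and `*uninterned*` are made up for illustration. It does not explain how such symbols ended up in the saved data frame.

```lisp
;; Plain Common Lisp; the variable names are illustrative only.
(defparameter *interned* (intern "SURVIVED" :cl-user))  ; has a home package
(defparameter *uninterned* (make-symbol "SURVIVED"))    ; prints as #:SURVIVED

(symbol-package *interned*)    ; the COMMON-LISP-USER package
(symbol-package *uninterned*)  ; NIL: no home package

;; Same name, different identity, so an EQ-based key lookup will not match.
(string= (symbol-name *interned*) (symbol-name *uninterned*)) ; true
(eq *interned* *uninterned*)                                  ; false

;; Two uninterned symbols with the same name are also distinct objects.
(eq (make-symbol "SURVIVED") (make-symbol "SURVIVED"))        ; false
```

Because symbols are compared by identity, not by name, a key stored as #:SURVIVED cannot be found by typing survived at the REPL.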
Can you post the sequence of commands you use for your analysis? Perhaps put them in a project on GitHub so I can review and help out.
BTW, an easier way to load the data set is to pass read-csv a URL, like this:
(defdf titantic (read-csv "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"))
#<DATA-FRAME (891 observations of 12 variables)>
Now you can start an analysis:
LS-USER> (heuristicate-types titantic)
NIL
LS-USER> (head titantic)
;; PASSENGERID SURVIVED PCLASS NAME SEX AGE SIBSP PARCH TICKET FARE CABIN EMBARKED
;; 0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NA S
;; 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
;; 2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NA S
;; 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
;; 4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NA S
;; 5 6 0 3 Moran, Mr. James male NA 0 0 330877 8.4583 NA Q
NIL
LS-USER> (describe titantic)
TITANTIC
A data-frame with 891 observations of 12 variables
Variable | Type | Unit | Label
-------- | ---- | ---- | -----------
PASSENGERID | INTEGER | NIL | NIL
SURVIVED | BIT | NIL | NIL
PCLASS | INTEGER | NIL | NIL
NAME | STRING | NIL | NIL
SEX | STRING | NIL | NIL
AGE | DOUBLE-FLOAT | NIL | NIL
SIBSP | INTEGER | NIL | NIL
PARCH | INTEGER | NIL | NIL
TICKET | INTEGER | NIL | NIL
FARE | DOUBLE-FLOAT | NIL | NIL
CABIN | SYMBOL | NIL | NIL
EMBARKED | SYMBOL | NIL | NIL
; No value
Thank you for the tip.
Here are my steps to reproduce.
(ql:quickload :lisp-stat)
(in-package :ls-user)
;; Create raw data frame
(defdf *train* (read-csv #p"train.csv"))
(describe *train*)
;; add "Relatives" column
(add-column! *train* 'Relatives
(map-rows *train* '(*train*:sibsp *train*:parch)
#'(lambda (s p) (+ s p))))
;; Add "Group" column
(add-column! *train* 'Group
(map-rows *train* '(*train*:age)
#'(lambda (a) (cond
((EQL a :NA) (setf a "NK"))
((< a 10) (setf a "Child"))
((< a 18) (setf a "Young"))
((< a 60) (setf a "Adult"))
((>= a 60) (setf a "Senior"))))))
;; remove columns. Can't assign to the same variable *train*:
;; #<SIMPLE-ERROR "~S package exists and cannot use existing package for data frame name"
(defdf *train1* (remove-columns *train*
'(*train*:name *train*:passengerid *train*:cabin *train*:ticket *train*:fare *train*:embarked *train*:sibsp *train*:parch *train*:age)))
(describe *train1*)
;; copy over the prepared data frame to use the same name
(undef '*train*)
(defdf *train* *train1*)
(undef '*train1*)
;; set column types
(heuristicate-types *train*)
(describe *train*)
;; FAIL - #<TYPE-ERROR expected-type: STRING datum: NIL>
;; Missing value for unit?
;; (summary *train*)
;; set unit and label
(set-properties *train* :unit '(:survived "1/0" :pclass "1-3" :sex "M/F" :relatives ">=0" :Group "Age group"))
(set-properties *train* :label '(:survived "1 = Survived" :pclass "Classes 1 (best) to 3 (worse)" :sex "Male or female" :relatives "How many relatives (siblings, parents, children)" :Group "Child < 10, Young < 18, Adult < 60, Senior >= 60"))
(describe *train*)
(summary *train*)
;; Save as lisp
(lisp-stat:save '*train* #P"ELT_train.lisp")
;; FAIL - Save as CSV
(write-csv *train*
#P"test_ELT_train.csv"
:add-first-row t)
;; #<TYPE-ERROR expected-type: BIT datum: 3>
This code doesn't run for me past the loading.
LS-USER> (defdf titantic (read-csv "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"))
#<DATA-FRAME (891 observations of 12 variables)>
The first error I encounter is this:
(add-column! titanic 'Relatives
(map-rows titanic '(titanic:sibsp titanic:parch)
#'(lambda (s p) (+ s p))))
; Evaluation aborted on #<KEY-NOT-FOUND {100314FF63}>.
The keys (variables) in a data frame are in the LS-USER package, and you refer to them by just their name, e.g. sibsp (or ls-user:sibsp if ls-user is not the current package, which should only rarely be the case). When you use titantic:sibsp, you're referring to the value of the column. This is so you can perform mathematical operations on variables (columns) without having to manually extract via (column titantic 'sibsp) or select.
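The distinction between a key and a symbol macro can be sketched in plain Common Lisp. Everything here is a hypothetical stand-in: `*frame*` is a hash table, not a Lisp-Stat data frame, `lookup` is not Lisp-Stat's column function, and the two symbols differ by name here, where in the thread they differ by package.

```lisp
;; Plain Common Lisp sketch; *FRAME* and LOOKUP are hypothetical stand-ins.
(defparameter *frame* (make-hash-table :test #'eq))
(setf (gethash 'sibsp *frame*) #(1 1 0)
      (gethash 'parch *frame*) #(0 0 0))

(defun lookup (key)
  "Return the column stored under KEY, or signal an error if KEY is unknown."
  (multiple-value-bind (column found) (gethash key *frame*)
    (if found
        column
        (error "Key ~S not found" key))))

;; A symbol macro: evaluating SIBSP-VALUES expands to (LOOKUP 'SIBSP).
(define-symbol-macro sibsp-values (lookup 'sibsp))

sibsp-values      ; evaluated, so it expands and yields the column
(lookup 'sibsp)   ; the key itself is the plain symbol SIBSP
'(sibsp-values)   ; quoted, so no expansion: a list holding a symbol
;; (lookup 'sibsp-values) signals an error, because that symbol is not a key.
```

In this sketch the symbol macro is only useful where the form is evaluated. Inside a quoted list it stays a bare symbol, and that symbol is not one of the keys.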
LS-USER> (describe 'sibsp)
LS-USER:SIBSP
[symbol]
; No value
LS-USER> (describe 'titantic:sibsp)
TITANTIC:SIBSP
[symbol]
SIBSP names a symbol macro:
Expansion: (COLUMN #<DATA-FRAME (891 observations of 13 variables)> 'SIBSP)
Symbol-plist:
:TYPE -> :INTEGER
; No value
LS-USER> titantic:sibsp
#(1 1 0 1 0 0 0 3 0 1 1 0 0 1 0 0 4 0 1 0 0 0 0 0 3 1 0 3 0 0 0 1 0 0 1 1 0 0 2 1 1 1 0 1 0 0 1 0 2 1 4 0 1 1 0 0 0 0 1 5 0 0 1 3 0 1 0 0 4 2 0 5 0 1 0 0 0 0 0 0 0 0 0 0 0 3 1 0 3 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 2 0 0 0 0 1 0 1 0 1 0 0 0 1 0 4 2 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 1 0 2 0 0 0 1 0 0 0 0 0 0 0 8 0 0 0 0 4 0 0 1 0 0 0 4 1 0 0 1 3 0 0 0 8 0 4 2 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 8 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0 3 1 0 0 4 0 0 1 0 0 0 1 1 0 0 0 2 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 4 1 0 0 0 4 1 0 0 0 0 0 0 0 1 0 0 4 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 2 0 0 0 1 0 1 1 0 0 2 1 0 1 0 1 0 0 1 0 0 0 1 8 0 0 0 1 0 2 0 0 2 1 0 1 0 0 0 1 3 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 3 1 0 0 0 0 0 0 0 1 0 0 5 0 0 0 1 0 2 1 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 3 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 1 1 2 2 1 0 1 0 1 0 0 0 0 0 2 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 1 0 0 0 1 1 0 0 5 0 0 0 1 3 1 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 2 1 0 1 0 0 0 0 0 0 0 0 4 4 1 1 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 2 0 0 0 0 0 2 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 1 2 0 0 0 0 1 0 0 1 0 1 0 1 0 0 1 1 1 2 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 3 0 0 1 0 1 0 0 3 0 2 1 0 0 0 0 0 0 0 0 0 2 0 1 0 0 2 0 0 0 1 2 0 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 5 1 1 4 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 3 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 2 1 0 1 1 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 4 1 0 0 0 8 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 4 0 0 0 1 0 3 1 0 0 0 4 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 8 0 0 1 4 0 1 0 1 0 1 0 0 0 2 1 0 8 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0)
Have you gone through the tutorial or getting started guide? I ask because if you have, and things like this aren't clear I consider it a documentation bug. Other languages don't have first-class symbols and a package system, and if you're working with Lisp-Stat it will be much easier if you have a knowledge of Lisp basics. Some resources that might help:
- Data frame user manual
- Common Lisp resources
- Lisp-Stat community (Stack Overflow, email list, Reddit)
I'm keen to understand your experiences as a new user. Often, when we've been working with a language/system for a while, we forget what it was like to be a newbie, so your experiences are valuable and will help us improve or create documentation to make the onboarding process smoother. So I encourage you to open documentation issues with reproducible examples so we can address them.
Indeed, I am a newbie to Common Lisp and I have followed the tutorial (and this book - https://gigamonkeys.com/book/) to write my code. I really appreciate you taking the time and effort, and for providing a detailed explanation. Thank you :)
I'm having a hard time understanding why the code has not worked for you (as it is working for me)... could it be a typo between creating the data frame and adding the column? (titantic vs. titanic)
I'm running it again and it appears to be ok. Here's the full output
LS-USER> (defdf x (read-csv "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"))
X
LS-USER> x
#<DATA-FRAME X (891 observations of 12 variables)>
LS-USER> (add-column! x 'Relatives
(map-rows x '(x:sibsp x:parch)
#'(lambda (s p) (+ s p))))
#<DATA-FRAME X (891 observations of 13 variables)>
I'll make sure to review the documentation you've referenced.
I made an error in transcribing. I had the misspelling initially, and copied from that.
I do wonder how you're getting this to work though, I can't run that code successfully:
LS-USER> (add-column! x 'Relatives
(map-rows x '(x:sibsp x:parch)
#'(lambda (s p) (+ s p))))
throws me into the debugger:
Key SIBSP not found, valid keys are #(PASSENGERID SURVIVED
PCLASS NAME SEX AGE SIBSP
PARCH TICKET FARE CABIN
EMBARKED).
[Condition of type KEY-NOT-FOUND]
Restarts:
0: [RETRY] Retry SLIME REPL evaluation request.
1: [*ABORT] Return to SLIME's top level.
2: [ABORT] abort thread (#<THREAD "repl-thread" RUNNING {100BAF9E63}>)
Backtrace:
0: (DATA-FRAME::KEY-INDEX #<DATA-FRAME::ORDERED-KEYS PASSENGERID, SURVIVED, PCLASS, NAME, SEX, AGE, SIBSP, PARCH, TICKET, FARE, CABIN, EMBARKED> X:SIBSP)
1: (COLUMN #<DATA-FRAME (891 observations of 12 variables)> X:SIBSP)
You can see in the backtrace, at [1], the call to (column x x:sibsp) is failing because the key can't be found. But, if it's working for you that's great.
BTW, the idiom I would use for this transformation is: (add-column! x 'relatives (e+ x:sibsp x:parch)). In this pattern you should use PACKAGE:VARIABLE because you're adding the values in the columns. I'm surprised PACKAGE:VARIABLE works for you in map-rows; it's not supposed to.
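To illustrate the whole-column idiom without Lisp-Stat, here is a minimal sketch in plain Common Lisp. It assumes e+ behaves as an element-wise addition of two equal-length vectors. The vectors `*sibsp*` and `*parch*` are made-up sample data, not values from the data set.

```lisp
;; Plain Common Lisp; *SIBSP* and *PARCH* are made-up sample columns.
(defparameter *sibsp* #(1 1 0 1))
(defparameter *parch* #(0 0 0 2))

;; Whole-column formulation: add the two vectors element by element.
(defparameter *relatives* (map 'vector #'+ *sibsp* *parch*))

;; Row-by-row formulation: visit each row and add its two values.
(defparameter *relatives-by-row*
  (coerce (loop for s across *sibsp*
                for p across *parch*
                collect (+ s p))
          'vector))

;; Both formulations produce the same column.
(equalp *relatives* *relatives-by-row*)
```

The whole-column form needs the column values, which is why the PACKAGE:VARIABLE spelling fits there, while a row-wise mapping function is given the column keys.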
@cfpvl How are you going on this? Do you require any further assistance?
I still can't export the CSV file... and I could not figure out why the code was not working for @Symbolics. I'm a newbie to Common Lisp and I appreciate your time and patience.
Perhaps you should start with the code I provided in the examples. Does that work for you?
@cfpvl, have you got what you need? I'd like to close this issue if everything is working.
@snunez1 thanks for asking :) To be honest, I've been slowly learning more about Common Lisp, but I'm afraid I have not had enough time to put real effort into this. You can go ahead and close this issue, and if I still run into errors when I come back to it, I'll open it again.
I'm learning CL and I was trying to play with the Titanic dataset to generate predictions using CL. I'm writing a tutorial as it helps me with the learning process.
However, using this fantastic package lisp-stat, I've encountered a few problems. One of them is that when done cleaning my dataframe, I could not export it as a CSV:
with the error
#<TYPE-ERROR expected-type: BIT datum: 3>
Running Ubuntu 20.04 (KDE Neon), sbcl-2:2.0.1-3.amd64, Emacs 28.1 (as a Snap package), slime 20220712.817.
Please let me know if you need any additional information and how I can be of help.