ataki / hospitalfinder

Use ML to predict best hospitals for patients

Naive Bayes Feature Selection Issues #2

Open ataki opened 10 years ago

ataki commented 10 years ago

Many cross-validation runs are returning 0.0, which suggests that feature selection often can't distinguish between candidate feature sets.

I tested with only 500 examples and 50 features; I'm not sure how long the entire dataset and full feature set would take, but I want to debug before scaling up.

Question: Is it normal for forward search CV to return somewhat different sets of features (i.e. 45-50% of the features change between runs of forward search)?
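For reference, a minimal sketch of the kind of loop being described (hypothetical names, modern scikit-learn API, not the repository's actual code): greedy forward search in which each candidate feature set is scored with a random 70/30 hold-out. Because the split is re-drawn on every call, the feature chosen at each greedy step can change from run to run, which is one plausible source of the 45-50% churn described above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

def holdout_error(X, y, features):
    """Train on a random 70% of the rows, return test error on the other 30%."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, features], y, test_size=0.3)   # no fixed seed -> a new split every call
    clf = GaussianNB().fit(X_train, y_train)
    return 1.0 - clf.score(X_test, y_test)

def forward_search(X, y, max_features=10):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        # Greedily add the single feature that gives the lowest hold-out error.
        errors = {f: holdout_error(X, y, selected + [f]) for f in remaining}
        best = min(errors, key=errors.get)
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X, y = rng.randn(500, 50), rng.randint(0, 2, size=500)  # stand-in for the 500 x 50 test set
    print(forward_search(X, y, max_features=5))
```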

scottcheng commented 10 years ago

I guess if the data remains the same, and your learning algorithm and CV algorithm are deterministic, then forward search should return the same set of results each time.

ataki commented 10 years ago

Well, CV isn't deterministic - it randomly selects 70% of the data for training and tests on the other 30%, so we can get different training data each run.
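If the goal is reproducible debugging, the split itself can be seeded. A small sketch (stand-in data; scikit-learn's train_test_split is assumed rather than the project's own CV code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y = rng.randn(500, 50), rng.randint(0, 2, size=500)  # dummy 500 x 50 data

# Fixing random_state pins the 70/30 split, so repeated runs of forward
# search evaluate candidate features on identical training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```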

scottcheng commented 10 years ago

Oh you are right, simple CV and k-fold CV are nondeterministic... My bad.

scottcheng commented 10 years ago

Also, did you see my question on your CV code here: https://github.com/jimzheng/hospitalfinder/commit/1107d658b26b5568313596387d48b1a5e016d94d#commitcomment-4572662 ?

petousis commented 10 years ago

Hi Guys,

My understanding is that forward and backward search, as applied to feature selection, are greedy search algorithms and so are not guaranteed to find the optimal set of features. Hence, I think it is normal that they do not give the same answer every run.
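A toy illustration of the greedy limitation (hypothetical data, not from this project; a decision tree is used because Naive Bayes's independence assumption cannot model this particular interaction): with an XOR-style target, neither feature helps on its own, so a forward pass that adds one feature at a time sees no gain from either, even though the pair together is perfectly predictive.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
a = rng.randint(0, 2, 1000)
b = rng.randint(0, 2, 1000)
X = np.column_stack([a, b]).astype(float)
y = a ^ b                                  # target depends on the pair jointly

for cols in ([0], [1], [0, 1]):
    acc = cross_val_score(DecisionTreeClassifier(), X[:, cols], y, cv=5).mean()
    print(cols, round(acc, 2))             # roughly 0.5, 0.5, 1.0
```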


scottcheng commented 10 years ago

I agree they are greedy, but being greedy doesn't necessarily mean they involve randomness, does it?


petousis commented 10 years ago

Yes, I agree. Depending on the sequence in which you feed the algorithm data, you might get a different answer.


ataki commented 10 years ago

Yep, that's what I'm finding out. Anyway, thanks @scottcheng for catching an error in my code. I misunderstood the description in Andrew's class notes - when doing CV we should always select for the minimum error (minError). Fixing now.

Good thing I named my variables descriptively :P
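The actual diff isn't shown in this thread; as a sketch of the rule being described (hypothetical names), each selection step should keep the candidate whose cross-validation error is smallest:

```python
# Hypothetical names: cv_error maps candidate feature index -> cross-validation error.
cv_error = {3: 0.21, 7: 0.18, 12: 0.25}

# The rule described above: keep the candidate with the *minimum* CV error.
best_feature = min(cv_error, key=cv_error.get)
print(best_feature, cv_error[best_feature])   # 7 0.18
```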

ataki commented 10 years ago

OK, I'm getting better, more consistent results now. There's still variability in which features are selected, but several features are showing up consistently across runs.

One last step before I can move on is to fine-tune the performance of NB; I'm wary of running this on all 30k rows, because cross-validation might take 15-30 minutes.
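One rough way to estimate the cost before committing to the full dataset (stand-in data, scikit-learn assumed, not the repository's code): time a single cross-validation pass at increasing subsample sizes, and run the folds in parallel.

```python
import time
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X, y = rng.randn(30000, 50), rng.randint(0, 2, size=30000)  # stand-in for the 30k-row dataset

for n in (500, 5000, 30000):
    start = time.time()
    # n_jobs=-1 runs the folds in parallel across CPU cores.
    cross_val_score(GaussianNB(), X[:n], y[:n], cv=5, n_jobs=-1)
    print(n, round(time.time() - start, 2), "seconds")
```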