cognoma / machine-learning

Machine learning for Project Cognoma

Cognoma ml - RandomForestClassifier - For review/advice #36

Closed mans2singh closed 8 years ago

mans2singh commented 8 years ago

Hey Folks:

I have taken a stab at implementing the random forest classification algorithm. I've commented out some of the SGD-related coefficient code. Please let me know if there is any advice/recommendation to improve it.

Thanks

dhimmel commented 8 years ago

Nice work. Can you export your notebook to a .py file for commenting? You can do this by running the following command from the algorithms directory:

```
jupyter nbconvert --to=script --FilesWriter.build_directory=scripts RandomForestClassifier-mans2singh.ipynb
```

mans2singh commented 8 years ago

@dhimmel - I've converted the file with the command you provided. I think there is still room to tune params. Please let me know your thoughts/recommendations. Thanks

yl565 commented 8 years ago

Nice work. I like the graphs showing how AUROC increases with tree depth. I'm curious how the RF performs when more features are included. The current n_feature_kept = 500 in 1.TCGA-MLexample.ipynb is for fast demonstration, not for optimal classification. I think a lot of interesting things may happen if you increase it to 5000 or even more.
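
For illustration, a minimal sketch of what raising that parameter might look like, assuming features are ranked by median absolute deviation (the actual selection criterion in 1.TCGA-MLexample.ipynb may differ; the data below is a synthetic stand-in):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the TCGA expression matrix (samples x genes)
X = pd.DataFrame(np.random.rand(100, 20000),
                 columns=[f'gene_{i}' for i in range(20000)])

def select_top_features(X, n_feature_kept=5000):
    """Keep the n_feature_kept columns with the highest median
    absolute deviation (an assumed criterion; the notebook's
    actual selection method may differ)."""
    mad = (X - X.median()).abs().median()
    top_genes = mad.sort_values(ascending=False).index[:n_feature_kept]
    return X.loc[:, top_genes]

X_kept = select_top_features(X, n_feature_kept=5000)
print(X_kept.shape)   # (100, 5000)
```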

mans2singh commented 8 years ago

@yl565 - Thanks for your recommendation. I've increased the number of samples and also the estimators, and it does look like performance has improved. Please let me know if there is anything else required. Thanks for your help.

mans2singh commented 8 years ago

@gwaygenomics - I've updated the code based on your comments (added class_weight and used clf.feature_importances_ for the coefficients). Please let me know if you have any other suggestions.

Thanks.
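
A hedged sketch of those two changes (the synthetic data and names below stand in for the notebook's own training split):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the notebook's training data
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train = pd.DataFrame(X, columns=[f'gene_{i}' for i in range(20)])

clf = RandomForestClassifier(
    n_estimators=100,          # illustrative tree count
    class_weight='balanced',   # reweight classes inversely to frequency
    random_state=0,
)
clf.fit(X_train, y)

# feature_importances_ plays the role the SGD coefficients did
importances = pd.Series(clf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head())
```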

dhimmel commented 8 years ago

Nice analysis @mans2singh; just the few comments I made above.

mans2singh commented 8 years ago

@dhimmel - I've removed the negative value cells. However, I am not sure how to use seaborn's distplot. If you have any pointers, please let me know.

Thanks

dhimmel commented 8 years ago

Did you see the documentation for seaborn.distplot?
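
For reference, a minimal sketch of distplot usage (distplot was seaborn's histogram/KDE function at the time; the data here is a placeholder standing in for the feature importances):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Placeholder data standing in for the feature importances
importances = np.random.gamma(shape=0.5, scale=0.02, size=500)

sns.distplot(importances)          # histogram plus KDE overlay
plt.xlabel('feature importance')
plt.show()
```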

mans2singh commented 8 years ago

@dhimmel - I will check it out. Thanks

yl565 commented 8 years ago

Try also grid searching min_samples_split. You could compare your performance with the AdaBoost results: https://github.com/cognoma/machine-learning/pull/40#issuecomment-243418404
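
A sketch of adding min_samples_split to the grid search (parameter values and data here are illustrative, not the notebook's exact setup):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook would use its own X, y
X, y = make_classification(n_samples=500, n_features=50, random_state=0)

param_grid = {
    'max_depth': [1, 2, 3, 4],
    'min_samples_split': [2, 5, 10],   # illustrative values
}
cv = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    scoring='roc_auc',   # matches the AUROC metric discussed above
    cv=3,
)
cv.fit(X, y)
print(cv.best_params_, cv.best_score_)
```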

mans2singh commented 8 years ago

Hey folks - I am getting the following exception when using seaborn plots. This used to work earlier, and I've done some Google searching but have not been able to resolve the issue. If anyone has any suggestions, please let me know.

[cc @dhimmel , @yl565 ] Thanks

```
<homedirectory>anaconda/envs/cognoma-machine-learning/lib/python3.5/site-packages/matplotlib/backends/backend_agg.py in print_png(self, filename_or_obj, *args, **kwargs)
    535             close = False
    536         try:
--> 537             _png.write_png(renderer._renderer, filename_or_obj, self.figure.dpi)
    538         finally:
    539             if close:

RuntimeError: libpng signaled error

<matplotlib.figure.Figure at 0x1c6185be0>
```

dhimmel commented 8 years ago

Error seems to occur on this line (although for a slightly earlier version of the matplotlib codebase).

Not sure what's going on. Are you using the cognoma-machine-learning environment? Maybe try reinstalling libpng within the environment by running `conda install --force libpng`.

Also @mans2singh, note how I updated your previous post to add triple backticks to put the error message in a code block.

mans2singh commented 8 years ago

@dhimmel - I will try your recommendation. Thanks.

mans2singh commented 8 years ago

@dhimmel - Your recommendation to reinstall libpng worked. I've also added the distplot, though the distribution is highly skewed. @yl565 - I've also added the min_samples_split param to the CV grid and increased the features to 5000.

Please let me know if you have any other recommendations.

Thanks for your feedback and help.

dhimmel commented 8 years ago

Nice, looks good to me. I'll let @yl565 do the merging (I recommend squashing) in case he has any additional comments.

yl565 commented 8 years ago

Overall this looks good. As expected, greater tree depth gives better results, since random forests usually benefit from large trees with low bias. You should include None (representing no limit) as a value of max_depth in the grid search; otherwise min_samples_split over the range (2, 5) will not produce different results. We have thousands of samples, so try something like min_samples_split: [2, 10, 50, 250], as in the sketch below.
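
Concretely, the suggested grid might look like this (a sketch with the values from the comment above):

```python
# Include None so min_samples_split can actually constrain tree growth
param_grid = {
    'max_depth': [1, 2, 3, 4, None],
    'min_samples_split': [2, 10, 50, 250],
}
```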

mans2singh commented 8 years ago

@yl565 - I've updated the notebook based on your comments. Please let me know if you have further suggestions.

yl565 commented 8 years ago

Use max_depth = [1, 2, 3, 4, None]. I would expect None to give better results, since it allows fully grown trees. You may use a very large number like 1000 instead of None to make the results easier to visualize.
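
For example, a sketch of the visualization-friendly substitution:

```python
# A large finite depth behaves like None (fully grown trees)
# but keeps the max_depth axis numeric for plotting.
max_depth_grid = [1, 2, 3, 4, 1000]
```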

mans2singh commented 8 years ago

@yl565 - That was my oversight. I've changed max_depth to 1, 10, 100, 1000 and corrected the axis labels. Please let me know if you see any other issues. Thanks

yl565 commented 8 years ago

Good work. Merged.