Closed: mans2singh closed this 8 years ago
Nice work. Can you export your notebook to a .py file for commenting? You can do this by running the following command from the `algorithms` directory:
`jupyter nbconvert --to=script --FilesWriter.build_directory=scripts RandomForestClassifier-mans2singh.ipynb`
@dhimmel - I've converted the file with the command you provided. I think there is still room to tune params. Please let me know your thoughts/recommendations. Thanks
Nice work. I like the graphs showing how AUROC increased with increasing depth. I'm curious how the RF performs when more features are included. The current `n_feature_kept = 500` in `1.TCGA-MLexample.ipynb` is for fast demonstration, not for optimal classification. I think a lot of interesting things may happen if you increase it to 5000 or even more.
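For context, keeping the top-k features usually means ranking them by some dispersion statistic and slicing. The sketch below uses median absolute deviation (MAD); the helper name `keep_top_features` and the MAD criterion are illustrative assumptions here, and the actual selection logic in `1.TCGA-MLexample.ipynb` may differ.

```python
import numpy as np
import pandas as pd

def keep_top_features(X, n_features_kept=5000):
    # Hypothetical helper: rank columns by median absolute deviation
    # and keep the n_features_kept most variable ones. The notebook's
    # real criterion may differ.
    mad = (X - X.median()).abs().median()
    top = mad.sort_values(ascending=False).index[:n_features_kept]
    return X[top]

# Toy expression matrix: 20 samples x 10 genes
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randn(20, 10),
                 columns=['gene_{}'.format(i) for i in range(10)])
X_kept = keep_top_features(X, n_features_kept=5)
```

Raising `n_features_kept` mainly costs fit time; random forests tolerate many weakly informative features reasonably well.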
@yl565 - Thanks for your recommendation. I've increased the number of samples and also estimators. It does look like it has improved the performance. Please let me know if there is anything else required. Thanks for your help.
@gwaygenomics - I've updated the code based on your comments (added class_weight and used clf.feature_importances_ for the coefficients). Please let me know if you have any other suggestions.
Thanks.
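The two changes discussed above can be sketched as follows; the toy data and parameter values are placeholders, not the PR's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Imbalanced toy data standing in for the mutation labels
X, y = make_classification(n_samples=200, n_features=20,
                           weights=[0.8, 0.2], random_state=0)

# class_weight='balanced' reweights classes inversely to their frequency,
# which helps when positives are rare.
clf = RandomForestClassifier(n_estimators=50, class_weight='balanced',
                             random_state=0)
clf.fit(X, y)

# A random forest has no coef_ attribute like SGDClassifier; use
# feature_importances_ instead (one nonnegative value per feature,
# normalized to sum to 1).
importances = clf.feature_importances_
```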
Nice analysis @mans2singh, just had the few comments I made above.
@dhimmel - I've removed the negative value cells. However, I am not sure how to use seaborn's distplot. If you have any pointers, please let me know.
Thanks
Did you see the documentation for `seaborn.distplot`?
@dhimmel - I will check it out. Thanks
Try also grid searching min_samples_split. You could compare your performance with the AdaBoost PR: https://github.com/cognoma/machine-learning/pull/40#issuecomment-243418404
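A minimal sketch of grid searching `min_samples_split` alongside `max_depth` with scikit-learn's `GridSearchCV`; the toy dataset, estimator count, and grid values are illustrative assumptions, not the PR's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the TCGA feature matrix and labels
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative search space; the thread later refines these values.
param_grid = {
    'max_depth': [1, 2, 3, 4],
    'min_samples_split': [2, 10, 50],
}

# AUROC is the metric the thread's graphs report.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid, scoring='roc_auc', cv=3)
search.fit(X, y)
```

`search.best_params_` then holds the winning combination, and `search.cv_results_` exposes the per-setting scores behind plots like the AUROC-vs-depth graphs mentioned above.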
Hey folks - I am getting the following exception when using seaborn plots. This used to work earlier, and I've done some Google searching but have not been able to resolve the issue. If anyone has any suggestions, please let me know.
[cc @dhimmel , @yl565 ] Thanks
```
<homedirectory>anaconda/envs/cognoma-machine-learning/lib/python3.5/site-packages/matplotlib/backends/backend_agg.py in print_png(self, filename_or_obj, *args, **kwargs)
    535         close = False
    536         try:
--> 537             _png.write_png(renderer._renderer, filename_or_obj, self.figure.dpi)
    538         finally:
    539             if close:

RuntimeError: libpng signaled error

<matplotlib.figure.Figure at 0x1c6185be0>
```
The error seems to occur on this line (although for a slightly earlier version of the matplotlib codebase).
Not sure what's going on. Are you using the `cognoma-machine-learning` environment? Maybe try reinstalling libpng within the environment by running `conda install --force libpng`.
Also @mans2singh, note how I updated your previous post to add triple backticks to put the error message in a code block.
@dhimmel - I will try your recommendation. Thanks.
@dhimmel - Your recommendation to install libpng worked. I've also added the distplot, which is highly skewed though. @yl565 - I've also added the min_samples_split param to the CV grid and increased the features to 5000.
Please let me know if you have any other recommendations.
Thanks for your feedback and help.
Nice, looks good to me. I'll let @yl565 do the merging (I recommend squashing) in case he has any additional comments.
Overall this looks good. As expected, greater tree depth gives better results, since random forests usually benefit from larger trees with low bias. You should include `None` (representing no limit) as a value of `max_depth` to be grid searched; otherwise the `min_samples_split` range of (2, 5) will not give different results. We have thousands of samples, so try something like `min_samples_split: [2, 10, 50, 250]`.
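The suggested search space, written as a parameter-grid dictionary (values taken from the comment above; the variable name `param_grid` is just the conventional one for scikit-learn's `GridSearchCV`):

```python
# None removes the depth limit so trees grow fully; only then does
# varying min_samples_split actually change the fitted trees.
param_grid = {
    'max_depth': [1, 2, 3, 4, None],
    'min_samples_split': [2, 10, 50, 250],
}
```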
@yl565 - I've updated the notebook based on your comments. Please let me know if you have further suggestions.
Use `max_depth = [1, 2, 3, 4, None]`. I would expect `None` to give better results, since it allows creating fully grown trees. You may use a very large number like 1000 instead of `None` to make the results easier to visualize.
@yl565 - It was my oversight. I've changed `max_depth` to [1, 10, 100, 1000] and corrected the axis labels. Please let me know if you see any other issues. Thanks
Good work. Merged.
Hey Folks:
I have taken a stab at implementing a random forest classification algorithm. I've commented out some of the SGD-related coefficient code. Please let me know if you have any advice/recommendations to improve it.
Thanks