coderschoolreview opened 6 years ago
Assignment 2
The goal of this assignment was to introduce you to three new classification techniques and to understand how to select the best parameters and features for them. You learned how to use python built-in functions (GridSearchCV, SelectKBest, RFE, SelectFromModel) to try out new models (Support Vector Machines, Random Forests, and Logistic Regression) and test different permutations of parameter values and features, and analyze your results to help build better machine learning models.
Great job! Given that you don't have a programming background, your progress and work is really quite exceptional.
Here's what you did really well:
Some suggestions:
`ConvergenceWarning: The max_iter was reached which means the coef_ did not converge`

You can try fixing this by increasing the `max_iter` parameter of the underlying estimator (the default for `LogisticRegression` is 100); you can pass it through your `GridSearchCV` param grid.

Overall, amazing work. Keep it up!
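A minimal sketch of that fix (synthetic data; the grid values are just examples, not the assignment's):

```python
# Raising max_iter on the estimator (not on GridSearchCV itself) to avoid
# the ConvergenceWarning. Dataset and parameter values are invented.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# max_iter belongs to LogisticRegression; GridSearchCV just passes it through.
grid = GridSearchCV(
    LogisticRegression(),
    param_grid={"C": [0.1, 1, 10], "max_iter": [1000]},  # default is 100
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```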
Assignment 3
The goal of this assignment was to introduce you to three new Natural Language Processing techniques, and to understand how to perform some basic sentiment analysis on song lyrics using these methods. You learned how to clean and prepare textual information for NLP, and then apply the following approaches: Bag Of Words, TF-IDF, and Doc2Vec. You used your prior knowledge of Python estimators, feature selection, and parameter optimization techniques to produce feature vectors from these NLP methods to make predictions on the moods of songs using their lyrics.
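For reference, the TF-IDF step could be sketched like this (the toy "lyrics" are invented for illustration):

```python
# Turning lyric-like text into TF-IDF feature vectors that a classifier
# can consume. The three toy songs below are made up.
from sklearn.feature_extraction.text import TfidfVectorizer

lyrics = [
    "sunshine and happy days",
    "tears fall in the rain",
    "dancing all night long",
]
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(lyrics)  # sparse matrix: one row per song
print(X.shape)
```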
As usual -- great job with the assignment! You always present your work in a clear, structured way showing your thought process and exploring lots of different options and combinations of classifiers, parameters, and feature optimization techniques.
Here are a couple of notes:
Sometimes you notice that your `GridSearchCV` best estimator returns a score lower than the one you got before running `GridSearchCV`. This is likely because your initial score came from a single `train_test_split`, whereas `GridSearchCV` uses cross-validation internally, so the two scores can't really be compared. Instead, try finding the mean cross-validated score before `GridSearchCV`, then get the best estimator and find its mean cross-validated score. This gives you a fairer, like-for-like comparison.
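A sketch of that fairer comparison on synthetic data (the classifier and grid are just examples):

```python
# Compare mean CV score of a baseline model against mean CV score of
# GridSearchCV's best estimator, rather than a single train/test score.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Mean cross-validated score before tuning (not a single split).
baseline = cross_val_score(SVC(), X, y, cv=5).mean()

grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)

# Mean cross-validated score of the best estimator: now comparable.
tuned = cross_val_score(grid.best_estimator_, X, y, cv=5).mean()
print(f"baseline CV: {baseline:.3f}, tuned CV: {tuned:.3f}")
```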
`RFE` works via recursion (you can learn more about it here), and it can generally take a long time; running `RFE` over 5,737 features is an exorbitant amount! This is because `RFE` re-runs the `fit` and `predict` steps on every cycle, re-evaluating feature importances, trimming the feature set down again, and repeating the whole process. So be careful. In such cases it is a better idea to use `SelectKBest` or `SelectFromModel`, as these are simple single-pass methods and take much less time (as you found out!).
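For illustration, a small sketch contrasting the two approaches (feature counts are invented, far smaller than the assignment's 5,737):

```python
# SelectKBest scores every feature once and keeps the top k; RFE refits
# the model each round, dropping `step` features per cycle, so it costs
# many full fits. Data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# One scoring pass over all features.
X_kbest = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Repeated refits: (50 - 10) / 5 = 8 model fits before settling on 10.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=5)
X_rfe = rfe.fit_transform(X, y)

print(X_kbest.shape, X_rfe.shape)
```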
Overall, great work!
Assignment 1

The goal of this assignment was to introduce you to two main concepts in Machine Learning: Data Pre-processing and Classification. You learned how to query and clean data using the pandas library in Python, and built a simple Machine Learning classifier based on the K Nearest Neighbors algorithm.
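A hedged sketch of that workflow (column names and data are invented, not the actual assignment dataset):

```python
# Clean a small DataFrame with pandas, then fit a K Nearest Neighbors
# classifier. All columns and values below are made up for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.DataFrame({
    "tempo": [120, 95, None, 140, 88, 132],
    "energy": [0.8, 0.4, 0.6, 0.9, 0.3, 0.7],
    "genre": ["pop", "blues", "blues", "pop", "blues", "pop"],
})
df = df.dropna()  # drop rows with missing values

X = df[["tempo", "energy"]]
y = df["genre"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```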
Things you did well:
`unique`, `sort`, `sample`, `value_counts`, `dropna`. Great job!

One minor tip: you don't necessarily need to iterate through an array to print it, i.e. `for i in view_genres: print(i)` can be substituted with just `view_genres` or `view_genres.tolist()`, etc.

Overall, excellent work! You are demonstrating that you understand the material and doing a great job of applying it. Keep it up!