data-mining-group-project / mood-classifier

Spotify Mood Classifier based on playlist names (Sad or Happy)

Research Random Forest #20

Open elisabettad opened 5 years ago

elisabettad commented 5 years ago

Understand how to apply Random Forest to the project, including scaling and standardisation

elisajw commented 5 years ago

Definition: Random Forest uses a bagging approach to build many decision trees, each on a random subset of the data. It is the same idea as a Decision Tree, but with more subsets of the data (more starting points). No single tree sees all the data points and variables: the algorithm randomly samples data points for each tree (and, in the randomForest package, a random subset of variables at each split), then combines the outputs of the trees into a single prediction. This reduces the overfitting (variance) that a single decision tree is prone to, and usually improves predictive power. It would be worth confirming this on a sample data set by comparing the accuracy of Random Forest and a single Decision Tree.
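To make the sampling step concrete, here is a minimal sketch in plain Python (not the project's R code). It only shows which rows and variables each tree would be given; it does not train anything. The feature names are hypothetical, and drawing the variable subset once per tree is a simplification: the randomForest package draws a fresh subset at every split.

```python
import random

def draw_tree_inputs(n_rows, feature_names, n_features_per_tree, rng):
    """Pick the rows and features one tree in the forest would be trained on."""
    # Bootstrap sample: draw n_rows row indices WITH replacement,
    # so some rows repeat and some are left out.
    row_indices = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Random feature subset: draw features WITHOUT replacement.
    # (Simplified to once per tree; real random forests redraw at each split.)
    features = rng.sample(feature_names, n_features_per_tree)
    return row_indices, features

rng = random.Random(42)  # fixed seed so the sketch is repeatable
feature_names = ["danceability", "energy", "valence", "tempo"]  # hypothetical
forest_inputs = [draw_tree_inputs(10, feature_names, 2, rng) for _ in range(3)]

for i, (rows, feats) in enumerate(forest_inputs):
    print(f"tree {i}: rows={sorted(rows)} features={feats}")
```

Each tree ends up with a different view of the data, which is what lets the combined forest behave differently from any single tree.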

Use: For classification, the final prediction of the random forest is derived by polling the results of every decision tree, i.e. going with the prediction that appears most often across the trees (a majority vote).
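The polling step can be sketched in a few lines of Python. The list of votes below is made up for illustration and stands in for the per-tree predictions on one playlist.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted by the largest number of trees."""
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label

# Hypothetical predictions from five trees for a single playlist.
votes = ["Happy", "Sad", "Happy", "Happy", "Sad"]
print(majority_vote(votes))  # Happy (3 votes against 2)
```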

The R library to implement this is randomForest. A code sample can be found at this link: https://www.r-bloggers.com/how-to-implement-random-forests-in-r/

Do we need to scale the data when using Random Forest? No. Random Forests are based on tree partitioning algorithms. There is no analogue of the coefficients one obtains in general regression strategies, which would depend on the units of the independent variables. Instead, one obtains a collection of partition rules, each basically a decision against a threshold, and these should not change with scaling. In other words, the trees only see the ranks of the feature values.

Fine-tuning the parameters of the Random Forest model can be done with the code below:

    library(randomForest)
    # 500 trees, 6 candidate variables tried at each split, keep variable importance
    model2 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
    model2

Note that mtry has nothing to do with scaling: it is the number of variables randomly sampled as candidates at each split. If it is not set, randomForest uses floor(sqrt(p)) for classification, where p is the number of predictors, which is presumably where the default of 2 seen in the tutorial comes from.

https://stackoverflow.com/questions/8961586/do-i-need-to-normalize-or-scale-data-for-randomforest-r-package
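The "trees only see ranks" argument can be checked with a small experiment. The sketch below, in plain Python with made-up tempo values and labels, finds the best single threshold split on a raw feature and on its standardised version, then compares which rows land in the left branch. The helper names are hypothetical and the split criterion (fewest misclassified rows) is a simplification of what a real tree uses.

```python
import statistics
from collections import Counter

def minority_count(indices, labels):
    """Rows in this branch that disagree with the branch's majority label."""
    counts = Counter(labels[i] for i in indices)
    return len(indices) - max(counts.values())

def best_split_partition(values, labels):
    """Return the row indices sent left by the best single threshold split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_errors, best_left = None, None
    for cut in range(1, len(order)):
        # A threshold cannot separate two rows with the same value.
        if values[order[cut - 1]] == values[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        errors = minority_count(left, labels) + minority_count(right, labels)
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left

# Made-up data: one feature and the mood label of each playlist.
tempo = [120, 80, 150, 95, 140, 110]
labels = ["Happy", "Sad", "Happy", "Sad", "Happy", "Sad"]

# Standardise the feature: subtract the mean, divide by the standard deviation.
mean, sd = statistics.mean(tempo), statistics.pstdev(tempo)
tempo_scaled = [(t - mean) / sd for t in tempo]

raw_left = best_split_partition(tempo, labels)
scaled_left = best_split_partition(tempo_scaled, labels)
print(raw_left == scaled_left)  # True: the same rows go left either way
```

The threshold value itself changes with the units, but the partition of the rows does not, so the resulting tree makes the same predictions.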

AnneliseCanesso commented 5 years ago

https://drive.google.com/drive/folders/1re-aC6ewDAQNEXukk2gjqTUBrVN_t_A-?usp=sharing

In reply to elisajw's comment above (Sun 7 Apr 2019, 23:50).


-- Best Regards, Annelise Canesso