a. The process of constructing a predictive model involves several key stages: identifying pertinent variables, splitting the data into training and testing sets, inspecting the splits to confirm they are ready, interrogating the training data to generate the model, and interpreting the model to address the research questions.
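To remind myself what the splitting stage looks like in practice, here is a minimal Python sketch (the textbook works in R; the toy dataset and the 75/25 proportion here are just my own placeholder assumptions):

```python
import random

# Hypothetical toy dataset: each row is (features, outcome).
data = [([i, i % 3], i % 2) for i in range(100)]

# Split the data: hold out 25% for testing, train on the rest.
random.seed(42)            # fix the seed so the split is reproducible
random.shuffle(data)
cut = int(len(data) * 0.75)
train, test = data[:cut], data[cut:]

print(len(train), len(test))  # 75 25
```

After this split, the model is fit only on `train`, and `test` is touched once, at the very end, to evaluate it.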
b. Supervised learning involves designating a set of variables as features used to predict a specific outcome variable.
c. To arrive at an appropriate model, the model type, algorithm, features, engine, and hyperparameters can all be adjusted.
For me, this chapter appears more straightforward than the one on unsupervised learning. While I can follow the steps outlined in the textbook, executing them independently might be challenging because of the many functions involved: I am unsure which functions to use, or how to navigate them effectively, if I need to do it on my own.
In this chapter, my focus is more on understanding the rationale behind the code and the concepts it implements. For instance, "bootstrapping" (referenced from https://blog.csdn.net/qq_46110061/article/details/127826247) is illustrated vividly in that article with the analogy of threading a shoelace, going in and out, that is, sampling with replacement. "Cross-validation" is clearly explained in https://blog.csdn.net/SongGu1996/article/details/100704276. Additionally, set.seed (as explained in https://www.bilibili.com/read/cv25281382/) ensures the reproducibility of results: different random numbers lead to distinct outcomes, which is why some professors even use this function as a safeguard against plagiarism.
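To check my own understanding of these two resampling ideas and of seeding, here is a small Python sketch (my own illustration with made-up numbers, not the textbook's R code; Python's `random.seed` plays the role of R's `set.seed`):

```python
import random

random.seed(123)  # like R's set.seed(): fixes the random stream for reproducibility

data = list(range(1, 11))  # toy sample of 10 observations

# Bootstrapping: draw n observations WITH replacement (the "in and out" of the
# shoelace analogy), so some values repeat and others are left out of a resample.
boot = random.choices(data, k=len(data))

# k-fold cross-validation: partition the data into k folds; each fold takes one
# turn as the validation set while the remaining folds serve as training data.
k = 5
random.shuffle(data)
folds = [data[i::k] for i in range(k)]
for held_out in folds:
    training = [x for x in data if x not in held_out]
    # ...fit on `training`, evaluate on `held_out`...
```

Because the seed is fixed at the top, rerunning the script reproduces exactly the same bootstrap resample and the same fold assignments.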
To help myself do predictive analysis independently, I found a video (https://www.youtube.com/watch?v=1xw915rbyG4) that surveys predictive analysis techniques. It begins with regression, which explores relationships between variables; then classification, which is similar to labeling things; then clustering, which groups similar data points to separate the similar from the dissimilar. Time series analysis focuses on how data evolves over time, and forecasting predicts new data from past observations. The problem statement, target variable, data size, and linear separability all influence the choice of model: linear regression is recommended for modeling an outcome from independent variables, logistic regression for binary outcomes, and decision trees are commonly used in quantitative fields such as finance, stocks, and brokerage.