Open egetachew1 opened 10 months ago
Solid plan! For the prediction modeling part -- we don't do training/testing sets in Stat135 . . . have you all encountered this before in other stat courses or contexts?
Blog plan: 10/10
I have some experience with machine learning from a class I took over the summer, where we used R and some Python. Our textbook was An Introduction to Statistical Learning with Applications in R.
We are open to exploring and furthering our knowledge on this topic. Please let us know if you have any recommended resources.
Best, Ephrata Getachew
On Nov 10, 2023, at 11:01 AM, Katharine Correia wrote:
Solid plan! For the prediction modeling part -- we don't do training/testing sets in Stat135 . . . have you all encountered this before in other stat courses or contexts?
Blog plan: 10/10
I also have experience building prediction models, both in a class and outside of one. In addition, I still have access to some DataCamp tutorials. They are taught in Python, but I believe they could be translated into R and implemented in RStudio.
Status Update 1
In this week's checkpoint, we did data wrangling, focusing on merging and reshaping datasets related to macroeconomic indicators. We combined data from gni_gdp_lifeexp.csv and expected_years_of_schooling.csv, renaming columns for clarity and keeping only the years 2000 to 2022. The main challenge that emerged was missing values: we were unsure how to handle and impute them without compromising the integrity of our analysis.
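For anyone following along, the merge-and-filter step described above might look roughly like the sketch below. It is written in plain Python for illustration and is not the project's actual code: the column names (country, year, gni_per_capita, expected_schooling) and every value are made up, and only the 2000 to 2022 window comes from the update. It also flags rows with missing values rather than imputing them, since the imputation method is not described here.

```python
# Toy rows standing in for the two CSV files; all names and values are made up.
gni_rows = [
    {"country": "A", "year": 1999, "gni_per_capita": 1000.0},
    {"country": "A", "year": 2000, "gni_per_capita": 1100.0},
    {"country": "B", "year": 2000, "gni_per_capita": 2000.0},
]
school_rows = [
    {"country": "A", "year": 2000, "expected_schooling": 10.5},
    {"country": "B", "year": 2000, "expected_schooling": None},  # missing value
    {"country": "B", "year": 2023, "expected_schooling": 12.0},
]


def merge_and_filter(left, right, first_year=2000, last_year=2022):
    """Inner-join two row lists on (country, year), keeping years in range."""
    right_by_key = {(r["country"], r["year"]): r for r in right}
    merged = []
    for row in left:
        key = (row["country"], row["year"])
        if key in right_by_key and first_year <= row["year"] <= last_year:
            # Combine the columns of the matching rows into one record.
            merged.append({**row, **right_by_key[key]})
    return merged


merged = merge_and_filter(gni_rows, school_rows)

# Rows that still contain a missing value and would need imputing or dropping.
missing = [r for r in merged if any(v is None for v in r.values())]
```

Listing the incomplete rows first makes it easier to see how much of the data any imputation choice would touch.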
OK, so you are on track! We discussed the missingness, I imputed values, and your dataset is all set (I think).
Status Update 1: 5/5
Status Update 2
We used k-means clustering to categorize countries based on their economic data. We chose four clusters and focused on recent data for each country, considering life expectancy, GNI per capita, and expected years of schooling. To make comparisons fair, we standardized these factors. We added this clustering information back into our dataset called macro_trends_with_clusters.
Additionally, we created an interactive 3D scatter plot using plotly to show how countries are grouped based on these economic indicators. The main issue we had was with scaling, but we resolved it. Next, we will be working on the aesthetics/interface part.
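To make the standardize-then-cluster idea concrete, here is a minimal sketch in plain Python. It is not the project's code. The eight rows of (life expectancy, GNI per capita, expected years of schooling) are invented, the centroids are seeded with the first four rows so the result is reproducible, and the loop is plain Lloyd's algorithm, which may differ from whatever k-means routine was actually used.

```python
import math


def standardize(rows):
    """Z-score each column so no single indicator dominates the distances."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Invented indicator rows: (life expectancy, GNI per capita, expected schooling).
indicators = [
    (55, 1000, 8.0), (65, 5000, 11.0), (75, 20000, 14.0), (82, 50000, 17.0),
    (56, 1200, 8.5), (66, 5500, 11.5), (76, 22000, 14.5), (83, 55000, 17.5),
]
scaled = standardize(indicators)
labels, centroids = kmeans(scaled, k=4)
```

Without the standardization step, GNI per capita (in the thousands) would swamp the other two indicators in the distance calculation, which is the kind of scaling problem mentioned above.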
Great!
Status Update 2: 5/5
1. Do you plan for your final project to be an extension of the mid-semester project?
The final project will be an extension of the mid-semester project: it will still explore macroeconomic trends across different countries, but this time it will center on a different set of data. Namely, we will be working with the indicators that contribute to the Human Development Index (HDI), such as life expectancy at birth, expected and mean years of schooling, and GNI per capita. We will use unsupervised learning to cluster countries on these variables and predict which HDI category a country might belong to. Since an HDI rank list already exists, we will be able to compare the predicted clusters against the actual rankings and evaluate how accurate our model is.
In addition to clustering, we would like to attempt building a prediction model that predicts a country’s GDP from HDI indicators (individual and/or combined) and compare the accuracy of the different predictions. To do that, we will implement a supervised learning algorithm and partition our data into training and testing sets.
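Since training/testing sets are not covered in Stat135, here is a minimal sketch of the idea in plain Python. It is not the project's model: the data are invented points lying exactly on a line, the single predictor stands in for one HDI indicator, and the 25% test fraction and fixed seed are arbitrary choices. The model is fit on the training rows only, and its error is then measured on the held-out test rows.

```python
import math
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows, then hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x, one predictor."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(pairs, intercept, slope):
    """Root mean squared prediction error on the given rows."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in pairs]
    return math.sqrt(sum(squared) / len(squared))


# Invented (predictor, outcome) pairs lying exactly on y = 2 + 3x.
data = [(float(x), 2.0 + 3.0 * x) for x in range(12)]

train, test = train_test_split(data)
intercept, slope = fit_line(train)       # fit on training rows only
test_error = rmse(test, intercept, slope)  # evaluate on unseen rows
```

With real, noisy data the test error will be above zero, and comparing it across candidate models (different indicators, individual or combined) is one way to compare their predictive accuracy.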
2. Describe what you hope to deliver as a final product. Will your blog include a published Shiny application? Will it incorporate an interactive map? Will it involve a predictive model that forecasts future values of some quantity using data that you’ve integrated?
The final product will be a comprehensive blog post that explores macroeconomic trends based on Human Development Index (HDI) indicators. The blog will include the results of unsupervised learning for clustering countries on HDI indicators, a predictive model that forecasts a country's GDP from HDI indicators, and reproducible code with explanations. The blog will not include a Shiny application or an interactive map; instead, it will focus on in-depth analysis of the data and the insights drawn from it.
3. Outline a schedule for your group’s progress that will take you from now (ideas phase) to final blog post and presentation at the end of the semester. During the last project, we had specific checkpoints for different phases of the project. Based on what you envision for your final blog post, identify checkpoints for your group and dates by which you plan to reach those checkpoints. Hold each other accountable, so you’re not waiting until the last minute to do things! In particular, you should have at least one checkpoint each week (ideally two) identifying what work you expect to complete by then.