Closed alisongh closed 2 years ago
Make sure we don't reuse existing or already-published projects found online.
Project Idea: What gender is this name?
Inspiration: The Most Common Unisex Names In America: Is Yours One Of Them? https://fivethirtyeight.com/features/there-are-922-unisex-names-in-america-is-yours-one-of-them/
Dataset Sources: • https://www.back4app.com/database/back4app/list-of-names-dataset • https://data.world/datasets/names
Supervised Problem: Given a name, predict whether it is a male, female, or unisex name • Multiclass classification
Unsupervised Problem: Cluster names to explore how they group (similar roots? common vs. uncommon names? different origins?) • Clustering
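To get a feel for what "similar roots" clustering might look like, here is a minimal sketch using only the Python standard library. It is not a proposal for the final method: it swaps in a simple greedy "leader" clustering over character-bigram Jaccard similarity instead of something like k-means, and the function names, the 0.25 threshold, and the sample names are all assumptions made up for illustration.

```python
def bigrams(name):
    """Set of character bigrams, with ^ and $ marking the start and end of the name."""
    padded = f"^{name.lower()}$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_cluster(names, threshold=0.25):
    """Greedy clustering: each name joins the first cluster whose leader
    (the cluster's first member) is similar enough, else starts a new cluster.
    The threshold is an arbitrary illustrative value, not a tuned one."""
    clusters = []  # list of (leader_bigrams, members)
    for name in names:
        grams = bigrams(name)
        for leader_grams, members in clusters:
            if jaccard(grams, leader_grams) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((grams, [name]))
    return [members for _, members in clusters]


# Hypothetical sample, not taken from either dataset.
sample = ["Katherine", "Catherine", "Kathryn", "John", "Jon", "Johnny"]
for group in leader_cluster(sample):
    print(group)
```

The result depends on input order and on the threshold, so on the real datasets we would probably want a proper distance matrix and a standard clustering algorithm; this is only meant to show that spelling similarity alone already pulls variants of the same name together.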
If you remember, we did a comic book exercise in our data viz class: https://fivethirtyeight.com/features/women-in-comic-books/ We can use these comic characters' names as unseen data once the model is done.
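As a rough sketch of the supervised side, a character-level Naive Bayes baseline can be written with the standard library alone. Everything here is an assumption for illustration: the feature choice (first letter, last letter, last two letters), the class `NameNB`, and the ten training names, which are invented and not drawn from the datasets above. A real run would train on the name datasets and hold out the comic character names as the unseen test set.

```python
import math
from collections import Counter, defaultdict


def features(name):
    """Simple character features: last letter, last two letters, first letter."""
    n = name.lower()
    return [f"last1={n[-1:]}", f"last2={n[-2:]}", f"first={n[:1]}"]


class NameNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, names, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for name, label in zip(names, labels):
            for f in features(name):
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(count / total)
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            for f in features(name):
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data, three classes as in the problem statement.
train_names = ["Anna", "Maria", "Sophia", "Emma",
               "John", "Robert", "David", "Mark",
               "Taylor", "Jordan"]
train_labels = ["female"] * 4 + ["male"] * 4 + ["unisex"] * 2

model = NameNB().fit(train_names, train_labels)
for unseen in ["Julia", "Albert"]:
    print(unseen, "->", model.predict(unseen))
```

With this few examples the model mostly keys on name endings, which is exactly the kind of shortcut we would want to check against the unisex class once real data is in.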
Project Idea: Emotion Recognition through Tweets
Inspiration: Analysis of Emotion Data: A Dataset for Emotion Recognition Tasks https://towardsdatascience.com/analysis-of-the-emotion-data-a-dataset-for-emotion-recognition-tasks-6b8c9a5dfe57
Dataset: https://huggingface.co/datasets/emotion
It's good that the data is already preprocessed and that someone has already done some EDA on it. Although the dataset is also available on Kaggle, no one has done much with it yet.
Project Idea: India Bank Customer Segmentation
Source: https://www.kaggle.com/datasets/shivamb/bank-customer-segmentation
This has 1M+ transactions to play with. It's on Kaggle, and only 2 people have really done any clustering on it. Neither offers much in-depth interpretation of the results, neither did much feature engineering, and it feels like the ML techniques were applied only for the sake of applying them. With this much data, we can actually do a lot of things!
Here are some suggested by the author:
We can also do:
Project Idea: Paid Parking Demand Prediction
Dataset: https://data.seattle.gov/Transportation/Paid-Parking-Transaction-Data/gg89-k5p6
The City of Seattle has made its paid parking transaction dataset available for public use for research and entrepreneurial purposes under the City's Open Data Program. This dataset is derived from parking pay stations placed on streets within city limits and from paid-by-phone parking transactions. The dataset is downloaded nightly with the prior day's paid parking transaction data.
My Comment: There are about 192K records detailing each meter transaction. We can do some clustering to see what patterns turn up in the data, and we can do supervised learning to predict meter usage for the next x periods. The data is fairly clean, and not many people have worked with it yet.
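For the forecasting part, a sensible first step is a baseline that any real model has to beat. The sketch below, standard library only, buckets transactions by hour and predicts future usage as the historical average for the same weekday and hour. The input format (ISO 8601 timestamp strings) and all function names are assumptions; the actual column names and timestamp format in the Seattle export would need to be checked.

```python
from collections import defaultdict
from datetime import datetime


def hourly_counts(timestamps):
    """Count transactions per clock hour from ISO 8601 timestamp strings."""
    counts = defaultdict(int)
    for ts in timestamps:
        t = datetime.fromisoformat(ts)
        counts[t.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(counts)


def hour_of_week_baseline(counts):
    """Average observed count for each (weekday, hour) slot.
    Note: hours with zero transactions never appear in `counts`,
    so these averages are biased upward for quiet slots."""
    slots = defaultdict(list)
    for hour_start, c in counts.items():
        slots[(hour_start.weekday(), hour_start.hour)].append(c)
    return {slot: sum(v) / len(v) for slot, v in slots.items()}


def forecast(baseline, when):
    """Predict usage at `when` as the average for its weekday/hour slot."""
    return baseline.get((when.weekday(), when.hour), 0.0)


# Hypothetical transactions: two on one Monday at 9am, one the next Monday.
sample = ["2023-01-02T09:15:00", "2023-01-02T09:45:00", "2023-01-09T09:05:00"]
baseline = hour_of_week_baseline(hourly_counts(sample))
print(forecast(baseline, datetime(2023, 1, 16, 9, 30)))
```

The same hourly counts, pivoted per pay station, could also serve as the feature vectors for the clustering idea (for example, grouping stations by their weekly usage profile).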
The Seattle government runs this open data program, and there are actually lots of interesting, clean datasets we can explore!
https://data.seattle.gov/browse?sortBy=most_accessed&utf8=%E2%9C%93
Project Idea: Predict Rental Prices
The UMich library has a dataset in Data Planet available for download, but getting it requires manually clicking through all the states, counties, and bedroom types, exporting each as Excel, and cleaning all the files. The data is an aggregated rental price for the year, sample csv file below:
Similar to house price prediction.