Datasets and Research Questions

paradise1260 commented 2 years ago

Hello team,

Please provide 2 research questions and datasets under this issue.

GloriaWYY commented 2 years ago

Glass Identification:

Classify the type of glass (7 types) based on the refractive index and the weight percent of the oxides of sodium, magnesium, etc.
10 features: all numerical
shape: 214 rows by 11 columns
Target with 7 classes
Can do decision tree, SVC, logistic regression
Potential problem: there are a lot of 0's in the dataset. Are they true values of the attribute or missing values?
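One way to size up the zero question before deciding how to treat it is to measure how often each column is exactly 0. The sketch below is only illustrative: `zero_share` is a hypothetical helper, and the toy rows and their column names are made up for the example, not taken from the real file.

```python
import pandas as pd


def zero_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of exact zeros in each numeric column, largest first."""
    numeric = df.select_dtypes(include="number")
    return (numeric == 0).mean().sort_values(ascending=False)


# Hypothetical toy rows, NOT real measurements from the glass dataset.
toy = pd.DataFrame(
    {
        "RI": [1.52, 1.51, 1.53, 1.52],
        "Mg": [4.49, 0.0, 0.0, 3.60],
        "Ba": [0.0, 0.0, 0.0, 0.0],
    }
)

shares = zero_share(toy)
print(shares)
```

A column that is zero in most rows is more plausibly a true "oxide not present" value than a missing-value code, but that is a judgment to confirm against the dataset description.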

Diabetes prediction dataset

Classify whether a patient has diabetes based on multiple indicators/symptoms such as age, gender, whether there is a sudden weight loss or not (yes/no).
16 features: age is numerical, the others are binary (gender, presence of a symptom)
shape: 520 rows by 17 columns
target is binary: positive/negative
Can do decision tree, SVC, logistic regression

gfairbro commented 2 years ago

Wine Quality Classification:

Question: What features most affect a wine's quality rating? We can use Logistic Regression and map out the coefficients that most affect the wine. Or simply do an ML classification model, using any of the more advanced models we have learned.
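A rough sketch of the coefficient idea, assuming scikit-learn and standardized features so that coefficient magnitudes are comparable. The data here is synthetic (only `feat_a` drives the label) and the feature names are placeholders, not wine columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the wine dataset): only "feat_a" drives the label.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["feat_a", "feat_b", "feat_c"])
y = np.digitize(X["feat_a"], bins=[-0.5, 0.5])  # three classes: 0, 1, 2

# Scale first so coefficients are on a comparable footing across features.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# coef_ has shape (n_classes, n_features); average the magnitudes over classes.
coefs = model.named_steps["logisticregression"].coef_
importance = pd.Series(np.abs(coefs).mean(axis=0), index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Averaging absolute coefficients over classes is just one possible summary; with correlated features the ranking should be read with caution.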

Over 6000 examples and 12 features; the target is multiclass.

gfairbro commented 2 years ago

Stock Performance projection:

Question: Predict which stocks will have the highest rate of return the following week based on features and performance from the week prior. Alternatively, we could look at which indicators are most likely to lead to a positive performance in the following week. This is a regression problem, not a classification problem.

750 examples, 16 features.

paradise1260 commented 2 years ago

Online shoppers purchasing intention

Question: Predict whether an online shopper will end up making a purchase or not.

The dataset has 12330 examples and 17 features. The task is classification.

Defaulting credit cards

Question: Predict whether a client will default on their credit card or not.

The dataset has 30000 examples and 23 features. The task is classification.

Luming-ubc commented 2 years ago

I have found two suitable datasets. Reasons: both datasets are relatively clean and easy to work with. The description pages linked below are worth reading (very clear and straight to the point). data.info() gives an overview of the raw dataset.

Predict student performance (Regression problem)
https://archive-beta.ics.uci.edu/ml/datasets/student+performance
data.info()
RangeIndex: 649 entries, 0 to 648
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype
 0   school      649 non-null    object
 1   sex         649 non-null    object
 2   age         649 non-null    int64
 3   address     649 non-null    object
 4   famsize     649 non-null    object
 5   Pstatus     649 non-null    object
 6   Medu        649 non-null    int64
 7   Fedu        649 non-null    int64
 8   Mjob        649 non-null    object
 9   Fjob        649 non-null    object
 10  reason      649 non-null    object
 11  guardian    649 non-null    object
 12  traveltime  649 non-null    int64
 13  studytime   649 non-null    int64
 14  failures    649 non-null    int64
 15  schoolsup   649 non-null    object
 16  famsup      649 non-null    object
 17  paid        649 non-null    object
 18  activities  649 non-null    object
 19  nursery     649 non-null    object
 20  higher      649 non-null    object
 21  internet    649 non-null    object
 22  romantic    649 non-null    object
 23  famrel      649 non-null    int64
 24  freetime    649 non-null    int64
 25  goout       649 non-null    int64
 26  Dalc        649 non-null    int64
 27  Walc        649 non-null    int64
 28  health      649 non-null    int64
 29  absences    649 non-null    int64
 30  G1          649 non-null    int64
 31  G2          649 non-null    int64
 32  G3          649 non-null    int64
dtypes: int64(16), object(17)
memory usage: 167.4+ KB

Classify origin of wines (Classification problem)
https://archive-beta.ics.uci.edu/ml/datasets/wine
Use 13 features to classify one of three origins; 178 entries.
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline
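For a quick look at the wine origin data without downloading anything, scikit-learn ships a copy of this dataset as `load_wine`. A small sketch, assuming scikit-learn and pandas are installed, that checks the shape and class balance described above:

```python
from sklearn.datasets import load_wine

# scikit-learn's bundled copy of the UCI wine (origin) dataset.
wine = load_wine(as_frame=True)
X, y = wine.data, wine.target

print(X.shape)                        # rows x feature columns
print(list(X.columns))                # the 13 chemical measurements
print(y.value_counts().sort_index())  # number of wines per origin class

class_counts = y.value_counts().sort_index().tolist()
```

The three classes are not perfectly balanced, so a stratified split is probably a sensible default if we use this data.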

Notes: I also think the wine quality dataset is interesting (Regression problem).
https://archive-beta.ics.uci.edu/ml/datasets/wine+quality
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB

I was thinking of including both wine datasets to form a research idea, since they are closely related, just in case we run short of work to do. We could also just take one. :)

gfairbro commented 2 years ago

Here are my top 4:

Wine Quality
Diabetes
Stock Prediction
CC Default

Luming-ubc commented 2 years ago

My top 4:

Wine quality (interesting to me, preferred)
Others (also good datasets and doable): Default credit card, diabetes, student performance

GloriaWYY commented 2 years ago

My top 4:

Online shopping
Wine quality
Credit cards
Diabetes

All the datasets we chose seem very interesting, but I think a larger dataset might be better, hence my ranking.

paradise1260 commented 2 years ago

My top 4:

Wine quality
Online shopping
Student performance
Diabetes
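For reference, the four top-4 lists above can be tallied with a few lines of Python. This is just a sketch: the dataset labels are normalized by hand (e.g. "CC Default" and "Credit cards" both become "credit card default"), and only first choices and total mentions are counted, since one list does not rank its last three entries.

```python
from collections import Counter

# Top-4 lists as posted in this thread, with labels normalized by hand.
rankings = {
    "gfairbro": ["wine quality", "diabetes", "stock prediction", "credit card default"],
    "Luming-ubc": ["wine quality", "credit card default", "diabetes", "student performance"],
    "GloriaWYY": ["online shopping", "wine quality", "credit card default", "diabetes"],
    "paradise1260": ["wine quality", "online shopping", "student performance", "diabetes"],
}

# How many people put each dataset first, and how many listed it at all.
first_choices = Counter(picks[0] for picks in rankings.values())
mentions = Counter(name for picks in rankings.values() for name in picks)

winner, votes = first_choices.most_common(1)[0]
print(f"Most first-place votes: {winner} ({votes} of {len(rankings)})")
print(mentions.most_common())
```

Wine quality is the first choice on three of the four lists and appears on all four.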

gfairbro commented 2 years ago

Looks like we have a winner! I love me some wine... 👍

My follow up for the formal question: Predict a wine's quality based on a set of chemical composition features

Possible stretch questions:

What are the top contributing features to a high quality wine?
By combining a second dataset, can we estimate the probability that a wine came from each of the 3 origins in the "Classify origin of wines" dataset?
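As a rough illustration of the origin-probability stretch question, a classifier's `predict_proba` gives one probability per origin for each wine. The sketch below assumes scikit-learn and trains only on its bundled copy of the wine origin dataset (`load_wine`); the model choice and split settings are placeholders, not a decided design.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the wine origin dataset: 13 features, 3 origin classes.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# One row per test wine, one column per origin; each row sums to 1.
proba = model.predict_proba(X_test)
print(proba[:3].round(3))
```

Note that the wine quality columns differ from the origin dataset's columns, so applying a model like this to the quality data would need a shared feature subset; how well that works is an open question.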

Luming-ubc commented 2 years ago

Add possible repo rename: Wine Quality Predictor

UBC-MDS / DSCI_522_group09_Wine_Quality_Predictor
