UBC-MDS / DSCI522_group_12

MIT License

Project setup: Topic proposals #3

Closed d-sel closed 3 years ago

d-sel commented 3 years ago

Please submit proposals for topics.

Team suggestions are due by end of day Thursday, Nov 19, so we can decide. Voting closes at noon Friday. (Updated from 8am to allow time for clarification on files.)

HazelJJJ commented 3 years ago

http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

This is a dataset on default of credit card clients. It has 23 features, and the target is whether the client defaults on payment. I am thinking we could work on a prediction question and build an ML model on this dataset.

d-sel commented 3 years ago

Dataset relating to absenteeism at work. It has 21 features and 740 records. It has interesting features and is not too complicated. Could be a prediction question relating to if an employee would be absent based on time of the year or due to workload, etc.

http://archive.ics.uci.edu/ml/datasets/Absenteeism+at+work

HazelJJJ commented 3 years ago

I like your idea and dataset, but the data is a zip file. We will need to do work to read the data :(

d-sel commented 3 years ago

Thanks for noticing this. I searched on Slack and it looks like it's acceptable to download the csv into the data folder and use read_csv (see the #522 channel: 'Hi, for the milestone 1, do we have to use url to download data? Could we download the zip file and then read_csv() from the data folder?' 'Yes, you can totally do that!').

HazelJJJ commented 3 years ago

Thank you for pointing it out. I found the thread on Slack and raised my concern about reproducibility there. If reproducibility is not a concern, I like your data since it is less complex than mine.

larahabashy commented 3 years ago

Here is the data set I suggested in lab earlier. Binary classification of income levels would suit it, and we could apply many of the tools we learned in 571.

larahabashy commented 3 years ago

http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients

This is a dataset on default of credit card clients. It has 23 features, and the target is whether the client defaults on payment. I am thinking we could work on a prediction question and build an ML model on this dataset.

I really like this data set! The features are easy to interpret and the target class is binary which will make our lives easier. I've also checked to see the target class has some imbalance. This data set has my vote!
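For anyone who wants to repeat the imbalance check, here is a minimal Python sketch. The helper name `class_proportions` and the toy labels are illustrative assumptions, not values from the credit card data.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each class label in `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration only (NOT taken from the dataset):
# three non-defaults (0) and one default (1).
toy_labels = [0, 0, 0, 1]
print(class_proportions(toy_labels))
```

A split far from 50/50 would suggest looking at metrics other than plain accuracy when we evaluate the model.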

larahabashy commented 3 years ago

Dataset relating to absenteeism at work. It has 21 features and 740 records. It has interesting features and is not too complicated. Could be a prediction question relating to if an employee would be absent based on time of the year or due to workload, etc.

Super interesting data! Would the target variable be multiclass then? One for each absence reason in the data I'm assuming?

d-sel commented 3 years ago

Hi all,

Just a general comment, I asked about whether zip files were OK and they are; no issues with reproducibility as there are functions to download the zip files and save them programmatically.

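With R: download.file() (how-to instructions), then read_csv(unz("yourfile.zip", "yourfile.csv")). Note that unz() is an R function and needs the name of the file inside the archive.
With Python: pandas read_csv() can read a zip archive directly when it contains a single CSV.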
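A minimal Python sketch of the download-and-read step, using only the standard library so it runs without pandas. The function names `download_zip` and `read_csv_from_zip` are hypothetical, and the URL, paths, and delimiter in the usage note are placeholders to replace with the real ones for whichever dataset we pick.

```python
import csv
import io
import urllib.request
import zipfile
from pathlib import Path


def download_zip(url, dest):
    """Download `url` to `dest` unless the file already exists; return the path."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def read_csv_from_zip(zip_path, member=None, delimiter=","):
    """Return the rows (as dicts) of a CSV file stored inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        if member is None:
            # Fall back to the only CSV in the archive, if there is exactly one.
            csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
            if len(csv_names) != 1:
                raise ValueError(f"expected exactly one CSV, found {csv_names}")
            member = csv_names[0]
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text, delimiter=delimiter))
```

Usage would look something like `rows = read_csv_from_zip(download_zip(url, "data/raw.zip"))`, with `url` set to the archive's download link. Because the download is scripted, anyone cloning the repo can regenerate the data folder from scratch.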

I also found some more datasets on absenteeism, including one from StatsCan that connects it to the presence or absence of children. Just as another option: https://open.canada.ca/data/dataset/f46e97bd-09e9-4320-b963-bb5bf579d619?=undefined&wbdisable=true

d-sel commented 3 years ago

Hi Everyone, I vote that we use one of the absenteeism data sets. After reviewing the credit card transaction ones, I am not sure about how we can approach the different features as they seem to be the same feature repeated over different periods of time. If someone is confident about how to tackle this, I'm OK with using that one, as well.

larahabashy commented 3 years ago

Hey @d-sel, what would the response/target variable be for the absenteeism data set? I couldn't find a binary one.

As for the credit card transactions, one thing we could do is have a time series in a nested data frame for the changes over different periods.

d-sel commented 3 years ago

Hi Lara, I think you're right - the one I had provided had multi-class targets.

For credit cards, I don't believe we've learned how to work with time series in this situation. Could you elaborate on how we would phrase our statistical question with the time series in mind?

larahabashy commented 3 years ago

Hi Selma, I proposed time series as a way of making sense of the monthly payment features (i.e., X6-X11 together, X12-X17 together, etc.) and of tidying up the data by nesting those together for each row. The variables would still go into our classification model as explanatory variables.

The statistical question would be predicting default payments (1 or 0) given some information such as sex and age and credit card history.
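As a rough Python sketch of the nesting idea: the group names below are hypothetical, the first two column ranges come from the comment above, and X18-X23 is my assumed reading of "etc". Each row is treated as a plain dict keyed by column name.

```python
# Hypothetical group names. X6-X11 and X12-X17 follow the comment above;
# X18-X23 is an assumption about what "etc" covers.
MONTHLY_GROUPS = {
    "repayment_status": range(6, 12),   # X6 .. X11
    "bill_amount": range(12, 18),       # X12 .. X17
    "payment_amount": range(18, 24),    # X18 .. X23
}


def nest_monthly_features(row):
    """Collapse each block of monthly columns in `row` into one list-valued field."""
    grouped = {f"X{i}" for cols in MONTHLY_GROUPS.values() for i in cols}
    # Keep every column that is not part of a monthly block unchanged.
    nested = {key: value for key, value in row.items() if key not in grouped}
    for name, cols in MONTHLY_GROUPS.items():
        nested[name] = [row[f"X{i}"] for i in cols]
    return nested
```

Each nested list keeps the original column order, so the six-month sequence stays intact for plotting in the EDA. For the model itself we would likely still unnest these back into separate explanatory variables.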

d-sel commented 3 years ago

I am OK to go ahead with this.

HazelJJJ commented 3 years ago

So is the credit card dataset our final decision?

larahabashy commented 3 years ago

I guess so! Unless someone has better suggestions, we should go ahead and start working on the proposal and the EDA.