Ideas for EDA/Visualizations

jraza19 commented 3 years ago

Please provide your ideas for the EDA and the types of visualizations that would work well.

jraza19 commented 3 years ago

We have lots of categorical features and numerical features in our diabetes dataset:

Compare the different numerical variables against one another: repeated histograms that put one numerical variable on the x-axis against another variable
Do overlapping feature histograms to compare against the target (balanced dataset)
Look at the correlation of the variables against the target in a point correlation matrix

rachelywong commented 3 years ago

For extracting the diabetes_csv dataset only from the zip file we can use this code:

# Packages necessary for importing data (from a zip file containing 2 dataset CSVs)
import zipfile
from io import BytesIO
from urllib.request import urlopen

import pandas as pd

zip_file_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip"
# Download the archive and keep it in memory instead of writing it to disk
zip_file_load = urlopen(zip_file_url)
zipinmemory = BytesIO(zip_file_load.read())
zip_file = zipfile.ZipFile(zipinmemory)
# Only load the first file in the zip folder
diabetes_csv = pd.read_csv(zip_file.open(zip_file.namelist()[0]))
print(diabetes_csv.head())
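
Picking `zip_file.namelist()[0]` relies on the order of entries in the archive. Here is a minimal sketch of selecting the CSV by name instead. The member name `diabetic_data.csv` and the helper `read_csv_from_zip` are assumptions for illustration, and the archive is a tiny in-memory stand-in rather than the UCI download.

```python
import zipfile
from io import BytesIO

import pandas as pd


def read_csv_from_zip(zip_bytes, name_suffix):
    """Read the first archive member whose name ends with name_suffix (hypothetical helper)."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        matches = [n for n in archive.namelist() if n.endswith(name_suffix)]
        if not matches:
            raise FileNotFoundError(f"no member ending with {name_suffix!r}")
        with archive.open(matches[0]) as member:
            return pd.read_csv(member)


# Build a small stand-in archive with two CSVs (made-up contents).
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("dataset_diabetes/IDs_mapping.csv", "admission_type_id,description\n1,Emergency\n")
    archive.writestr("dataset_diabetes/diabetic_data.csv", "encounter_id,readmitted\n1,NO\n2,<30\n")

demo_csv = read_csv_from_zip(buffer.getvalue(), "diabetic_data.csv")
print(demo_csv)
```

With the real download, the bytes from `zip_file_load.read()` could be passed in place of `buffer.getvalue()`.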

rachelywong commented 3 years ago

Hi! Please take a look at the current EDA report from my fork (group29-rachel/reports/EDA_initial.ipynb) and let me know your thoughts on:

1. Which features should we discard, if any? (I am thinking encounter_id, patient_nbr, maybe some others)
2. How should we handle features with many missing values (e.g. weight)?
3. We need to fix the target column (readmitted vs. not readmitted)

jraza19 commented 3 years ago

Hi Rachel, please see below for my response to part 3. I think this code works to fix the target column:

pattern = r'[<>]30'
diabetes_csv["readmitted"] = diabetes_csv["readmitted"].str.replace(pattern,"YES",regex = True)
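
A quick check of that replacement on a toy column, as a sketch. The values "NO", "<30" and ">30" are assumed to be the only categories in `readmitted`, and the toy frame is made up.

```python
import pandas as pd

# Toy stand-in for the readmitted column (assumed categories, made-up rows).
toy = pd.DataFrame({"readmitted": ["NO", "<30", ">30", "NO", ">30"]})

# "<30" and ">30" both mean the patient was readmitted, so collapse them to "YES".
pattern = r"[<>]30"
toy["readmitted"] = toy["readmitted"].str.replace(pattern, "YES", regex=True)

print(toy["readmitted"].value_counts())
```

After the replacement the column should hold only "YES" and "NO", which gives a binary target.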

rachelywong commented 3 years ago

For 1. Which features should we discard if any?

Link to feature descriptions: https://www.hindawi.com/journals/bmri/2014/781670/tab1/

I have looked through the features and I think these ones would be useful to drop:

encounter_id (useless)
patient_nbr (useless)
weight (not enough data, 97% missing)
payer_code (useless, 52% missing)
medical_specialty (useless, 53% missing)

Let me know your thoughts :)
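
A rough sketch of how the missing percentages could be checked before dropping. It assumes missing values are encoded as "?" in the raw file, and the toy rows are made up, so the percentages below are not the real ones.

```python
import pandas as pd

# Made-up rows; "?" is assumed to be the missing-value marker.
toy = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4],
    "weight": ["?", "?", "?", "[75-100)"],
    "payer_code": ["MC", "?", "?", "HM"],
    "num_medications": [10, 5, 7, 12],
})

# Share of "?" entries per column, as a percentage.
missing_pct = toy.eq("?").mean() * 100
print(missing_pct.round(1))

# Drop the identifier and the mostly-missing columns.
drop_cols = ["encounter_id", "weight", "payer_code"]
reduced = toy.drop(columns=drop_cols)
print(reduced.columns.tolist())
```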

sukh2929 commented 3 years ago

I agree, Rachel, that we can drop the features you mentioned. We can also drop the "race" feature (to avoid racial bias).

rachelywong commented 3 years ago

I was thinking about race, but maybe race might be good to keep in because of potential racial bias or biological characteristics of different races affecting patient readmission. For example, if it was more likely that Asian patients were readmitted more often than Hispanic patients, there could be underlying reasons for this.

rachelywong commented 3 years ago

As agreed upon during our zoom meeting in class:

- Split data into training and testing before further exploration @Rachel
- Repeated histograms comparing one numerical variable on the x-axis to another variable @Javairia
- Overlapping feature histograms to compare against the target (balanced dataset)
- Look at the correlation of the variables against targets in point correlation matrix @Sukhdeep
- Scatterplot @Sukhdeep
- Look at how race affects distribution of readmission, decide if we should drop race column or not @Javairia
- Remember to save png files
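
For the first item, a minimal sketch of a split using only pandas. The 80/20 ratio, the seed and the toy frame are assumptions, and a library splitter such as scikit-learn's could be swapped in.

```python
import pandas as pd

# Made-up rows standing in for the cleaned dataset.
toy = pd.DataFrame({"time_in_hospital": range(10), "readmitted": ["NO", "YES"] * 5})

# Assumed 80/20 split with a fixed seed so the split is reproducible.
train_df = toy.sample(frac=0.8, random_state=123)
# Everything not sampled into the training set becomes the test set.
test_df = toy.drop(train_df.index)

print(len(train_df), len(test_df))
```

Doing the EDA on `train_df` only would keep the test rows unseen until evaluation.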

rachelywong commented 3 years ago

Just noticed that examide and citoglipton had 100% of responses as NO. I'll remove these features from our analysis, since they won't be helpful in answering our question.
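
One way to spot columns like these automatically, sketched on made-up rows: a column with a single distinct value carries no information for prediction. The `insulin` values here are placeholders.

```python
import pandas as pd

# Made-up rows; examide and citoglipton only ever take one value.
toy = pd.DataFrame({
    "examide": ["No", "No", "No"],
    "citoglipton": ["No", "No", "No"],
    "insulin": ["No", "Up", "Steady"],
})

# A column is constant if it has at most one distinct value (NaN counted as a value).
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
```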

jraza19 commented 3 years ago

I have been rethinking whether or not to keep the race column. As you can see from below, there are far more patients of one race than of the others, but the distributions of yes/no across the different races are very similar. I think race can be removed because, in my experience, the population that accesses a hospital most often depends on the location of the hospital, which will not help us with our prediction. Let me know your thoughts about this. @rachelywong @sukh2929 @wiwang
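
The same comparison can be done numerically, as a sketch. The group labels "A" and "B" and the rows are placeholders, not real data: count rows per group, then look at the share of YES/NO within each group.

```python
import pandas as pd

# Placeholder groups and made-up rows.
toy = pd.DataFrame({
    "race": ["A", "A", "A", "A", "B", "B"],
    "readmitted": ["YES", "NO", "YES", "NO", "YES", "NO"],
})

# How many rows each group contributes (shows the imbalance between groups).
counts = toy["race"].value_counts()
# Share of YES/NO within each group; each row of the table sums to 1.
proportions = pd.crosstab(toy["race"], toy["readmitted"], normalize="index")

print(counts)
print(proportions)
```

If the rows of `proportions` look alike across groups, that supports the point that the column adds little for predicting readmission.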

jraza19 commented 3 years ago

Even after dropping all the columns that Rachel suggested, we still have 29 variables, so graphs comparing the numerical columns against each other will be very large. How would we like to eliminate some? Perhaps the paper or the correlation matrix will provide guidance?

rachelywong commented 3 years ago

I like this a lot! Especially with the information from the plots you made. I think it'd be super helpful to add those plots in and I can update the data wrangling section to remove race. I'll also move the splitting data into training and testing to the end after the EDA we do.

I'll also try to read the paper and find some information on race to add to the background info.

rachelywong commented 3 years ago

I've looked through the features and the ones I've left in seem to be important in helping us determine readmission. I think for EDA we could focus on certain features instead for correlation.

This site helped me a lot with looking at if features were important or not: https://www.hindawi.com/journals/bmri/2014/781670/tab1/

Here's some information I've derived from the correlation plots with the features we have so far from Pandas Profiling:

[Screenshot: correlation plots from Pandas Profiling]

High positive correlation between:

num_procedures and time_in_hospital
num_lab_procedures and time_in_hospital
num_medications and time_in_hospital (most correlated)
num_medications and num_lab_procedures
num_medications and num_procedures
number_diagnoses and time_in_hospital
number_diagnoses and num_medications
number_inpatient and number_emergency

High negative correlation between:

num_lab_procedures and admission_type_id
num_procedures and admission_source_id
number_diagnoses and admission_type_id

This makes sense just using common knowledge but maybe we could write notes about this in our EDA and possibly perform some plots on them?

sukh2929 commented 3 years ago

I have noticed that the observations are not independent, as there are multiple rows per patient ID. The paper recommends keeping only the first row when a patient appears more than once, but we can discuss this on a group call. The number of rows decreased from 100k to 70k after keeping the unique rows. It would decrease further after dropping NAs.
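
A minimal sketch of keeping the first row per patient, on made-up rows. It assumes `patient_nbr` is the patient ID column and that the rows are already in encounter order, so "first" means the earliest encounter.

```python
import pandas as pd

# Made-up encounters; patients 100 and 300 each appear twice.
toy = pd.DataFrame({
    "encounter_id": [11, 12, 13, 14, 15],
    "patient_nbr": [100, 100, 200, 300, 300],
    "readmitted": ["YES", "NO", "NO", "YES", "YES"],
})

# Keep only the first row seen for each patient.
first_encounters = toy.drop_duplicates(subset="patient_nbr", keep="first")

print(len(toy), "->", len(first_encounters))
```

This has to run before `patient_nbr` is dropped as a feature, since the deduplication needs the ID.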

jraza19 commented 3 years ago

Amazing catch Sukhdeep! This greatly affects the value of our data! I am going to recheck my graphs following what you described.

sukh2929 commented 3 years ago

I found this https://datascience.stackexchange.com/questions/24452/in-supervised-learning-why-is-it-bad-to-have-correlated-features

Does this mean we have to drop columns which are highly correlated?

rachelywong commented 3 years ago

I don't think the ones I mentioned are actually highly correlated. They're just the most correlated out of all of the features, but if you look at the correlation matrix legend, they're all about 0.5 correlated or less, so I think it's fine to keep them in.
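
To check this against numbers rather than the plot legend, here is a sketch that lists every pair of numerical columns whose absolute Pearson correlation exceeds a cutoff. The toy values are made up, and 0.5 is used only as a rule-of-thumb threshold.

```python
import numpy as np
import pandas as pd

# Made-up values for three of the numerical columns.
toy = pd.DataFrame({
    "time_in_hospital": [1, 2, 3, 4, 5, 6],
    "num_medications": [5, 9, 12, 18, 20, 26],
    "number_emergency": [0, 1, 0, 1, 0, 1],
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()

threshold = 0.5  # rule-of-thumb cutoff, not a fixed standard
strong_pairs = pairs[pairs.abs() > threshold]

print(strong_pairs)
```

An empty `strong_pairs` on the real data would back up keeping all of these columns.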

sukh2929 commented 3 years ago

I have tried to find the relation of categorical data and numerical data with the target column.

sukh2929 commented 3 years ago

For comparing the numerical columns:

import altair as alt

# Allow Altair to plot more rows than its default limit
alt.data_transformers.disable_max_rows()

numeric_cols = [
    "num_procedures", "time_in_hospital", "num_lab_procedures", "num_medications",
    "number_diagnoses", "number_inpatient", "number_emergency",
]

# Scatterplot matrix of every numerical column against every other
# (df is assumed to be the diabetes DataFrame loaded earlier)
alt.Chart(df).mark_point(size=4).encode(
    alt.X(alt.repeat("column"), type="quantitative", scale=alt.Scale(zero=False)),
    alt.Y(alt.repeat("row"), type="quantitative", scale=alt.Scale(zero=False)),
).properties(height=100, width=100).repeat(
    row=numeric_cols, column=numeric_cols
).configure_axis(labels=False)
