Closed jraza19 closed 3 years ago
We have lots of categorical features and numerical features in our diabetes dataset.
To extract only the first CSV (diabetes_csv) from the zip file, we can use this code:
# Packages necessary for importing data (from a zip file containing 2 dataset CSVs)
import zipfile
from urllib.request import urlopen
from io import BytesIO
import pandas as pd  # needed for pd.read_csv below
zip_file_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00296/dataset_diabetes.zip"
zip_file_load = urlopen(zip_file_url)
zipinmemory = BytesIO(zip_file_load.read())
zip_file = zipfile.ZipFile(zipinmemory)
# Only load the first file in the zip folder
diabetes_csv = pd.read_csv(zip_file.open(zip_file.namelist()[0]))
print(diabetes_csv.head())
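Relying on `namelist()[0]` depends on the order of files inside the archive. A minimal sketch of picking the CSV by name instead is below. The helper name `read_csv_from_zip`, the member names, and the toy CSV contents are assumptions for illustration only, and the zip is built in memory so nothing is downloaded.

```python
import zipfile
from io import BytesIO

import pandas as pd


def read_csv_from_zip(zip_bytes: bytes, name_suffix: str) -> pd.DataFrame:
    """Read the first archive member whose name ends with name_suffix (hypothetical helper)."""
    with zipfile.ZipFile(BytesIO(zip_bytes)) as archive:
        matches = [n for n in archive.namelist() if n.endswith(name_suffix)]
        if not matches:
            raise FileNotFoundError(f"no member ending in {name_suffix!r}")
        with archive.open(matches[0]) as member:
            return pd.read_csv(member)


# Build a tiny in-memory zip with two made-up CSVs so the sketch runs offline.
buffer = BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo/IDs_mapping.csv", "admission_type_id,description\n1,Emergency\n")
    archive.writestr("demo/diabetic_data.csv", "encounter_id,readmitted\n1,NO\n2,<30\n")

demo_df = read_csv_from_zip(buffer.getvalue(), "diabetic_data.csv")
print(demo_df)
```

With the real download, the bytes from `zip_file_load.read()` could be passed in place of `buffer.getvalue()`, after checking `namelist()` for the actual file names.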
Hi Rachel, please see below for my response to part 3. I think this code works to fix the target column:
pattern = r'[<>]30'
diabetes_csv["readmitted"] = diabetes_csv["readmitted"].str.replace(pattern,"YES",regex = True)
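To check what this replacement does, here is a small sketch on a toy column. It assumes `readmitted` holds the values "NO", "<30" and ">30"; the toy DataFrame is made up and is not the real data.

```python
import pandas as pd

# Toy stand-in for the target column (assumed values: "NO", "<30", ">30").
toy = pd.DataFrame({"readmitted": ["NO", "<30", ">30", "NO"]})

# "<30" and ">30" both match the character class and collapse into "YES".
pattern = r"[<>]30"
toy["readmitted"] = toy["readmitted"].str.replace(pattern, "YES", regex=True)

print(toy["readmitted"].value_counts())
```

Rows that are already "NO" do not match the pattern, so they are left unchanged and the target becomes binary.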
Hi! Please take a look at the current EDA report from my fork (group29-rachel/reports/EDA_initial.ipynb) and let me know your thoughts on:
1. Which features should we discard, if any? (I am thinking encounter_id, patient_nbr, maybe some others)
2. How should we handle features with many missing values (e.g. weight)?
3. We need to fix the target column (readmitted vs. not readmitted)
For 1. Which features should we discard if any?
Link to feature descriptions: https://www.hindawi.com/journals/bmri/2014/781670/tab1/
I have looked through the features and I think these ones would be useful to drop:
- encounter_id (identifier only, not useful for prediction)
- patient_nbr (identifier only, not useful for prediction)
- weight (not enough data, 97% missing)
- payer_code (not useful, 52% missing)
- medical_specialty (not useful, 53% missing)
Let me know your thoughts :)
Just noticed that examide and citoglipton had 100% of responses as NO. I'll remove these features from our analysis, since they won't be helpful in answering our question.
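The checks behind this list (share of missing values, columns with only one value) can be sketched as follows. The toy CSV is made up, and treating "?" as the missing-value marker is an assumption about how the raw file encodes missing entries.

```python
from io import StringIO

import pandas as pd

# Made-up rows that mimic a few of the columns discussed above.
raw = StringIO(
    "encounter_id,patient_nbr,weight,examide,num_medications\n"
    "1,10,?,No,5\n"
    "2,10,?,No,8\n"
    "3,11,?,No,3\n"
    "4,12,[75-100),No,12\n"
)

# Assumption: "?" marks a missing value, so read it as NaN.
toy = pd.read_csv(raw, na_values="?")

# Fraction of missing values per column.
missing_share = toy.isna().mean()

# Columns with a single distinct value cannot help distinguish the target.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

cols_to_drop = ["encounter_id", "patient_nbr", "weight"] + constant_cols
reduced = toy.drop(columns=cols_to_drop)

print(missing_share)
print(reduced.columns.tolist())
```

On the real data the same two checks should flag weight, payer_code and medical_specialty by missing share, and examide and citoglipton as constant, if the percentages quoted above hold.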
I agree, Rachel, that we can drop the features you mentioned. We could also drop the "race" feature to avoid racial bias.
I was thinking about race, but it might be good to keep it in, because potential racial bias or biological characteristics of different races could affect patient readmission. For example, if Asian patients were readmitted more often than Hispanic patients, there could be underlying reasons for this.
I have been rethinking whether or not to keep the race column. As you can see from below, there are far more patients of one race than of the others, but the distributions of yes/no across the different races are very similar. I think race can be removed because, in my experience, the population that accesses a hospital more often depends on the location of the hospital, which will not help us with our prediction. Let me know your thoughts about this. @rachelywong @sukh2929 @wiwang
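A rough way to put numbers on this comparison is a row-normalized crosstab. This is only a sketch on made-up rows with placeholder group labels; it assumes the target has already been recoded to "YES"/"NO".

```python
import pandas as pd

# Made-up rows; "group_a"/"group_b" are placeholders, not real categories.
toy = pd.DataFrame({
    "race": ["group_a", "group_a", "group_a", "group_a", "group_b", "group_b"],
    "readmitted": ["YES", "NO", "YES", "NO", "YES", "NO"],
})

# How many rows each group contributes (shows the imbalance between groups).
counts = toy["race"].value_counts()

# Share of YES/NO within each group; each row sums to 1.
shares = pd.crosstab(toy["race"], toy["readmitted"], normalize="index")

print(counts)
print(shares)
```

If the rows of `shares` look alike on the real data while `counts` is very uneven, that matches the pattern described above.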
As agreed upon during our Zoom meeting in class:
- Split data into training and testing before further exploration @Rachel
- Repeated histograms comparing one numerical variable on the x-axis against another variable @Javairia
- Overlapping feature histograms to compare against the target (balanced dataset)
- Look at the correlation of the variables against the target in a point correlation matrix @sukhdeep
- Scatterplot @Sukhdeep
- Look at how race affects the distribution of readmission, and decide whether we should drop the race column @javairia
- Remember to save png files
Even after dropping all the columns that Rachel suggested, we still have 29 variables, so that is a lot of graphs comparing the numerical columns against each other. How would we like to eliminate some? Perhaps the paper or the correlation matrix will provide guidance?
I like this a lot, especially with the information from the plots you made. I think it'd be super helpful to add those plots in, and I can update the data wrangling section to remove race. I'll also move the train/test split to the end, after our EDA.
I'll also try to read the paper and find some information on race to add to the background info.
I have noticed that the observations are not independent, as there are multiple rows for each patient ID (patient_nbr). The paper recommends keeping only the first row for each patient when there are duplicates, but we can discuss this on a group call. After keeping only the unique rows, the number of rows decreased from 100k to 70k. It would decrease further after dropping NAs.
Amazing catch Sukhdeep! This greatly affects the value of our data! I am going to recheck my graphs following what you described.
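The de-duplication described above can be sketched with `drop_duplicates`. The toy rows are made up, and the sketch assumes the rows are already ordered so that the first row per patient is the one we want to keep.

```python
import pandas as pd

# Made-up encounters: patients 10 and 11 each appear twice.
toy = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4, 5],
    "patient_nbr": [10, 10, 11, 12, 11],
    "readmitted": ["NO", "YES", "NO", "YES", "NO"],
})

# Keep only the first row seen for each patient.
first_encounters = toy.drop_duplicates(subset="patient_nbr", keep="first")

print(len(toy), "->", len(first_encounters))
print(first_encounters)
```

If the file order is not guaranteed, sorting before de-duplicating would make the choice of "first" explicit, but which column to sort by is something we would need to agree on.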
I've looked through the features and the ones I've left in seem to be important in helping us determine readmission. I think for EDA we could focus on certain features instead for correlation.
This site helped me a lot in judging whether features were important or not: https://www.hindawi.com/journals/bmri/2014/781670/tab1/
Here's some information I've derived from the correlation plots with the features we have so far from Pandas Profiling:
High positive correlation between:
- num_procedures and time_in_hospital
- num_lab_procedures and time_in_hospital
- num_medications and time_in_hospital (most correlated)
- num_medications and num_lab_procedures
- num_medications and num_procedures
- number_diagnoses and time_in_hospital
- number_diagnoses and num_medications
- number_inpatient and number_emergency
High negative correlation between:
- num_lab_procedures and admission_type_id
- num_procedures and admission_source_id
- number_diagnoses and admission_type_id
This makes sense just using common knowledge but maybe we could write notes about this in our EDA and possibly perform some plots on them?
Does this mean we have to drop columns which are highly correlated?
I don't think the ones I mentioned are actually highly correlated. They're just the most correlated out of all of the features, but if you look at the correlation matrix legend, they all have a correlation of about 0.5 or less, so I think it's fine to keep them in.
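One way to check this without reading values off the plot legend is to list every pair of numeric columns by absolute correlation. This is a sketch on made-up numbers; the 0.5 `threshold` is only the rough figure mentioned above, not an agreed cut-off, and `DataFrame.corr()` defaults to Pearson, which may differ from what Pandas Profiling displayed.

```python
import numpy as np
import pandas as pd

# Made-up values for a few of the numeric columns.
toy = pd.DataFrame({
    "time_in_hospital": [1, 2, 3, 4, 5, 6],
    "num_medications": [3, 5, 4, 8, 9, 12],
    "num_procedures": [0, 1, 0, 2, 1, 3],
    "admission_type_id": [3, 1, 2, 1, 3, 1],
})

corr = toy.corr()

# Keep the upper triangle only, so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

threshold = 0.5  # assumed cut-off, taken from the discussion above
strong_pairs = pairs[pairs.abs() > threshold]

print(pairs)
print(strong_pairs)
```

On the real data, an empty or short `strong_pairs` would support keeping all of these columns.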
I have tried to find the relationship of the categorical and numerical data with the target column.
For comparing the numerical columns:
import altair as alt

# df: the diabetes DataFrame (assumed to be defined earlier)
alt.data_transformers.disable_max_rows()
numeric_cols = ["num_procedures", "time_in_hospital", "num_lab_procedures", "num_medications", "number_diagnoses", "number_inpatient", "number_emergency"]
alt.Chart(df).mark_point(size=4).encode(
    alt.X(alt.repeat("column"), type="quantitative", scale=alt.Scale(zero=False)),
    alt.Y(alt.repeat("row"), type="quantitative", scale=alt.Scale(zero=False)),
).properties(height=100, width=100).repeat(
    row=numeric_cols, column=numeric_cols
).configure_axis(labels=False)
Please provide your ideas for the EDA and the types of visualizations that would work well.