Open Ivyqiuhan opened 2 years ago
Hi Ivy! Thank you so much for reviewing our work. Really helpful feedback. Some of it was addressed already, some we look forward to addressing in the remaining milestones!
Thank you for your comments Ivy! I just wanted to let you know that we are unable to see the grade. Can you please let us know how we may view that?
The grade will be available soon on Canvas : )
Hi @Ivyqiuhan!
Thank you again for reviewing our repo and leaving some feedback.
We wanted to discuss some points, because I think a lot of your feedback is not quite applicable to the inference project.
"What about data visualization? What specifically are you going to do? Will you do a heat map with correlation to make sure that variables are not redundant?" What variables do you mean? We are not running a predictive model or anything like that; we are doing a simple hypothesis test. We do include a data visualization component, but it is fairly simple given the inferential nature of our question. Our hypothesis test depends on a single variable (the sentence length) for each group, so correlation between variables does not come into play, and we do not need to look at any other variables in the data set. Does that make sense?
"What about class balance? There could be an imbalance in the classes, in which case you would have to undersample or oversample." We have a sufficiently large and representative sample on hand, 4000 indigenous vs 13000 non-indigenous inmates, as shown in the EDA. Since this is plenty for hypothesis testing and we are not running any machine learning algorithms, we did not need to tackle class imbalance.
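To illustrate why unequal group sizes are not a problem for this kind of test, here is a minimal sketch of a two-sample permutation test on the difference in mean sentence length. The thread does not say which test the project actually uses, so the permutation test, the function name `permutation_test`, and the synthetic data (group sizes scaled down to 40 and 130) are all assumptions for illustration only.

```python
import random


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Hypothetical helper for illustration; not taken from the project repo.
    Returns (observed difference in means, estimated p-value).
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b

    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random while keeping the group sizes fixed.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic sentence lengths (made-up numbers, NOT the project's data),
# with deliberately unequal group sizes.
data_rng = random.Random(42)
sample_small = [data_rng.gauss(36, 10) for _ in range(40)]
sample_large = [data_rng.gauss(30, 10) for _ in range(130)]

observed_diff, p_value = permutation_test(sample_small, sample_large)
print(f"difference in means: {observed_diff:.2f}, p-value: {p_value:.4f}")
```

The shuffle keeps each group at its original size, so the null distribution already accounts for the imbalance and no undersampling or oversampling is needed.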
"What about missing data? How will you handle the missing data?" As shown in the EDA, during data processing we dropped all rows without a value for sentence length.
"For these algorithms, what packages will you use?" The packages used are listed in the dependencies. We are not doing machine learning, so there is no modeling taking place.
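As a rough sketch of the missing-data step described above, dropping rows with no sentence length can be a single pandas call. The column names `group` and `sentence_length`, the helper name, and the toy rows are hypothetical placeholders, not the actual schema or code from the repo.

```python
import pandas as pd


def drop_missing_sentence_length(df, column="sentence_length"):
    """Return a copy of df without rows where `column` is missing.

    `column` is a hypothetical name; substitute the real one from the data set.
    """
    return df.dropna(subset=[column]).reset_index(drop=True)


# Toy data for illustration only.
toy = pd.DataFrame(
    {
        "group": ["A", "B", "A", "B", "A"],
        "sentence_length": [12.0, None, 30.0, None, 7.5],
    }
)

clean = drop_missing_sentence_length(toy)
print(f"kept {len(clean)} of {len(toy)} rows")
```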
"We may deal with this problem when we build the pipeline and model after we see significant drawback." You should definitely balance the sample. I suggest generating synthetic data for oversampling. Addressed above, in our reply on class balance.
_this line can't run: df_init = pd.read_csv('../data/offender_profile.csv', sep=r'\s,\s', header=0, encoding='ascii', engine='python')_ Are you looking at the release? I think this was already addressed: I can't find this line in the repo, so you may have looked at an old version of the EDA file somehow. I could be wrong, but I just can't find this issue.
Overall, would you please kindly take another look at our repo? I feel that perhaps you looked at it through the lens of expecting a pipeline and machine learning, when we are doing very basic hypothesis testing, and we feel that our grade was affected by this slight misunderstanding.
Cheers
Rada/Chaoran/Mukund/Kyle
Nice job! Here are some comments and your grades for the first milestone. Please address these concerns in your third milestone submission.
Draft a Team work contract: Correctness
Project set-up: Mechanics
A script that downloads the data: Accuracy
A script that downloads the data: Quality
Exploratory data analysis in a literate code document: Quality