Milestone 1 Review

Ivyqiuhan commented 2 years ago

Nice job! Here are some comments and your grades for the first milestone. Please address these concerns in your third milestone submission.

Draft a Team work contract: Correctness
Project set-up: Mechanics
- This line can't run: df_init = pd.read_csv('../data/offender_profile.csv', sep=r'\s,\s', header=0, encoding='ascii', engine='python'), because your file is actually at '../data/RAW/offender_profile.csv'.

I can't run your code; it raises KeyError: 'Sentence Type' at cell 10.
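One way to surface both problems early is to check the path and the expected columns before any analysis runs. The sketch below assumes pandas is installed. The helper `load_offender_profile` and the demo columns `Offender ID` and `Sentence Length` are hypothetical; the path, the `read_csv` arguments, and the `Sentence Type` column come from the comments above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Path taken from the review comment; adjust it to the actual repository layout.
RAW_DATA_PATH = Path("../data/RAW/offender_profile.csv")


def load_offender_profile(path, required_columns=("Sentence Type",)):
    """Read the CSV and fail early with a readable message if something is missing.

    Hypothetical helper, not part of the reviewed repository.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected the raw data at {path.resolve()}")
    # Same parsing arguments as the line quoted in the review.
    df = pd.read_csv(path, sep=r"\s,\s", header=0, encoding="ascii", engine="python")
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise KeyError(f"Missing columns {missing}; found {list(df.columns)}")
    return df


# Demonstration on a tiny made-up file (column names other than
# 'Sentence Type' are invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "offender_profile.csv"
    demo_path.write_text(
        "Offender ID , Sentence Type , Sentence Length\n"
        "1 , Determinate , 730\n"
        "2 , Indeterminate , 1460\n",
        encoding="ascii",
    )
    demo_df = load_offender_profile(demo_path)

print(list(demo_df.columns))
```

In a notebook, calling the helper with `RAW_DATA_PATH` in the first cell would turn a late KeyError into an immediate, self-explanatory failure.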

Project proposal: reasoning

What about data visualization? What specifically are you going to do? Will you do a heat map with correlation to make sure that variables are not redundant?
What about class balance? There could be an imbalance in the classes, in which case you would have to undersample or oversample.
What about missing data? How will you handle it?
For these algorithms, which packages will you use?

A script that downloads the data: Accuracy
A script that downloads the data: Quality
Exploratory data analysis in a literate code document: QUALITY

"We may deal with this problem when we build the pipeline and model after we see significant drawback." You should definitely balance the sample. I suggest generating synthetic data for oversampling.

Exploratory data analysis in a literate code document: VIZ

If figure captions are not provided, the plot should be clearly explained in the text. I would recommend using figure captions.

Exploratory data analysis in a literate code document: REASONING

You need to add some explanation of the plots and your code.

Exploratory data analysis in a literate code document: ACCURACY
Expectations: Mechanics

Radascript commented 2 years ago

Hi Ivy! Thank you so much for reviewing our work. Really helpful feedback. Some of it was addressed already, some we look forward to addressing in the remaining milestones!

miyer26 commented 2 years ago

Thank you for your comments Ivy! I just wanted to let you know that we are unable to see the grade. Can you please let us know how we may view that?

Ivyqiuhan commented 2 years ago

_Thank you for your comments Ivy! I just wanted to let you know that we are unable to see the grade. Can you please let us know how we may view that?_

The grade will be available soon on Canvas : )

Radascript commented 2 years ago

Hi @Ivyqiuhan!

Thank you again for reviewing our repo and leaving some feedback.

We wanted to discuss some points, because we think a lot of your feedback is not quite applicable to an inference project.

_What about data visualization? What specifically are you going to do? Will you do a heat map with correlation to make sure that variables are not redundant?_

Which variables do you mean? We are not running a predictive model or anything like that; we are doing a simple hypothesis test. We do include a data visualization component, but it is fairly simple because our question is inferential. Our hypothesis test depends on a single variable (the sentence length) in each group, so correlation between variables does not come into it, and we do not need to look at any other variables in the data set. Does that make sense?

_What about class balance, there could be an imbalance in the classes, in which you would have to under sample or oversample._

We have a representative enough sample on hand: 4000 Indigenous vs 13000 non-Indigenous inmates, as made clear in the EDA. Since this is plenty for hypothesis testing and we are not running any machine learning algorithms, we did not need to tackle class imbalance.
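To illustrate the point about group sizes: a two-sample test does not need balanced groups. Below is a minimal sketch of a permutation test for a difference in means, using only the Python standard library. The data are simulated, the group sizes (40 and 130) only mimic the ratio mentioned above, and both the function name `permutation_test` and the choice of test statistic are assumptions; the thread does not say which test the team actually ran.

```python
import random
from statistics import fmean


def permutation_test(group_a, group_b, n_resamples=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and an approximate p-value.
    Illustrative helper; not the team's implementation.
    """
    rng = random.Random(seed)
    observed = fmean(group_a) - fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffle the pooled values and split at the original sizes.
        rng.shuffle(pooled)
        diff = fmean(pooled[:n_a]) - fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Simulated, unequal-sized groups (values are made up for illustration).
data_rng = random.Random(1)
group_a = [data_rng.gauss(5.0, 1.0) for _ in range(40)]
group_b = [data_rng.gauss(3.0, 1.0) for _ in range(130)]

observed, p_value = permutation_test(group_a, group_b)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

The split at `n_a` preserves the original group sizes in every resample, which is why the imbalance itself does not invalidate the test.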

_What about missing data, how will you handle the missing data?_

As shown in the EDA, during data processing we dropped all rows without a value for sentence length.

_For these algorithms, what packages will you use?_

The packages we used are listed in the dependencies. We are not doing machine learning, so there is no modeling taking place.
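For readers who want to see what the row-dropping step amounts to, here is a minimal pandas sketch. The column names `group` and `sentence_length` and the rows themselves are made up; the real data set's column names may differ.

```python
import pandas as pd

# Made-up rows; column names are assumptions, not the real schema.
df = pd.DataFrame(
    {
        "group": ["indigenous", "non-indigenous", "indigenous", "non-indigenous"],
        "sentence_length": [730.0, None, 1460.0, 365.0],
    }
)

n_before = len(df)
# Keep only rows that have a sentence length; other columns may still contain gaps.
clean = df.dropna(subset=["sentence_length"]).reset_index(drop=True)
n_dropped = n_before - len(clean)

print(f"Dropped {n_dropped} of {n_before} rows without a sentence length")
print(clean.groupby("group")["sentence_length"].count())
```

Reporting how many rows were dropped, and how many remain per group, makes it easy for a reviewer to confirm that the remaining sample is still large enough for the test.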

_"We may deal with this problem when we build the pipeline and model after we see significant drawback." You should definitely balance the sample. I suggest generating synthetic data for oversampling._

Addressed above.

_this line can't run: df_init = pd.read_csv('../data/offender_profile.csv', sep=r'\s,\s', header=0, encoding='ascii', engine='python')_

Are you looking at the release? I think this was already addressed: I can't find this line in the repo, so you may have looked at an old version of the EDA file somehow. I could be wrong, but I just can't find this issue.

Overall, would you please take another look at our repo? We feel that you may have reviewed it expecting more of a pipeline and machine learning, when we are doing a very basic hypothesis test, and that our grade was affected by this slight misunderstanding.

Cheers

Rada/Chaoran/Mukund/Kyle

UBC-MDS / inference_on_indigenous_vs_non_indigenous_sentence_length_differences

Milestone 1 Review #20