UBC-MDS / data-analysis-review-2021

Submission: Group 5: Inference sentence length #7

Open Radascript opened 2 years ago

Radascript commented 2 years ago

Submitting authors: @Radascript, @AraiYuno, @miyer26, @showcy

Repository: https://github.com/UBC-MDS/DSCI_522_inference_on_indigenous_vs_non_indigenous_sentence_length_differences

Report link: https://htmlpreview.github.io/?https://github.com/UBC-MDS/DSCI_522_inference_on_indigenous_vs_non_indigenous_sentence_length_differences/blob/main/doc/sentence_length_diffs_inference_report.html

Abstract/executive summary: For this project we carried out a hypothesis test to determine whether there was a significant difference in median sentence length between indigenous and non-indigenous offenders under the Correctional Service of Canada. The median was selected as the measure of central tendency, and a permutation test under the null model was carried out computationally with a significance level of 0.05. The null hypothesis stated that there was no difference in the population medians in sentence length between indigenous and non-indigenous offenders. The alternative hypothesis stated that there is a difference in the population medians in sentence length between indigenous and non-indigenous offenders. The resulting sample difference in the two medians was -56 days, with a corresponding p-value of 0.0328. The indigenous group was found to have shorter sentence lengths than the non-indigenous group. As this p-value was smaller than the significance level, there was statistically significant evidence to reject the null hypothesis of no difference in median sentence length between the two groups. As we had a large sample size for both groups, our model was very sensitive to small differences in the median of both groups. Though this may raise some concern regarding the practical implications of the study, we believed it was important not to miss any existing effect due to the sensitivity of the issue at hand. The cost of a Type II error is more significant than a Type I error.

The data set used for this study is the Offender Profile from 2017-2018 released by the Correctional Service of Canada. The link to this site can be found here. Each entry in the data set corresponds to a single offender serving a sentence of two or more years. Demographic details such as age, gender and marital status at year end are provided for each entry. The data was retrieved from the Offender Management System (OMS).
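
For readers unfamiliar with the approach, the permutation test on a difference in medians described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function name, the default number of repeats, the seed and the sample values are all assumptions made for the example.

```python
import random
from statistics import median


def permutation_test_median_diff(group_a, group_b, n_repeats=2000, seed=522):
    """Two-sided permutation test for a difference in medians (a minus b).

    Hypothetical helper for illustration; not taken from the project code.
    """
    rng = random.Random(seed)
    observed = median(group_a) - median(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_repeats):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffle the pooled values and split them at the original size.
        rng.shuffle(pooled)
        diff = median(pooled[:n_a]) - median(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Share of shuffled differences at least as extreme as the observed one.
    return observed, extreme / n_repeats


# Made-up sentence lengths in days, for illustration only.
indigenous = [730, 760, 800, 900, 1000, 1100, 1200]
non_indigenous = [730, 800, 900, 1000, 1100, 1300, 1500, 2000]
observed_diff, p_value = permutation_test_median_diff(indigenous, non_indigenous)
```

The null hypothesis would be rejected at the 0.05 level when `p_value` falls below 0.05.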

Editor: @Radascript, @AraiYuno, @miyer26, @showcy Reviewer: Nagraj Rao, TZ Yan, Abhiket Gaurav, Adrianne Leung

adrianne-l commented 2 years ago

Data analysis review checklist

Reviewer: @adrianne-l

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

ytz commented 2 years ago

Data analysis review checklist

Reviewer: @ytz

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

Interesting topic!

Some slight nitpicking and suggestions on the following:

nrao944 commented 2 years ago

Data analysis review checklist

Reviewer: Nagraj Rao

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Excellent work team! Your topic is crucial from a policy perspective, and I particularly enjoyed the implementation of hypothesis testing to answer your research question. You did an excellent job with your writing and the flow was seamless and made it an easy read!

My feedback is intended to catapult your work from an A+ to an A++. Not all comments may be applicable given the limitations of your dataset (or time), but I figured they are worth mentioning.

  1. Your results show that the median difference is -56. Based on your test statistic, this implies that the median number of days spent in jail by the indigenous group is lower than that of the non-indigenous group. This is counter to what most people would expect. Despite the use of medians, this suggests that outliers continue to be a problem (for example, when I trimmed the data to cases below 5000 I got the opposite result, and the difference narrowed as I lowered the cutoff). Can you dive deeper into this?
  2. I noticed that your data has a discontinuity in aggregate incarcerations (it jumps from 0 to 730, with no values in between). Are these true 0s, or NULLs which StatsCan has reported as 0? How does excluding the 0s change your analysis?
  3. It might be useful to explain what Type I and Type II errors mean within the context of your study, before providing your excellent explanation of how they may affect the results of your analysis and concluding that: “The cost of a Type II error is more significant than a Type I error.”
  4. I suggest zooming into the boxplot by trimming the outliers. If you restricted the box plot to values below 3,000, you may be able to visualize the strong statistically significant difference you find in your formal hypothesis test. I suggest factoring in points 1 and 2 above when you present the final box plot.
  5. Your alternative hypothesis in the report should say "not equal"; otherwise it exactly matches the null hypothesis.
  6. Does the large difference in sample size between the two groups bear any consequences for the test results? Is class imbalance a problem?
  7. Is it possible to provide dimensions of the data (total number of observations) for each of the groups in the README and the Data Section of the Report? It is noted that you have this available in your discussion.
  8. Based on the code provided, your download script indicates that the file should be saved in the raw folder under data. However, I do not see a raw (or processed) folder under data right now, and as a consequence no data either. Can you check whether the script is working as intended?
  9. (MINOR): In your report, the number of repeats appears as the literal N_REPEATS rather than as a number.
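
One way to follow up on points 1, 2 and 4 is to recompute the difference in medians under several trimming cutoffs, with and without the zero values. The sketch below is only an assumption about how such a sensitivity check could look: the helper name, the cutoffs and the sample values are invented for illustration and are not taken from the project.

```python
from statistics import median


def median_diff(group_a, group_b, upper=None, drop_zeros=False):
    """Difference in medians (a minus b) after optional trimming.

    Hypothetical helper: keeps values strictly below `upper` (if given)
    and optionally drops exact zeros before taking the medians.
    Raises statistics.StatisticsError if a group ends up empty.
    """
    def keep(value):
        if drop_zeros and value == 0:
            return False
        return upper is None or value < upper

    trimmed_a = [v for v in group_a if keep(v)]
    trimmed_b = [v for v in group_b if keep(v)]
    return median(trimmed_a) - median(trimmed_b)


# Made-up sentence lengths in days, for illustration only.
indigenous = [0, 0, 730, 800, 900, 1200, 9000]
non_indigenous = [0, 730, 850, 1000, 1500, 4000, 12000]

for cutoff in (None, 5000, 3000):
    for drop_zeros in (False, True):
        diff = median_diff(indigenous, non_indigenous, cutoff, drop_zeros)
        print(f"cutoff={cutoff}, drop_zeros={drop_zeros}: diff={diff}")
```

If the sign or size of the difference changes materially across these settings, that would be worth reporting alongside the main result.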

abhiket commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Wow!! Quite an interesting topic has been picked here and I am sure this is not just a theoretical project but has quite a lot of practical applications in policy making etc.

Here is my feedback on the work. Please keep in mind that it is only meant to make the work more thorough, so I may be nitpicking here and there; feel free to implement or ignore these points:

  1. The project proposal has a question mark. If the proposal is in a questioning tone, maybe it needs rewording.
  2. “As we had a large sample size for both groups, our model was very sensitive to small differences in the median of both groups.” This seems counter-intuitive.
  3. The sample includes only offenders serving a sentence of two or more years. This is not representative of the overall population, and hence I feel the project title should be changed accordingly.
  4. “meaning we had a large enough sample size to carry out a t-test.” If it is a t-test, then the test statistic should be a t-statistic.
  5. We can see that the number of non-indigenous offenders is far larger than the number of indigenous offenders. Can we use sampling techniques to address this imbalance?
  6. Is the data skewed, or is this just the nature of the data? The number of people getting life sentences would be very low compared to the number getting short sentences. This reason should be mentioned somewhere. Also, we should do some outlier treatment and look at the data at various sample sizes. Does the hypothesis hold, or is it reversed?
  7. Fig 1: The y-axis has no meaning, hence it can be removed to make the graph clearer.
  8. The box plot in Fig 2 does not show confidence intervals. The box edges are the 75th and 25th percentiles.
  9. Fig 3: The distribution intuitively does not look like a normal distribution. Plotting the “most likely function” would be a better visualization.
  10. Also, since the data is skewed, shouldn’t the null hypothesis be one-sided?
  11. To substantiate the finding, we should have incorporated metrics like the power of the test.
  12. The EDA could have been more explanatory, for example by looking at the correlations between different variables, their relationship with the target, and any interactions between them.
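
To make the distinction in point 8 concrete, the sketch below contrasts the quartiles that a box plot draws with a bootstrap percentile confidence interval for the median. It is a rough illustration under stated assumptions: the function names, the number of bootstrap resamples, the seed and the sample values are all hypothetical and not part of the project.

```python
import random
from statistics import median, quantiles


def iqr_bounds(values):
    """Return (Q1, Q3), the edges of the box in a box plot."""
    q1, _, q3 = quantiles(values, n=4)
    return q1, q3


def bootstrap_median_ci(values, n_boot=2000, level=0.95, seed=522):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(median(rng.choices(values, k=n)) for _ in range(n_boot))
    tail = (1 - level) / 2
    lower = medians[round(tail * n_boot)]
    upper = medians[round((1 - tail) * n_boot) - 1]
    return lower, upper


# Made-up values, for illustration only.
sample = list(range(1, 100))
q1, q3 = iqr_bounds(sample)
ci_low, ci_high = bootstrap_median_ci(sample)
```

The quartiles describe the spread of individual observations, while the bootstrap interval describes uncertainty about the median itself, so with a large sample the interval is typically much narrower than the box.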

Nevertheless, this is good work. Kudos to the team for all the efforts and hard work.

showcy commented 2 years ago

Thank you for all of your comments! We appreciated, agreed with, and implemented some of your comments.

About report

From: @ytz and @nrao944

  • Useful to briefly mention the number of data points on your data set, under the 'Data' subheading.
  • Is it possible to provide dimensions of the data (total number of observations) for each of the groups in the README and the Data Section of the Report? It is noted that you have this available in your discussion.

Our implementation:

From: @adrianne-l

  • In the report Results & Discussion section, would it be a good idea to include some sub-sections to summarise the interim findings to better navigate and follow your flow of result interpretation?

Our implementation:

From: @nrao944

  • Your alternative hypothesis in the report should insert "not" equal, else it exactly matches the null hypothesis.
  • In your report, the number of repeats, appears as N_REPEATS, and not a number.

Our implementation:

About data visualization

From: @ytz

  • Not 100% sure whether the use of 'confidence interval' is correct in "...we noted the large overlap in the confidence intervals between the two groups"
  • For Figure 2, consider using a log scale on the x-axis to make the box plots more prominent
  • Since the focus is on the indigenous group, you could use a monotone colour for the non-indigenous group, and a primary colour like red or blue for the indigenous group. That will make it easier for the reader to interpret the chart

Our implementation:

Review of Milestone 1 from TA @Ivyqiuhan

  • This line can't run: df_init = pd.read_csv('../data/offender_profile.csv', sep=r'\s,\s', header=0, encoding='ascii', engine='python') because your file is at this path '../data/RAW/offender_profile.csv'
  • I can't run your code; it raises KeyError: 'Sentence Type' at cell 10
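
Both problems above (a wrong file path and a missing column) can be caught before the analysis runs with a small pre-flight check. The sketch below is an assumption about how that might look, not the fix the authors applied: `check_csv` is a hypothetical helper, and the path and column name in the usage comment are the ones quoted in the review.

```python
import csv
from pathlib import Path


def check_csv(path, required_columns):
    """Return the required columns that are missing from the CSV header.

    Hypothetical helper for illustration. Raises FileNotFoundError with
    the resolved path if the file does not exist.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"No data file at {path.resolve()}")
    with path.open(newline="", encoding="ascii") as handle:
        header = next(csv.reader(handle), [])
    # Strip whitespace because the file appears to pad its delimiters.
    present = {name.strip() for name in header}
    return [col for col in required_columns if col not in present]


# Example usage (path and column name as quoted in the review):
# missing = check_csv("../data/RAW/offender_profile.csv", ["Sentence Type"])
# if missing:
#     raise KeyError(f"Missing expected columns: {missing}")
```

Failing early with the resolved path or the list of missing columns makes errors like the two reported here easier to diagnose than a failure partway through a notebook.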

Our implementation:

  • If figure captions are not provided the plot should be clearly explained in the text. I would recommend using figure captions.

Our implementation:

  • Need to add some explanation to the plot and your code

Our implementation:

Review of Milestone 2 from TA @Ivyqiuhan

  • You should create an environment.yaml file to contain all your dependencies

Our implementation:

  • In usage, should write how to run each of your scripts, not just "make all" and "make clean"

Our implementation: