UBC-MDS / inference_on_indigenous_vs_non_indigenous_sentence_length_differences

Feedback from the reviewers #26

Closed showcy closed 2 years ago

showcy commented 2 years ago

I am listing the reviewers' feedback here, grouped by topic. Let's discuss which points we agree with.

Reports:

  1. It would be useful to briefly state the number of data points in the data set under the 'Data' subheading. (Two reviewers mentioned this.)
  2. In the Results & Discussion section of the report, consider adding sub-sections that summarise the interim findings, so the reader can navigate and follow the flow of the interpretation more easily.
  3. The alternative hypothesis in the report should read "not equal"; as written, it is identical to the null hypothesis.
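
For point 3, the corrected pair of hypotheses could be written as below. This is only a sketch: the symbols $m_I$ and $m_N$ (median sentence length for the indigenous and non-indigenous groups) are my own notation, not the report's, and it assumes the test is two-sided on the difference in medians.

```latex
\begin{aligned}
H_0 &: m_I - m_N = 0 \\
H_A &: m_I - m_N \neq 0
\end{aligned}
```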

Data visualization:

  1. Figure 1 is rendered at a fairly low resolution in the report. Consider exporting it at a higher resolution so it looks sharper.
  2. For Figure 2, consider using a log scale on the x-axis to make the box plots more prominent.
  3. Not 100% sure whether 'confidence interval' is the correct term in "...we noted the large overlap in the confidence intervals between the two groups".
  4. Since the focus is on the indigenous group, you could use a muted, monotone colour for the non-indigenous group and a primary colour such as red or blue for the indigenous group. That would make the chart easier for the reader to interpret.
  5. Personally, I think it would be better to include some of the plots in the Introduction section of the report.
  6. I suggest zooming in on the box plot by trimming the outliers. If you restrict the box plot to values below 3k, you may be able to visualize the strongly statistically significant difference found in your formal hypothesis test. I suggest also factoring in points 1 and 2 above when you present the final box plot.

Dataset related problems:

  1. Your results show that the median difference is -56. Based on your test statistic, this implies that the median number of days spent in jail by the indigenous group is lower than that of the non-indigenous group. This is counter to what most people would expect. Despite the use of medians, it suggests that outliers continue to be a problem (for example, when I trimmed the data to cases below 5000 I got the opposite result, and the difference narrows as I lower the cutoff further). Can you dive deeper into this?
  2. I noticed that your data has a discontinuity in aggregate incarcerations (it jumps from 0 to 730, with no values in between). Are these true 0's, or NULLs that StatsCan has reported as 0? How does excluding the 0's change your analysis?
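
To follow up on both points, one option is a small sensitivity check that recomputes the median difference under different trimming cutoffs, with and without the 0's. The sketch below is only an illustration: `median_difference` is a hypothetical helper, the numbers are synthetic placeholders rather than our StatsCan data, and it assumes the statistic is median(indigenous) minus median(non-indigenous).

```python
import random
import statistics


def median_difference(indigenous, non_indigenous, max_days=None, drop_zeros=False):
    """Return median(indigenous) - median(non_indigenous) after optional filtering.

    max_days: keep only values strictly below this cutoff (None = no trimming).
    drop_zeros: if True, exclude values equal to 0 before taking medians.
    """
    def keep(value):
        if drop_zeros and value == 0:
            return False
        if max_days is not None and value >= max_days:
            return False
        return True

    a = [v for v in indigenous if keep(v)]
    b = [v for v in non_indigenous if keep(v)]
    return statistics.median(a) - statistics.median(b)


# Synthetic placeholder data (NOT the real data set): a block of 0's and
# then values starting at 730, mimicking the gap the reviewer describes.
rng = random.Random(0)
indigenous = [0] * 20 + [rng.randint(730, 6000) for _ in range(80)]
non_indigenous = [0] * 100 + [rng.randint(730, 9000) for _ in range(400)]

# Recompute the statistic for each combination of cutoff and zero handling.
for cutoff in (None, 5000, 3000):
    for drop_zeros in (False, True):
        diff = median_difference(indigenous, non_indigenous, cutoff, drop_zeros)
        print(f"cutoff={cutoff}, drop_zeros={drop_zeros}: median difference = {diff}")
```

If the sign or size of the difference changes a lot across these settings on the real data, that would support the reviewer's concern that the result is driven by outliers or by how the 0's are coded.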

The following points we seem to have already discussed or resolved:

  1. Does the large difference in sample size between the two groups have any consequences for the test results? Is class imbalance a problem? - We agreed it should not be a problem.
  2. It might be useful to explain what Type I and Type II errors mean within the context of your study, before providing your excellent explanation of how they may affect the results of your analysis and concluding that: "The cost of a Type II error is more significant than a Type I error." - We agreed to delete this part from the report.
  3. In your report, the number of repeats appears as N_REPEATS rather than as a number. - We have replaced it with the actual number.
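
For context on points 1 and 3, here is a rough sketch of how a repeat count usually enters a two-sided permutation test on a difference in medians, and why unequal group sizes are handled naturally (each shuffle keeps the original group sizes). I am assuming our test has this general shape; `permutation_test_median_diff`, `n_repeats`, and `seed` are illustrative names, not the identifiers used in our scripts.

```python
import random
import statistics


def permutation_test_median_diff(group_a, group_b, n_repeats=1000, seed=0):
    """Two-sided permutation test for median(group_a) - median(group_b).

    Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)
    observed = statistics.median(group_a) - statistics.median(group_b)

    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)  # group sizes are preserved in every shuffle

    extreme = 0
    for _ in range(n_repeats):
        rng.shuffle(pooled)
        diff = statistics.median(pooled[:n_a]) - statistics.median(pooled[n_a:])
        # Two-sided: count shuffles at least as far from 0 as the observed value.
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction so the estimated p-value is never exactly 0.
    p_value = (extreme + 1) / (n_repeats + 1)
    return observed, p_value


# Toy usage with made-up numbers and deliberately unequal group sizes.
small_group = [10, 12, 15, 18, 20]
large_group = [11, 13, 14, 16, 17, 19, 21, 22, 23, 25, 30, 40]
print(permutation_test_median_diff(small_group, large_group, n_repeats=500))
```

In the report, the value passed as `n_repeats` is the number that should appear in place of the variable name.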