Submission: Group 5: Inference sentence length

Submitting authors: @Radascript, @AraiYuno, @miyer26, @showcy

Repository: https://github.com/UBC-MDS/DSCI_522_inference_on_indigenous_vs_non_indigenous_sentence_length_differences

Report link: https://htmlpreview.github.io/?https://github.com/UBC-MDS/DSCI_522_inference_on_indigenous_vs_non_indigenous_sentence_length_differences/blob/main/doc/sentence_length_diffs_inference_report.html

Abstract/executive summary: For this project we carried out a hypothesis test to determine whether there was a significant difference in median sentence length between indigenous and non-indigenous offenders under the Correctional Service of Canada. The median was selected as the measure of central tendency, and a permutation test under the null model was carried out computationally with a significance level of 0.05. The null hypothesis stated that there is no difference in the population medians in sentence length between indigenous and non-indigenous offenders. The alternative hypothesis stated that there is a difference in the population medians in sentence length between indigenous and non-indigenous offenders. The resulting sample difference in the two medians was -56 days, with a corresponding p-value of 0.0328. The indigenous group was found to have shorter sentence lengths than the non-indigenous group. As this p-value was smaller than the significance level, there was statistically significant evidence to reject the null hypothesis of no difference in the median sentence lengths between the two groups. As we had a large sample size for both groups, our model was very sensitive to small differences in the median of both groups. Though this may raise some concern regarding the practical implications of the study, we believed it was important not to miss any existing effect due to the sensitivity of the issue at hand. The cost of a Type II error is more significant than a Type I error.

The data set used for this study is the Offender Profile from 2017-2018 released by the Correctional Service of Canada. The link to this site can be found here. Each entry in the data set corresponds to a single offender serving a two or more year long sentence. The demographic details such as age, gender and marital status at year end are provided for each entry. This was retrieved from the Offender Management System (OMS).
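The permutation test described in the summary can be illustrated with a minimal sketch. It is not the authors' code: the function names, the number of repeats, and the sentence lengths below are made up for illustration, and only the Python standard library is used.

```python
import random
import statistics


def median_diff(group_a, group_b):
    """Median of group_a minus median of group_b."""
    return statistics.median(group_a) - statistics.median(group_b)


def permutation_test_median(group_a, group_b, n_repeats=2000, seed=0):
    """Two-sided permutation test for a difference in medians.

    Returns the observed difference and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = median_diff(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_repeats):
        # Shuffling the pooled data simulates the null model in which
        # group labels carry no information about sentence length.
        rng.shuffle(pooled)
        simulated = median_diff(pooled[:n_a], pooled[n_a:])
        if abs(simulated) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above 0.
    p_value = (extreme + 1) / (n_repeats + 1)
    return observed, p_value


# Hypothetical sentence lengths in days, for illustration only.
indigenous = [730, 760, 800, 900, 1095, 1200, 1460]
non_indigenous = [730, 820, 910, 1000, 1150, 1300, 1825, 2190]

observed, p_value = permutation_test_median(indigenous, non_indigenous)
print(f"observed difference in medians: {observed} days, p-value: {p_value:.3f}")
```

The null hypothesis would be rejected at the 0.05 level when the returned p-value falls below 0.05. A negative observed difference means the first group has the shorter median.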

Editor: @Radascript, @AraiYuno, @miyer26, @showcy Reviewer: Nagraj Rao, TZ Yan, Abhiket Gaurav, Adrianne Leung

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: @adrianne-l

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

Good job, guys! The question is a very interesting topic, and it is an excellent idea to use a hypothesis test: the findings are intriguing and a good indicator for policy review and for personal reflection on racial bias.
I find the plots in your EDA are very good tools to help readers to understand the challenge and analysis chosen to explore this question. Personally, I think it would be better to include some of the plots in the Introduction section of the report.
In the report Results & Discussion section, would it be a good idea to include some sub-sections to summarise the interim findings to better navigate and follow your flow of result interpretation?
I appreciate the discussion in the results and you went the extra mile to investigate and propose follow-up analysis to improve the results.
In the Usage script, I cannot execute the script to download the data file. There is an extra argument for
The licence in README.md is not updated according to the dataset used.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @ytz

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

Interesting topic!

Love the discussion on Type I & Type II error
Good use of charts to justify the use of median
Density plot and box-plot piqued my curiosity, as the distribution between the 2 groups looked fairly similar

Slight nitpicking and suggestions on the following:

Useful to briefly mention the number of data points on your data set, under the 'Data' subheading.
For Figure 1, the resolution is slightly low in the report. Consider using a higher resolution to make it look sharper.
For Figure 2, consider using log scale on x-axis for Figure 2 to make the box-plots more prominent
Not 100% sure whether the use of 'confidence interval' is correct in "...we noted the large overlap in the confidence intervals between the two groups"
Since the focus is on the indigenous group, you could use a monotone colour for the non-indigenous group, and a primary colour like red or blue for the indigenous group. That will make it easier for the reader to interpret the chart

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Nagraj Rao

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Excellent work team! Your topic is crucial from a policy perspective, and I particularly enjoyed the implementation of hypothesis testing to answer your research question. You did an excellent job with your writing and the flow was seamless and made it an easy read!

My feedback is intended to catapult your work from A+ to an A++. Not all comments may be applicable given the limitations of your dataset (or time), but I figured it is worth mentioning.

Your results show that the median difference is -56. Based on your test statistic, this implies that the median days spent by indigenous group in jail is lower than median days spent by non-indigenous group. This is counter to what most people would expect. Despite the use of medians, this indicates that outliers continue to be a problem (for example: I tried to trim the data to cases below 5000 and I get opposite results, and the results get narrower as I decrease the bandwidth). Can you dive deeper into this?
I noticed that your data has discontinuity in aggregate incarcerations (jumps from 0 to 730, with no values in between). Are these true 0’s or just NULL’s which StatsCan has reported as 0? How does excluding the 0’s change your analysis?
It might be useful to explain what Type I and Type II error means within the context of your study, before providing your excellent explanation on how that may affect the results of your analysis and concluding that: “The cost of a Type II error is more significant than a Type I error.”
I suggest zooming into the boxplot by trimming the outliers. If you only restricted to values <3k for the box plot, you may be able to visualize the strong statistically significant difference you find in your formal hypothesis test. I suggest adjusting by factoring in points 1 and 2 made above when you present the final box plot.
Your alternative hypothesis in the report should insert "not" equal, else it exactly matches the null hypothesis.
Does the large difference in sample size between the two groups bear any consequences for the test results? Is class imbalance a problem?
Is it possible to provide dimensions of the data (total number of observations) for each of the groups in the README and the Data Section of the Report? It is noted that you have this available in your discussion.
Based on the codes provided, your download script indicates that the file should be saved in the raw folder under data. However, I do not see a raw (or processed folder) under data right now, and as a consequence, no data as well. Can you check if the script is working as intended?
(MINOR): In your report, the number of repeats, appears as N_REPEATS, and not a number.
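The sensitivity checks suggested in the first two points above (trimming long sentences and excluding zeros) could be explored along these lines. This is a minimal sketch using the Python standard library. The helper name `median_diff_after_filter`, the thresholds, and the data are hypothetical and not taken from the repository.

```python
import statistics


def median_diff_after_filter(group_a, group_b, lower=None, upper=None):
    """Median of group_a minus median of group_b, keeping lower < x < upper.

    A bound set to None is not applied.
    """
    def keep(values):
        return [
            v for v in values
            if (lower is None or v > lower) and (upper is None or v < upper)
        ]

    kept_a, kept_b = keep(group_a), keep(group_b)
    if not kept_a or not kept_b:
        raise ValueError("the filter removed every value in one group")
    return statistics.median(kept_a) - statistics.median(kept_b)


# Hypothetical sentence lengths in days, for illustration only.
indigenous = [0, 0, 730, 800, 900, 1100, 9000]
non_indigenous = [0, 730, 850, 1000, 1200, 6000, 12000]

scenarios = [
    ("all rows", {}),
    ("zeros excluded", {"lower": 0}),
    ("below 5000 days", {"upper": 5000}),
    ("zeros excluded, below 5000 days", {"lower": 0, "upper": 5000}),
]
for label, bounds in scenarios:
    diff = median_diff_after_filter(indigenous, non_indigenous, **bounds)
    print(f"{label}: difference in medians = {diff} days")
```

If the sign or size of the difference changes a lot between scenarios, that would support the concern that the result depends on the tails or on how zeros are recorded.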

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer:

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well-known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance of this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Wow!! Quite an interesting topic has been picked here and I am sure this is not just a theoretical project but has quite a lot of practical applications in policy making etc.

My feedback on the work done follows. Please keep in mind that it is only meant to make this work more thorough, so I may be nitpicking here and there; feel free to implement or ignore each point:

The project proposal has a question mark. If the proposal is in a questioning tone, maybe it needs rewording.
“As we had a large sample size for both groups, our model was very sensitive to small differences in the median of both groups.” This seems counter-intuitive.
The sample includes only offenders serving a sentence of two or more years. This is not representative of the overall population, and hence I feel the project title should be changed accordingly.
“meaning we had a large enough sample size to carry out a t-test.” If it is a t-test, then the test statistic should be a t-statistic.
The number of non-indigenous offenders is much larger than the number of indigenous offenders. Could sampling techniques be used to address this imbalance?
Is the data skewed, or is that simply the nature of the data? People receiving life sentences will be far fewer than people receiving short sentences. This reason should be mentioned somewhere. Also, we should do some outlier treatment and look at the data at various sample sizes. Does the hypothesis hold, or is it reversed?
Fig 1: The y-axis has no meaning, hence it can be removed to make the graph better/clear.
The box plot in Fig 2 does not show confidence intervals. The box edges are the 75th and 25th percentiles.
Fig 3: The distribution intuitively does not look like a normal distribution. Plotting the “most likely function” would be a better visualization.
Also, since the data is skewed, shouldn’t the null hypothesis be one-sided?
To substantiate the finding we should have incorporated metrics like the power of the test.
The EDA could have been more explanatory, for example by looking at the correlations between different variables, their relationship with the target, and whether there are any interactions between them.
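The distinction drawn above between the quartiles shown by a box plot and a confidence interval can be made concrete with a short sketch. It assumes a percentile bootstrap for the median, which is one common choice and not necessarily what the authors would use. The function names and the data are hypothetical.

```python
import random
import statistics


def quartiles(values):
    """First and third quartiles, i.e. the edges of the box in a box plot."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q1, q3


def bootstrap_median_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    tail = (1 - level) / 2
    lower = medians[int(tail * n_boot)]
    upper = medians[int((1 - tail) * n_boot) - 1]
    return lower, upper


# Hypothetical sentence lengths in days, for illustration only.
sentence_days = [730 + 15 * i for i in range(200)]

q1, q3 = quartiles(sentence_days)
ci_low, ci_high = bootstrap_median_ci(sentence_days)
print(f"quartiles (spread of the data): {q1} to {q3}")
print(f"95% bootstrap CI for the median: {ci_low} to {ci_high}")
```

The quartiles describe where the middle half of the observations lie, while the interval describes uncertainty about the median itself. The interval usually narrows as the sample grows, whereas the quartiles do not.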

Nevertheless, this is good work. Kudos to the team for all the efforts and hard work.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Thank you for all of your comments! We appreciated, agreed with, and implemented some of your comments.

About report

From: @ytz and @nrao944

Useful to briefly mention the number of data points on your data set, under the 'Data' subheading.

Is it possible to provide dimensions of the data (total number of observations) for each of the groups in the README and the Data Section of the Report? It is noted that you have this available in your discussion.

Our implementation:

Add the data dimension to "Data" section. - @93402f8

From: @adrianne-l

In the report Results & Discussion section, would it be a good idea to include some sub-sections to summarise the interim findings to better navigate and follow your flow of result interpretation?

Our implementation:

Add more details to help the reader follow our flow of interpretation. - @93402f8

From: @nrao944

Your alternative hypothesis in the report should insert "not" equal, else it exactly matches the null hypothesis.

In your report, the number of repeats, appears as N_REPEATS, and not a number.

Our implementation:

Fix the typos. - @9c3adc6 and @b6f3a8c

About data visualization

From: @ytz

Not 100% sure whether the use of 'confidence interval' is correct in "...we noted the large overlap in the confidence intervals between the two groups"

For Figure 2, consider using log scale on x-axis for Figure 2 to make the box-plots more prominent

Since the focus is on the indigenous group, you could use a monotone colour for the non-indigenous group, and a primary colour like red or blue for the indigenous group. That will make it easier for the reader to interpret the chart

Our implementation:

Change to "we noted the large overlap in the quantiles between the two groups". - @93402f8
Change the box plot to log scale. - @9eb3585
Change the color of the box plots to highlight the indigenous group. - @4257f08

Review of Milestone 1 from TA @Ivyqiuhan

This line can't run: df_init = pd.read_csv('../data/offender_profile.csv', sep=r'\s,\s', header=0, encoding='ascii', engine='python') because your file is at this path '../data/RAW/offender_profile.csv'

I can't run your code it has KeyError: 'Sentence Type' at cell 10

Our implementation:

Fix with Makefile in Milestone 3. - @943094d
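Both failures reported above (a wrong file path and a missing 'Sentence Type' column) can be caught early with an explicit check before the analysis runs. The sketch below uses the standard library `csv` module rather than the `pd.read_csv` call quoted above, purely to keep the example dependency-free. The function name `load_offender_rows` is hypothetical, and the UTF-8 encoding and comma delimiter are assumptions.

```python
import csv
from pathlib import Path


def load_offender_rows(csv_path, required_columns):
    """Read a CSV file into a list of dicts, failing early with clear errors.

    Header names and cell values are stripped of surrounding whitespace.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"expected data file at {path.resolve()}")

    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        header = [name.strip() for name in (reader.fieldnames or [])]
        missing = [col for col in required_columns if col not in header]
        if missing:
            raise KeyError(f"missing columns {missing}; found {header}")
        return [
            {
                key.strip(): (value or "").strip()
                for key, value in row.items()
                if key is not None  # skip overflow cells from malformed rows
            }
            for row in reader
        ]


# Example call, using the path and column name mentioned in the review:
# rows = load_offender_rows("../data/RAW/offender_profile.csv", ["Sentence Type"])
```

A check like this turns a late `KeyError` in a notebook cell into an immediate message that names the path or column that was expected.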

If figure captions are not provided the plot should be clearly explained in the text. I would recommend using figure captions.

Our implementation:

Add detailed captions. - @fbfc939

Need to add some explanation to the plot and your code

Our implementation:

Try to add more explanation throughout all commits.

Review of Milestone 2 from TA @Ivyqiuhan

You should create an environment.yaml file to contain all your dependencies

Our implementation:

Add the environment.yaml. - @9f728ce

In usage, should write how to run each of your scripts, not just "make all" and "make clean"

Our implementation:

Put our old step-by-step usage back into the README. - @8ae2880
