UBC-MDS / data-analysis-review-2023


Submission: Group 18: English Language Learning Ability Prediction Analysis #14

Open rbouwer opened 7 months ago

rbouwer commented 7 months ago

Submitting authors: @rbouwer @farrandi @atabak-alishiri @salva-u

Repository: https://github.com/UBC-MDS/522-workflows-group-18 Report link: https://ubc-mds.github.io/522-workflows-group-18/docs/english_language_learning_ability_prediction_analysis.html Abstract/executive summary: The analysis details the construction of a linear regression model to predict an individual's English Proficiency Score, considering factors like age, education, and language background. The final model employed Ridge linear regression with L2 regularization, achieving an optimal alpha value of 1.546352. Performance evaluation used two metrics: the R-squared score and the Root Mean Squared Error (RMSE). The model's R-squared value was 0.2424, explaining about 24.24% of the variance in correct English Proficiency Scores, while the RMSE indicated an average prediction error of 5.3178%. Our model did not perform especially well, as seen from our RMSE and R-squared scores. This might be because the features do not have a linear relationship to the score, so we are considering improving this model with a Polynomial Feature Transform and/or ensemble methods. However, analysis revealed that the model performed better for higher actual English Proficiency Scores, making it potentially useful as an initial tool in the analysis of individuals wishing to learn English. The model could guide resource allocation or the level of guidance necessary for efficient English learning. The most significant features in the dataset related to English Proficiency Scores were found to be the "Eng_little" encoding, indicating the individual's current level of English (e.g., native, immersion learner, non-immersion learner).
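The modelling approach described in the abstract (Ridge regression with a cross-validated alpha, evaluated by R-squared and RMSE) can be sketched roughly as follows. This is a minimal illustration with synthetic placeholder data, not the study's actual features or results:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in data for the study's features (e.g. age, education,
# language exposure) and the English Proficiency Score target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, 2.0, 0.0, -1.0]) + rng.normal(scale=2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Ridge (L2-regularized) regression with alpha tuned by cross-validation,
# mirroring the tuned alpha of ~1.55 reported in the abstract
search = GridSearchCV(Ridge(), {"alpha": np.logspace(-2, 2, 50)}, cv=5)
search.fit(X_train, y_train)

# The two evaluation metrics used in the report
pred = search.predict(X_test)
r2 = r2_score(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
```

The search object and data above are illustrative; the report's actual pipeline may differ in preprocessing and tuning details.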

Editor: @ttimbers Reviewers: Marco Polo Bravo Montiel (@marcony1), Weilin Han (@hwl1008), Kittipong Wongwipasamitkun (@kwjo), Chris Gao (@chrisgqy)

jokittipong commented 7 months ago

Data analysis review checklist

Reviewer: @jokittipong

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

In README.md, you specify the project's MIT license but do not specify the license of your dataset. I did a little research on this: the dataset, and the article that uses it, are funded by an organization and published as a free PMC (PubMed Central) article, freely available to the public, and I found the relevant disclaimer here: "https://pubmed.ncbi.nlm.nih.gov/disclaimer/". I am not sure how to state this correctly in LICENSE.md; maybe create a section for the data you used and cite the PMC disclaimer there (or better, ask Tiff exactly how you should handle this). This is also the link to the terms of use of the website you downloaded from: "https://github.com/CenterForOpenScience/cos.io/blob/master/TERMS_OF_USE.md". However, since your data comes from a third party rather than from that website directly, it is probably not relevant (in my opinion). I posted it here just in case you want to check it yourselves. Hope it helps!

Don’t forget to put your names as authors under your project name in the report.

For the report, you could provide some visualizations of the distributions you mainly focused on; it may improve the presentation of the report a bit. (optional)

Your repository name should be changed to relate to the project (rename it to your project name).

You could add a contact channel to the README (you already have a contact section, but no instructions on how to actually contact you).

The introduction in the report is already very good, but, in my opinion, it would be great if you also briefly mentioned the uses of this analysis (which you describe in the summary) in the introduction, to make readers aware of how useful your analysis can be. (optional)

I also saw that you provided the limitations of the analysis, but, in my opinion, separating them into their own subsection would make them stand out and make it easier for readers to find the limitations in your report. (optional)

In README.md, the About paragraph is concatenated when rendered on GitHub; you should split it. (I understand this comes from how Markdown handles line breaks; maybe add an extra blank line to separate the paragraphs.)

Love your README section: very well organised and very easy to read. I will adapt it for my group too.

Finally, I learned a great deal from reading your work, and I will adapt ideas from it in my group project too. Thanks a lot for your great work!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

chrisgqy commented 7 months ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1.5

Review Comments:

  1. The licence should include both the MIT licence for the project and the licence of the dataset being analyzed.
  2. Team members' names should be included at the top of the report.
  3. The code chunk of dependencies used for the analyses should be hidden.
  4. The automation of testing is not set up properly. I would suggest the team add the "pytest" package to streamline all the testing functions.
  5. I would recommend visually presenting the distribution of the data used in the EDA section. I didn't raise an issue about this in the checklist, but distribution plots are nice to have.
  6. You can totally ignore this note, but I feel there might be potential to improve on the current R-squared score of 0.24. I recommend employing a more complex model structure and introducing more explanatory variables.
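On point 6, one way to try a more flexible model is to add polynomial features in front of the Ridge step (which the authors also mention considering in their summary). A minimal sketch with synthetic data, just to show the pattern; the column count and data are placeholders:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a mild nonlinearity in the first feature
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=400)

# Degree-2 polynomial/interaction terms before the regularized linear step;
# a plain linear Ridge would miss the squared term entirely
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      StandardScaler(),
                      Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, scoring="r2", cv=5)
```

Whether this actually helps on the real data depends on the features; cross-validated R-squared, as above, is a fair way to check.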

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

WeilinHan8 commented 7 months ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

1.5

Review Comments:

In addition to all the issues mentioned above, I have the following suggestions:

  1. The README.md file looks organized, but it would be great if you could break the About section into several paragraphs so it is more reader-friendly (the same goes for the summary section in the report). I like the icons in the subtitles!
  2. I think the team contract does not need to live in the repo, since the repo is public and visible to everyone.
  3. In the background section, you mention "Various studies have explored a range of determinants, including age, educational background, language exposure, and the presence of learning disabilities like dyslexia", so you should include references for these studies.
  4. Some variable-insertion errors: "suggesting a % average prediction error." in the methods and results section of the report; "RMSE scores and fit times ({numref}lasso_top_models-fig)." for figure 3.
  5. I am not sure whether it is good practice to put the EDA in a separate file; please confirm this with Tiff or a TA. But I think the result plot is sufficient to justify the decision.
  6. Figure 4 is too small compared to the other figures, and the text looks blurry because of the low resolution.
  7. Some typos, for example in the paragraphs under figure 5: "as we increase the predited scores", "which combination os features would", etc. Overall, the report and analysis look good, and I have learnt a lot from them!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Marcony1 commented 7 months ago

Data analysis review checklist

Reviewer: @Marcony1

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2

Review Comments:

  1. Some typos. A small typo ("... want toto close...") in the note after the third point of the Docker method in the instructions. There are also some minor typos in the report (e.g. "predited scores" in the paragraph after Fig. 5).

  2. I could not run your tests. Your test code imports modules from src, but they are not in src; they are in src/helper.

  3. Regarding the data, you said you dropped some columns because they were filled with zeros. That probably happened because your original dataset was too big and you took a 30% sample to work with. Although there are not many dyslexic respondents, for example, dyslexia might still be a variable with a high impact for those who have it. Do not feel obliged to act on this point, as it would probably not increase $R^2$ given how few dyslexic respondents are in your dataset; still, it would have been interesting to investigate that variable.

  4. Regarding the correlation matrix, I would suggest changing the color scheme. Apparently, red means "strong positive correlation"; however, there are both positive and negative values shown in blue. It would also be better if a dark shade were used for strong correlations (either positive or negative) and light shades for weakly correlated pairs. At the moment, a stronger or weaker correlation may map to either shade.

  5. Regarding Fig. 5, it would be interesting if you discussed further why there seems to be a straight line acting as a boundary on the upper right-hand side of your scatterplot. Also, the graph seems a little crowded; it would probably be better to make it bigger and either make the dots more translucent or switch to a color scheme that shows the density of different sections of the plot (a 2-D histogram might work better).

  6. Although not strictly necessary to include, distribution plots could really help identify patterns in the continuous variables. They would also let you decide whether to apply a transformation (e.g. a log transform, or `KBinsDiscretizer`) to some of your variables.

  7. Since you are dealing with some categorical variables, I think it would also have helped to include some sort of overview plot (maybe a pie chart, just to get an idea of how the data is distributed). Some of your variables may actually not be that useful, because most observations belong to a single category.
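Points 4 and 5 above are plotting suggestions; a rough sketch with synthetic data (the column names and "actual/predicted score" labels are placeholders, not the report's variables):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in data with one strongly correlated pair of columns
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(300, 4)), columns=list("abcd"))
df["b"] = df["a"] * 0.8 + rng.normal(scale=0.3, size=300)

# Point 4: a diverging colormap centered at 0, so strong correlations
# (positive or negative) are dark and weak ones are light
corr = df.corr()
fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="RdBu_r", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax)
ax.set_xticks(range(len(corr)), labels=corr.columns)
ax.set_yticks(range(len(corr)), labels=corr.columns)

# Point 5: a hexbin (2-D histogram) shows density where a crowded
# scatterplot would overplot
fig2, ax2 = plt.subplots()
ax2.hexbin(df["a"], df["b"], gridsize=25, cmap="viridis")
ax2.set_xlabel("actual score")
ax2.set_ylabel("predicted score")
```

With `vmin=-1, vmax=1` and a diverging map like `RdBu_r`, zero correlation is always the pale midpoint, which avoids the ambiguity described in point 4.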

I found your project remarkably interesting. It caught my attention from the very beginning and got me excited (regardless of the results). I can understand that the $R^2$ may have been disappointing after all your hard work, but that is not because you did anything wrong or because your original idea is incorrect; there is definitely something there. It is understandable that computational and time constraints did not let you fully explore the universe of your problem. If refined enough, the ideas in this project could help give people an idea of how they would score on an exam before they attempt it, given their context (it would probably also be a good idea to ask whether they have been studying for a certain exam lately and how much time they invest in doing so). There are a lot of possibilities, and I would encourage you not to be discouraged by the $R^2$ score. Sometimes this is a question of trial and error, and this was just one of several trials you could have attempted.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.