Open rbouwer opened 11 months ago
In readme.md, you specify that the project is under the MIT license, but you do not specify a license for your data. I did a little research on this: your dataset, and the article that uses it, were funded by an organization and published as a free PMC (PubMed Central) article, which is freely available to the public. I found the disclaimer at "https://pubmed.ncbi.nlm.nih.gov/disclaimer/". I don't know exactly how this should be stated in license.md; maybe you could create a section for the data you used and cite the free PMC disclaimer there (or, better, ask Tiff how exactly you should do this). Here is also the link to the terms of use of the website you downloaded the data from: "https://github.com/CenterForOpenScience/cos.io/blob/master/TERMS_OF_USE.md". Since your data comes from a third party rather than from this website directly, I don't think it is worth looking at, but I am posting it here in case you want to check for yourselves. Hope it helps!
Don’t forget to put your names as authors under your project name in the report.
For the report, you could add some visualizations of the distributions you mainly focused on; this may make the report easier to follow. (optional)
Your repository name should be changed to reflect the project (rename it to your project name).
You could add contact details to the README (you already have a contact section, but no instructions on how to contact you).
The introduction in the report is already very good but, in my opinion, it would be great if you briefly mentioned the uses of this analysis (which you give in the summary) in the introduction, so that readers are aware of how useful your analysis can be. (optional)
I also saw that you describe the limitations of the analysis but, in my opinion, if you separate them into their own subsection so that they stand out, readers may find the limitations in your report more easily. (optional)
In README.md, the About paragraphs are run together when rendered on GitHub; you should split them. (I understand this is caused by how Markdown handles line breaks; adding extra blank lines between the paragraphs should separate them.)
Love your README; it is very well organised and very easy to read. I will adapt it for my group too.
Finally, I learned a great deal from reading your work and I will adapt it to my group project too. Thanks a lot for your great work!!!!
This was derived from the JOSE review checklist and the ROpenSci review checklist.
1.5
1.5
In addition to all the issues mentioned above, I have the following suggestions:
Some typos. There is a small typo ("... want toto close...") in the note after the third point in the docker method of the instructions. There are also some minor typos in the report (e.g. "predited scores" in the paragraph that comes after Fig. 5).
I could not run your tests. In your test code, you import modules from src, but they are not in src; they are in src/helper.
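One possible way to address the import problem is sketched below. It assumes the test files live in a `tests` folder directly under the repository root and the helper modules in `src/helper`; the function name `add_helper_dir_to_path` is hypothetical and not part of the project.

```python
import sys
from pathlib import Path


def add_helper_dir_to_path(test_file: str) -> Path:
    """Put <repo>/src/helper on sys.path so test modules can import the helpers.

    Assumption: test_file lives in <repo>/tests, one level below the repo root.
    """
    repo_root = Path(test_file).resolve().parents[1]
    helper_dir = repo_root / "src" / "helper"
    if str(helper_dir) not in sys.path:
        # Insert at the front so the project's helpers win over same-named modules.
        sys.path.insert(0, str(helper_dir))
    return helper_dir
```

A test file would call `add_helper_dir_to_path(__file__)` before importing the helper modules. Alternatively, the imports in the tests could simply be changed to point at `src.helper` instead of `src`.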
Regarding the data, you said that you dropped some columns because they were filled with zeros. That probably happened because your original dataset was too big and you took a 30% sample to work with. Although there are not many dyslexic participants, for example, that variable might have a high impact on the scores of those who are dyslexic. You need not act on this point, as it would probably not increase $R^2$ given how few dyslexic participants are in your dataset. Still, it would have been interesting to investigate that variable.
Regarding the correlation matrix, I would suggest changing the color scheme. Apparently, red means "strong positive correlation". However, there are both positive and negative values in blue. Also, it would be better if a dark shade were used for strong correlations (either positive or negative) and light shades for weakly correlated pairs. At the moment, a stronger or weaker correlation can end up with either a light or a dark shade.
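A minimal sketch of the color scheme suggested here: a diverging colormap with limits fixed at -1 and 1, so that zero is the lightest color and the shade darkens with the strength of the correlation in either direction. The column names and the pure-Python `pearson` helper are illustrative assumptions; in practice the matrix would more likely come from pandas' `DataFrame.corr()`. Matplotlib is imported only inside the plotting function.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Square matrix of pairwise correlations, in the dict's key order."""
    names = list(columns)
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]


def plot_correlation_heatmap(matrix, labels):
    """Heatmap where color intensity tracks |r| and hue tracks the sign."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # Diverging colormap, symmetric limits: white at 0, dark red at +1, dark blue at -1.
    image = ax.imshow(matrix, cmap="RdBu_r", vmin=-1.0, vmax=1.0)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=90)
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    fig.colorbar(image, ax=ax, label="Pearson correlation")
    return fig


# Hypothetical toy columns, only to show the call pattern.
toy = {"age": [1, 2, 3, 4], "score": [2, 4, 6, 8], "errors": [4, 3, 2, 1]}
toy_matrix = correlation_matrix(toy)
```

Calling `plot_correlation_heatmap(toy_matrix, list(toy))` would then draw the figure. Fixing `vmin` and `vmax` symmetrically is what guarantees that equally strong positive and negative correlations get equally dark shades.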
Regarding Fig. 5, it would be interesting if you discussed in more detail why there seems to be a straight line acting as a boundary on the upper right-hand side of your scatterplot. Also, the graph seems a little crowded; it would probably be better if you made it a bit bigger and either made the dots more translucent or changed the color scheme to one that allows us to see the density (how crowded) of different sections of the plot (maybe a 2-D histogram would be better).
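To illustrate the 2-D histogram idea, here is a rough sketch that bins actual versus predicted scores onto a grid and draws the counts as colors. The function names, the axis labels, and the score range passed as `lo` and `hi` are assumptions for illustration, not taken from the report.

```python
def density_grid(xs, ys, bins, lo, hi):
    """Count points per cell on a bins x bins grid covering [lo, hi] on both axes."""
    width = (hi - lo) / bins
    grid = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        if not (lo <= x <= hi and lo <= y <= hi):
            continue  # ignore points outside the plotted range
        # Points exactly on the upper edge go into the last cell.
        col = min(int((x - lo) / width), bins - 1)
        row = min(int((y - lo) / width), bins - 1)
        grid[row][col] += 1
    return grid


def plot_density(grid, lo, hi):
    """Draw the grid so that color shows how crowded each region is."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 8))
    image = ax.imshow(grid, origin="lower", extent=(lo, hi, lo, hi), cmap="viridis")
    ax.set_xlabel("Actual score")
    ax.set_ylabel("Predicted score")
    fig.colorbar(image, ax=ax, label="Number of observations")
    return fig
```

Matplotlib can also do the binning itself with `ax.hist2d(actual, predicted, bins=50)` or `ax.hexbin(actual, predicted)`, and a plain scatterplot with a small `alpha` (for example `ax.scatter(actual, predicted, alpha=0.1)`) is a lighter-weight alternative.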
Although not strictly necessary to include, distribution plots could really help identify patterns in continuous variables. They would also help you decide whether you should apply a transformation (e.g. a log transformation, or `KBinsDiscretizer`) to some of your variables.
Since you are dealing with some categorical variables, I think it would also have helped to include some sort of general plot (maybe a pie chart, just to get an idea of how the data is distributed). Maybe some of your variables are actually not that useful because most observations belong to a single category.
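As a small illustration of how a distribution check can inform the choice of transformation, the sketch below computes the sample skewness of a variable before and after a log transform. The data are made up, and `log1p` (the log of one plus the value) is just one possible choice that tolerates zeros.

```python
import math


def sample_skewness(values):
    """Skewness as the third central moment divided by the variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_transform(values):
    """Apply log(1 + v) to each value; requires v > -1."""
    return [math.log1p(v) for v in values]


# Made-up, strongly right-skewed values.
raw = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
skew_before = sample_skewness(raw)
skew_after = sample_skewness(log_transform(raw))
```

A histogram of the variable before and after the transform would show the same thing visually: a long right tail that becomes much closer to symmetric.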
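A quick way to check whether a categorical variable is dominated by one category, sketched under the assumption that the column is available as a plain list of labels. The function names and the 0.95 threshold are arbitrary illustrative choices.

```python
from collections import Counter


def category_shares(values):
    """Share of observations per category, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.most_common()}


def is_dominated(values, threshold=0.95):
    """True if a single category accounts for at least `threshold` of the rows."""
    return max(category_shares(values).values()) >= threshold
```

The shares returned by `category_shares` can be passed directly to a bar or pie chart; variables for which `is_dominated` returns `True` carry little information and may be candidates for dropping.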
I found your project remarkably interesting. It caught my attention from the very beginning, and it got me excited (regardless of the results). I can understand that the $R^2$ may have been disappointing after all your hard work. However, that is not because you did something wrong or because your original idea is not correct. There is definitely something there, though it is understandable that computational and time constraints did not let you fully explore the universe of your problem. If refined enough, the ideas in this project could help give people an idea of how well they would score in an exam before they attempt it, given their context (it would probably also be a good idea to ask them whether they’ve been studying for a certain exam lately and how much time they invest in doing so). There are a lot of possibilities; it really got me excited, and I would encourage you not to be discouraged by the $R^2$ score. Sometimes this is a question of trial and error, and this was just one of several trials that you could have attempted.
Submitting authors: @rbouwer @farrandi @atabak-alishiri @salva-u
Repository: https://github.com/UBC-MDS/522-workflows-group-18
Report link: https://ubc-mds.github.io/522-workflows-group-18/docs/english_language_learning_ability_prediction_analysis.html
Abstract/executive summary: The analysis details the construction of a linear regression model to predict an individual’s English Proficiency Score, considering factors like age, education, and language background. The final model employed Ridge linear regression with L2 regularization, achieving an optimal alpha value of 1.546352. Performance evaluation used two metrics: R-squared score and Root Mean Squared Error (RMSE). The model’s R-squared value was 0.2424, explaining about 24.24% of the variance in correct English Proficiency Scores, while the RMSE indicated an average prediction error of 5.3178%. Our model did not perform that well as seen from our RMSE and R-squared scores. This might be because the features do not have a linear relation to the score so we are thinking of improving this model by using Polynomial Feature Transform and/or ensemble methods. However, analysis revealed that the model performed better for higher actual English Proficiency Scores, making it potentially useful as an initial tool in the analysis of individuals wishing to learn English. The model could guide resource allocation or the level of guidance necessary for efficient English learning. The most significant features in the dataset related to English Proficiency Scores were found to be the “Eng_little” encoding, indicating the individual’s current level of English (e.g., native, immersion learner, non-immersion learner).
Editor: @ttimbers
Reviewer: Marco Polo Bravo Montiel (@marcony1), Weilin Han (@hwl1008), Kittipong Wongwipasamitkun (@kwjo), Chris Gao (@chrisgqy)