Title and Abstract (0.5/1)-
Please include an informative title for your project. Good - add which machine learning model you plan to use, the train/test split, any cross-validation, hyper-param tuning, as well as metrics you plan to use in 1-2 lines within the abstract as well. (-0.5)
Background (0.5/1) -
Please fix the in-text citation formatting so that each citation links to the appropriate entry in the Footnotes. Regarding prior work - what kinds of predictive models were used to predict yield? What were the results? Did you draw inspiration from these models for your approach, or how is your approach different? Address the above points. I would suggest finding a specific example of a predictive model for grape yield and highlighting relevant points from that work.
You also mention that prior work is limited by being too computationally expensive or having too many confounding variables - what about your work? How do you deal with these limitations? (-0.5)
Research Problem Statement (& Significance, Purpose of Project) (1/1.5) -
What is the research question driving your project? What do you plan to do? You address the need but don't clearly state how. Reword in this manner - We plan to use a RF model to .......
Guiding Rubric - Presents a significant research problem related to the chemical sciences. Articulates clear, reasonable research questions given the purpose, design, and methods of the project. All variables and controls have been appropriately defined. Proposals are clearly supported from the research and theoretical literature. All elements are mutually supportive.
(-0.5)
Data (0/1.5) -
The first line of this section is vague and needs to be rewritten - "The data for this project will be sourced from a vineyard and consists of three separate datasets that must be merged in order to perform analysis. "
Missing title and links to datasets (-1)
This section needs to be more detailed - For each dataset, first include the title and link, as well as state the number of observations and features in bold.
Then describe the features, and their type (numeric/categorical), pointing out the critical variables.
At the end of this section, mention the kinds of data pre-processing that will be done on features of interest (one-hot encoding, normalizing, etc.). Are there any other feature engineering techniques you plan to use? Since this is a classification problem, how will you address any class imbalance issues? Mention these points. (-0.5)
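If it helps, here is a rough sketch of the pre-processing steps mentioned above: one-hot encoding a categorical feature, min-max normalizing a numeric one, and computing class weights as one way to handle imbalance. The feature names and values (`variety`, `soil_ph`, `yield_class`) are made up for illustration and are not taken from your datasets; the weighting formula is the same heuristic scikit-learn uses for `class_weight="balanced"`.

```python
from collections import Counter

def one_hot(values):
    """One-hot encode categorical values; columns follow sorted category order."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

def min_max(values):
    """Rescale numeric values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical toy rows (assumption, not real vineyard data)
variety = ["merlot", "syrah", "merlot", "pinot"]   # categorical feature
soil_ph = [6.0, 6.5, 7.0, 6.0]                     # numeric feature
yield_class = ["high", "low", "low", "low"]        # imbalanced label

categories, encoded = one_hot(variety)
scaled = min_max(soil_ph)
weights = balanced_class_weights(yield_class)
```

Rarer classes get larger weights, so a model that accepts per-class weights penalizes mistakes on them more heavily. Resampling is an alternative; whichever you pick, state it in this section.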
~I'm concerned about the amount of data you have - we're ideally looking at 10k observations. Your original dataset only has 4k observations - this is FAR too little to conduct any significant analysis on.
You need to have alternate dataset sources for your problem and ensure you have at least 10k observations after cleaning and wrangling~
Edit - Sorry guys was up super late grading and misread the rubric - you can certainly conduct your analysis on 4k observations. Just make sure to conduct the appropriate cross-validation techniques given the size and nature of your dataset.
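As one example of a cross-validation scheme that suits a smaller classification dataset, here is a minimal sketch of stratified k-fold splitting, where every fold keeps roughly the same class proportions as the full dataset. The labels below are hypothetical, and `stratified_k_fold` is an illustrative helper written for this note, not a library function; in practice a library implementation would do the same job.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds roughly preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class's indices round-robin across the k folds
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

# Hypothetical imbalanced labels: 80 "low" and 20 "high" observations
labels = ["low"] * 80 + ["high"] * 20
splits = list(stratified_k_fold(labels, k=5))
```

With these toy labels, each of the 5 test folds holds 16 "low" and 4 "high" observations, so no fold ends up without minority-class examples.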
Proposed Solution (1/1.5) -
Good start - but include details about the kind of train/test split, cross-validation technique, and hyperparameter tuning you plan to use as you work on the project. (-0.5)
First implement DT and LogReg as baselines, evaluate them based on your metrics, and then implement RF (as you mentioned).
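To make the workflow above concrete, here is a rough scikit-learn sketch: a stratified 80/20 train/test split, 5-fold stratified cross-validation, and a small grid search for each model, with DT and LogReg as baselines and RF last. Everything specific here is an assumption for illustration: the data is synthetic (`make_classification`), the target is treated as binary, F1 is used as the scoring metric, and the hyperparameter grids are placeholders rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the merged data (assumption: binary, mildly imbalanced target)
X, y = make_classification(n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=0)

# Hold out 20% as a final test set, stratified so both splits keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baselines first, then RF; each grid is a small illustrative placeholder
candidates = {
    "decision_tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5, None]}),
    "log_reg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [5, None]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=cv, scoring="f1")
    search.fit(X_train, y_train)  # tuning only ever sees the training split
    results[name] = {
        "best_params": search.best_params_,
        "cv_f1": search.best_score_,
        "test_f1": f1_score(y_test, search.predict(X_test)),
    }

for name, r in results.items():
    print(name, r["best_params"], round(r["cv_f1"], 3), round(r["test_f1"], 3))
```

The test split is touched only once per model, after tuning, so the reported test score is not inflated by the hyperparameter search.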
Note: Your model is not the benchmark model; a benchmark is a model that has already been used on this problem and that you would compare your results against. Are you using a prior benchmark in this case, and is it appropriate to your problem? - This should be addressed in the background section.
Metrics (1/1.5)
OK. Clearly state False Positives and False Negatives in the context of your problem. Include mathematical formulae for each metric. (-0.5)
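As a reference for the formulae, here is how the common classification metrics are written in terms of confusion-matrix counts. This assumes a binary setup where you designate one class as "positive"; which metrics you actually report, and which class counts as positive, is your call and should be stated explicitly. $TP$, $TN$, $FP$, and $FN$ are the counts of true positives, true negatives, false positives, and false negatives.

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\[4pt]
\text{Precision} &= \frac{TP}{TP + FP} \\[4pt]
\text{Recall}    &= \frac{TP}{TP + FN} \\[4pt]
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2\,TP}{2\,TP + FP + FN}
\end{align*}
```

A false positive is an observation predicted as the positive class that actually belongs to the negative class; a false negative is the reverse. Spell out what each of these means for your yield classes, and which of the two errors is more costly.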
Ethics and Privacy (0.5/0.5)
Good start - how do you plan to deal with confounds in your model? Address them. Keep adding to this section as you work on your project.
Team Expectations - OK.
Timeline - OK.
Other Comments - Interesting topic - but make sure you have enough data for this project.
You can reply to this feedback below. Make sure to implement all of the feedback mentioned here so you can make up these points during Checkpoint. Contact me anytime if you want help improving your project or have any questions at all!
~Project Proposal Grade - 5/9~ Updated Project Proposal Grade - 9/9