autograder-org / autoGrader-frontend

An automated assignment grading system that leverages large language models (LLMs) to improve grading efficiency and reliability. It includes modules for data input, criteria definition, AI integration, consistency checks, and comprehensive reporting, with the aim of improving educational outcomes.
https://autograder.dev

Evaluate Efficiency of LLMs to grade assignments using Prompting Techniques #11

Open parthasarathydNU opened 5 months ago

parthasarathydNU commented 5 months ago

Objective

To experimentally determine the effectiveness of Large Language Models (LLMs) in grading various types of assignment submissions and to assess their performance relative to human graders.

Background Resources

Experimental Design

  1. Hypothesis Formation: Develop clear hypotheses about the potential outcomes of LLM-based assignment grading. As a starting point, see the wiki document "Hypotheses for Testing Automated Assignment Grading Software".

  2. Data Collection:

     - Conduct tests using both whole assignments and segmented parts.
     - Collect grading outcomes from LLMs and compare them with benchmarks set by human graders.

  3. Analysis:

     - Use statistical methods to analyze the collected data and validate the hypotheses.
     - Evaluate how closely LLM grading aligns with human grading standards.
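The analysis step above could be sketched with a few simple agreement metrics (mean absolute error, Pearson correlation, and agreement within a tolerance band). The scores below are hypothetical placeholders, and the ±5-point tolerance is an assumed threshold, not one specified in this issue:

```python
from statistics import mean

# Hypothetical scores (0-100) for ten submissions; in a real run these would
# come from the LLM grading experiments and the human-grader benchmarks.
human_scores = [88, 72, 95, 60, 81, 77, 90, 65, 84, 70]
llm_scores = [85, 75, 93, 58, 84, 74, 92, 68, 80, 73]

def mean_absolute_error(a, b):
    """Average absolute difference between paired scores."""
    return mean(abs(x - y) for x, y in zip(a, b))

def pearson_r(a, b):
    """Pearson correlation coefficient between paired scores."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def within_tolerance(a, b, tol=5):
    """Fraction of submissions where LLM and human grades differ by at most `tol` points."""
    return sum(abs(x - y) <= tol for x, y in zip(a, b)) / len(a)

print(f"MAE:       {mean_absolute_error(human_scores, llm_scores):.2f}")
print(f"Pearson r: {pearson_r(human_scores, llm_scores):.3f}")
print(f"Within ±5: {within_tolerance(human_scores, llm_scores):.0%}")
```

The same metrics could be computed separately for whole-assignment and segmented-grading runs to compare the two data-collection conditions.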

Goals

Future Work

Expected Outcomes

This issue aims to methodically assess the capabilities of LLMs in an educational setting, focusing on their potential to enhance or replace traditional grading methods while maintaining or improving grading accuracy and personalization.