dodona-edu / dodona

🧑‍💻 Learn to code for secondary and higher education
https://dodona.be

Automatically generate draft answers for student questions #5331

Open bmesuere opened 8 months ago

bmesuere commented 8 months ago

With the increasing capabilities of LLMs, it is only a matter of time before they become powerful and cheap enough to use inside Dodona. A first step might be to generate draft answers for student questions: when a student asks a question, an LLM generates a draft answer that a teaching assistant then reviews, edits, and sends.

This approach minimizes risk since each AI-generated answer undergoes human review and editing. Moreover, it's not time-sensitive. If the AI draft is inadequate or fails, the situation remains as it is currently. However, the potential time savings could be substantial.


Since this would be our first LLM integration, it will also involve some research.
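
To make the intended flow concrete, here is a minimal sketch of what the integration could look like (all names such as generateDraft and saveDraftForReview are hypothetical; nothing like this exists in Dodona yet):

async function handleNewQuestion(question) {
  try {
    // Ask an LLM for a draft answer; this can fail or produce a poor draft.
    const draft = await generateDraft(question);
    // Store the draft next to the question so the TA can review, edit, and send it.
    await saveDraftForReview(question.id, draft);
  } catch (error) {
    // If drafting fails, the TA answers from scratch, exactly as today.
    console.warn(`No draft generated for question ${question.id}:`, error);
  }
}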

bmesuere commented 8 months ago

Some old code I wrote to generate answers based on questions as a stand-alone script:

import OpenAI from "openai";

import { JSDOM } from 'jsdom';

const dodonaHeaders = new Headers({
  "Authorization": "" // Dodona API token
});

const openai = new OpenAI({
  apiKey: "" // OpenAI API key
});

const systemPrompt = "Your goal is to help a teaching assistant answer student questions for a university-level programming course. You will be provided with the problem description, the code of the student, and the question of the student. Your answer should consist of 2 parts. First, very briefly summarize what the student did wrong to the teaching assistant. Second, provide a short response to the question aimed at the student in the same language as the student's question.";

const questionId = 148513; // id of the question (annotation) to generate a draft for

async function fetchData(questionId) {
  // fetch question data from https://dodona.be/nl/annotations/<ID>.json
  let r = await fetch(`https://dodona.be/nl/annotations/${questionId}.json`, {headers: dodonaHeaders});
  const questionData = await r.json();
  const lineNr = questionData.line_nr;
  const question = questionData.annotation_text;
  const submissionUrl = questionData.submission_url;

  // fetch submission data
  r = await fetch(submissionUrl, { headers: dodonaHeaders });
  const submissionData = await r.json();
  const code = submissionData.code;
  const exerciseUrl = submissionData.exercise;

  // fetch exercise data
  r = await fetch(exerciseUrl, { headers: dodonaHeaders });
  const exerciseData = await r.json();
  const descriptionUrl = exerciseData.description_url;

  // fetch description
  r = await fetch(descriptionUrl, { headers: dodonaHeaders });
  const descriptionHtml = await r.text();
  const description = htmlToText(descriptionHtml);

  return {description, code, question, lineNr};
}

async function generateAnswer({description, code, question, lineNr}) {
  // Ask the model for a draft answer based on the description, code, and question
  const response = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      {"role": "system", "content": systemPrompt},
      {"role": "user", "content": `Description: ${description}\nCode: ${code}\nQuestion on line ${lineNr}: ${question}`}
    ]
  });
  console.log(response.choices[0].message);
  return response.choices[0].message.content;
}

function htmlToText(html) {
  // Extract the plain text of the exercise description and drop leftover inline-script lines
  const dom = new JSDOM(html);
  const text = dom.window.document.body.textContent
    .split("\n")
    .map(line => line.trim())
    .filter(line => !line.includes("I18n"))
    .filter(line => !line.includes("dodona.ready"))
    .join("\n");
  // Everything after the "Links" section is Dodona page boilerplate
  return removeTextAfterSubstring(text, "Links").trim();
}

function removeTextAfterSubstring(str, substring) {
  const index = str.indexOf(substring);

  if (index === -1) {
    return str;  // substring not found
  }

  return str.substring(0, index);
}

const data = await fetchData(questionId);
console.log(data);
await generateAnswer(data);
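
Running this as a stand-alone script assumes Node 18+ (for the built-in fetch), the openai and jsdom packages, and a Dodona API token and OpenAI API key filled in above; the question id is hard-coded and has to be replaced per question.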
bmesuere commented 8 months ago

I tested the runtime performance of a few models on my Mac Studio (64 GB of memory):

| Model | Quantization | Memory usage | Inference speed |
| --- | --- | --- | --- |
| codellama-34b-instruct | Q5_K_M | 22.13 GB | 9.87 tok/s |
| codellama-34b-instruct | Q6_K | 25.63 GB | 9.58 tok/s |
| codellama-34b-instruct | Q8_0 | 33.06 GB | 9.32 tok/s |
| codellama-70b-instruct | Q4_K_M | 38.37 GB | 7.00 tok/s |
| codellama-70b-instruct | Q6_0 | 49.39 GB | crashed |
| mixtral-8x7b-instruct | Q5_K_M | 29.64 GB | 21.5 tok/s |

I could not validate the output of codellama-70b since it seems to use a different prompt format.
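
For context, the smaller CodeLlama instruct models (7B/13B/34B) use the Llama-2-style [INST] chat template, while the 70B instruct variant introduced its own format. A rough prompt builder for the 34B model might look like this (a sketch; the exact template depends on the model and runtime, so treat it as an assumption):

function buildInstructPrompt(system, user) {
  // Llama-2-style instruct template as used by codellama-34b-instruct
  return `[INST] <<SYS>>\n${system}\n<</SYS>>\n\n${user} [/INST]`;
}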

bmesuere commented 8 months ago

I played around with the various models this afternoon. Some early observations: