Collect a set of contexts to generate questions for. -> list of texts
In batch, ask each of our LMs to generate several questions for each context, log all of them, and aggregate the results. -> for each text, a list of questions plus metadata (which system generated each one).
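A rough sketch of the batch step, to pin down the output shape. Everything named here is an assumption for illustration: the `GeneratedQuestion` record, the `generate_all` helper, the `(context, n) -> list of questions` generator signature, and the two stand-in generators, which only fake an LM call so the sketch runs on its own.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List, Optional


@dataclass
class GeneratedQuestion:
    context_id: int  # index into the list of contexts
    system: str      # which LM produced the question (the metadata)
    question: str


# Hypothetical generator signature: (context, n) -> up to n questions.
Generator = Callable[[str, int], List[str]]


def generate_all(
    contexts: List[str],
    systems: Dict[str, Generator],
    n_per_system: int = 3,
    log_path: Optional[str] = None,
) -> List[List[GeneratedQuestion]]:
    """Ask every system for several questions per context and aggregate them."""
    results: List[List[GeneratedQuestion]] = [[] for _ in contexts]
    for i, context in enumerate(contexts):
        for name, generate in systems.items():
            for question in generate(context, n_per_system):
                results[i].append(GeneratedQuestion(i, name, question))
    if log_path is not None:
        # Log every generated question, one JSON record per line.
        with open(log_path, "w", encoding="utf-8") as f:
            for per_context in results:
                for record in per_context:
                    f.write(json.dumps(asdict(record)) + "\n")
    return results


# Stand-in generators (assumptions): replace with real LM calls.
def template_lm(context: str, n: int) -> List[str]:
    return [f"What is the main point of this text? (variant {k + 1})" for k in range(n)]


def first_word_lm(context: str, n: int) -> List[str]:
    first_word = context.split()[0]
    return [f"Why does the text start with '{first_word}'? (variant {k + 1})" for k in range(n)]


contexts = ["First example context.", "Second example context."]
systems = {"template_lm": template_lm, "first_word_lm": first_word_lm}
results = generate_all(contexts, systems, n_per_system=2)
```

Keeping the system name on every record means the later comparison step can hide it from the user and still attribute each choice afterwards.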
Streamlit app: pick a random context from the list, pick some subset of its questions (random?), and ask the user which one is "best" (most appropriate, most helpful, ...?). -> the user's chosen question. Maybe have them rank the questions instead? (drag them into best-to-worst order, or just pick the best)
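A minimal sketch of the logic behind the app for the "pick the best" variant, kept free of Streamlit so it can be run and tested on its own. The names `Candidate`, `Trial`, `sample_trial`, and `record_choice` are hypothetical, and random subset selection is only one of the options left open above.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Candidate:
    system: str    # kept for logging, not shown to the user
    question: str


@dataclass
class Trial:
    context_id: int
    context: str
    candidates: List[Candidate]  # in the order shown to the user


def sample_trial(
    contexts: List[str],
    questions: List[List[Candidate]],
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> Trial:
    """Pick a random context and a random subset of up to k of its questions."""
    if rng is None:
        rng = random.Random()
    context_id = rng.randrange(len(contexts))
    pool = questions[context_id]
    # sample() returns the subset in random order, so no system is always first.
    candidates = rng.sample(pool, min(k, len(pool)))
    return Trial(context_id, contexts[context_id], candidates)


def record_choice(trial: Trial, chosen_index: int) -> Dict[str, object]:
    """Build the row to log once the user has picked one question."""
    chosen = trial.candidates[chosen_index]
    return {
        "context_id": trial.context_id,
        "shown": [c.question for c in trial.candidates],
        "chosen_question": chosen.question,
        "chosen_system": chosen.system,
    }


contexts = ["First example context.", "Second example context."]
questions = [
    [Candidate("lm_a", "Q1?"), Candidate("lm_b", "Q2?"), Candidate("lm_a", "Q3?")],
    [Candidate("lm_a", "Q4?"), Candidate("lm_b", "Q5?"), Candidate("lm_b", "Q6?")],
]
trial = sample_trial(contexts, questions, k=2, rng=random.Random(0))
row = record_choice(trial, chosen_index=0)
```

In the app itself, the trial would probably need to live in `st.session_state` so it survives Streamlit's reruns, with the candidate questions offered through a widget such as `st.radio`. For the ranking variant, `record_choice` would store an ordering of candidate indices instead of a single index.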