colinmegill opened this issue 4 days ago
Remarkably stable across multiple calls to Claude 3.5 Sonnet on any given dataset! GPT-4 and Gemini are not as good, but possibly because the prompt was iterated against Claude 3.5 Sonnet.
TL;DR sanity-check "evaluation":
Harder to evaluate, but needed: evaluating coverage.
Examples of stability across calls to Claude 3.5 Sonnet on the Bowling Green report:
and on the New Zealand report:
Example of Gemini Advanced output not meeting expectations (with the caveat that the prompt was developed against Claude 3.5 Sonnet, so it is not entirely shocking that it doesn't port):
And an example from GPT-4 with bad formatting: the quotes are present but hidden in the markup due to bad HTML (seen below); same caveat applies. HTML from ChatGPT:
```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Pol.is Conversation Summary</title>
  <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/tufte-css/1.8.0/tufte.min.css">
  <style>
    article {
      margin: auto;
      max-width: 40em;
    }
  </style>
</head>
<body>
  <article>
    <h1>Pol.is Conversation Summary</h1>
    <p>Bowling Green Civic Assembly organized a conversation on improving Bowling Green and Warren County, involving 1,585 participants casting 225,608 votes.</p>
    <p>Topics included internet access, fairness ordinances, traffic flow, housing, and arts in education.</p>
    <p>The participants divided into two distinct groups: Group A (755 members), who <span class="sidenote" data-sidenote="20: It is embarrassing that our city is the largest in the state not to have a fairness ordinance (A - 80% agreed) [agreed].">prioritized fairness ordinances</span>, and Group B (830 members), who <span class="sidenote" data-sidenote="200: Bowling Green needs more competitive cable rates (B - 78% agreed) [agreed].">focused on improving cable rates</span>.</p>
    <p>Despite these divisions, strong consensus emerged around key issues, such as the importance of arts in education (<span class="sidenote" data-sidenote="21: The arts are an important component of K-12 education (all - 75% agreed) [agreed].">75% agreed</span>) and the need for improved internet access (<span class="sidenote" data-sidenote="64: More choices when it comes to internet. BGMU has been offering service to businesses for a while; they should expand to residents (all - 77% agreed) [agreed].">77% agreed</span>).</p>
    <p>Areas of significant uncertainty included city planning initiatives, such as the role of non-compete clauses in driving economic hardship (<span class="sidenote" data-sidenote="48: City should bar non-competes, similar to North Dakota/California, as driving destitution in non-tenure workforce (all - 74% passed) [passed].">74% passed</span>).</p>
  </article>
</body>
</html>
```
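The quotes GPT-4 generated are not lost, only stashed in `data-sidenote` attributes that tufte-css never renders (its sidenotes expect a `label`/`input`/`span` pattern with the note as visible text content). As a sketch of how recoverable they are, a few lines of stdlib Python can pull them back out; the class name here is ours, not anything from the thread:

```python
# Sketch: recover the quotes hidden in data-sidenote attributes of the
# GPT-4 output above. Uses only the stdlib HTML parser.
from html.parser import HTMLParser

class SidenoteExtractor(HTMLParser):
    """Collect the value of every data-sidenote attribute seen."""
    def __init__(self):
        super().__init__()
        self.sidenotes = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-sidenote" and value:
                self.sidenotes.append(value)

extractor = SidenoteExtractor()
extractor.feed(
    '<span class="sidenote" data-sidenote="200: Bowling Green needs more '
    'competitive cable rates (B - 78% agreed) [agreed].">'
    "focused on improving cable rates</span>"
)
print(extractor.sidenotes)
```

So the comments are present in the markup; the failure is purely one of rendering, which is why the summary looks quote-free in a browser.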
Feedback from @DZNarayanan on the stability of evaluations in the Bowling Green example: while the output appears stable to a general eye, to a specialist who knows that conversation well the several summaries actually show quite a lot of variability in what they put forward, and they are not all aligned with what the human-written report emphasized.
This points to the need for thorough evaluation, both qualitative and quantitative.
As discussed with @colinmegill and @DZNarayanan: unlike PCA and k-means, LLMs do not come with formal guarantees, so we have to be deliberate about how we use them and about what empirical guarantees we require before relying on them.
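One crude way to make "stable across calls" quantitative is to score repeated summaries of the same report against each other. The sketch below uses word-level Jaccard overlap averaged over all pairs; the metric, tokenization, and example strings are illustrative assumptions on our part, not a method anyone in this thread has committed to:

```python
# Sketch of a quantitative stability check: mean pairwise word-overlap
# (Jaccard similarity) between repeated LLM summaries of one report.
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two summaries."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def stability(summaries: list[str]) -> float:
    """Mean pairwise similarity across repeated LLM calls."""
    pairs = list(combinations(summaries, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical outputs from two calls on the same dataset.
runs = [
    "participants split into two groups over fairness ordinances",
    "participants split into two groups over cable rates",
]
print(round(stability(runs), 2))
```

A specialist review like @DZNarayanan's would still be needed on top of this: surface-level overlap can be high even when the summaries foreground different substantive points.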
This issue is a feature! Append the raw text (copy and paste) of any automatically generated Polis report after this prompt.
Here's a report to test! https://pol.is/report/r7bhuide6netnbr8fxbyh
Instructions