compdemocracy / polis

:milky_way: Open Source AI for large scale open ended feedback
https://pol.is
GNU Affero General Public License v3.0
782 stars 186 forks

Zero shot polis report to tldr summary #1842

Open colinmegill opened 4 days ago

colinmegill commented 4 days ago

This issue is a feature! Append the raw text (copy and paste) of any automatically generated polis report after this prompt.

sji-tldr

Here's a report to test! https://pol.is/report/r7bhuide6netnbr8fxbyh

Instructions

  1. Copy prompt into text editor
  2. Visit the report (pol.is/report/foo), select all, copy, and paste it after # BEGIN DATA
  3. Paste into Claude
  4. View rendered website as Claude Artifact
  5. Quality check the comments in side notes for accuracy and remove hallucinations
  6. Share

# TASK: 

Your task is to create a webpage containing a 5-sentence summary of a pol.is conversation, with each sub-clause of each sentence generously cited (minimum 1, maximum 5 data points) from the data itself as a side note. You will be given a schema, per paragraph, and the raw data.

You must use tufte.css for the citation side notes; here is a valid URL to use. Constrain the width of the article to something like 40em.

<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/tufte-css/1.8.0/tufte.min.css">

Put the full text of each comment and the votes per group into a side note. 

Sidenotes should be used to provide the data points that support the main text. Only comments, with their id, and the number of votes per group should be included in the sidenotes. Sidenotes should only be used for comments and votes, nothing else. 

DO NOT GENERATE ANY OTHER TEXT WHATSOEVER. ONLY THE DATA GIVEN IN THE SCHEMA AND DATA SECTIONS SHOULD BE USED TO GENERATE THE HTML.

# Vocabulary
- conversation_owner: the person who ran the pol.is conversation
- topic: the subject of the pol.is conversation
- description: a brief description of the pol.is conversation
- participants: the people who participated in the pol.is conversation (only use this term, not community or anything else)

# BEGIN SCHEMA

# GLOBAL RULES
# 1. Side notes must ONLY contain the raw comment text and vote percentages from the data
# 2. All descriptive text and analysis must be in the main narrative
# 3. Do not include metadata or summaries in side notes
# 4. ONLY the last three sentences should have side notes: the group divisions, the consensus points, and the areas of uncertainty
# 5. Each side note should be formatted as: "[id]: [comment_text] ([group_letter OR all] - [percentage]%) [agreed/disagreed]", where [group_letter OR all] is the letter of the group that voted on the comment, or "all" when the percentage covers all participants. For example: "1: This is a comment (A - 50% agreed)"
# 6. Odd numbered side notes should be on the left, even numbered side notes should be on the right
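For reference, tufte.css renders a side note from a numbered label, a hidden checkbox toggle, and a `span.sidenote`. A minimal sketch of one citation side note following rule 5 (the comment id, text, and percentage here are illustrative, and the `sn-1` id is an arbitrary choice):

```html
<!-- Sidenote number that appears inline in the sentence -->
<label for="sn-1" class="margin-toggle sidenote-number"></label>
<!-- Hidden checkbox that toggles the note on narrow screens -->
<input type="checkbox" id="sn-1" class="margin-toggle"/>
<!-- The side note body: raw comment text and votes only, per the global rules -->
<span class="sidenote">1: This is a comment (A - 50% agreed)</span>
```

Each `label`/`input` pair must share a unique `id`, and the note text goes in the `span` immediately after them so tufte.css can position it in the margin.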

# SENTENCE 1: Introduction & Scope

{
    sentence_template: "[conversation_owner] [topic] [description] with [participant_count] participants casting [vote_count] votes",
    required_data: {
        conversation_owner: string,
        topic: string,
        description: string,
        scope_numbers: {
            quantity: number,
            unit: string,
            subset: {
                quantity: number,
                descriptor: string
            }
        },
        participant_count: number,
        participants_descriptor: string,  // e.g., "participants", "community members"
        vote_count: number
    }
}

# SENTENCE 2: Comprehensive Topics
### General instructions for this sentence: be very detailed. This sentence should be the longest and most detailed of all.

"Topics included [all_topics]. "

{
    all_topics: string[];  // e.g., ["climate change", "public transportation", "housing affordability"]
}

# SENTENCE 3: Group Division
1. The paragraph text should describe how each group is differentiated from the others.
2. Each group should be described by at least one comment / citation, which appears inside of the sentence after that group's description clause.
3. When talking about a group, use words like "differentiated from the others", "unique", "distinct", etc., the shorter the better. "Shared the view" is fine too. But the main point is to show how each group is different from the others, and your transition words need to emphasize that. "While" and "whereas" are good transition words as you move between groups.
4. When you select comments that differentiate a group, consider the semantics in the light of the entire conversation. Which comments are the most representative of the group's views RELATIVE to the other groups?
5. When you refer to each group, use the letter and the size. You MAY give a group a nickname like "pragmatists (C, N members)" or "idealists (D, N members)" and refer to them by that nickname in the sentence IF AND ONLY IF you are VERY CONFIDENT about their personality from the comments that differentiated them. Otherwise reference the number, e.g., "the largest group (A, N members)" is fine.

The participants divided into [group_count] distinct groups: [group_descriptions]
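For illustration only (a sketch, not part of the prompt), one group clause with its inline citation might render like this, reusing a comment and group sizes from the Bowling Green sample further down this thread:

```html
<p>The participants divided into 2 distinct groups: the largest group (B, 830 members)
  focused on competitive cable rates,<label for="sn-2" class="margin-toggle sidenote-number"></label><input
  type="checkbox" id="sn-2" class="margin-toggle"/><span class="sidenote">200: Bowling Green
  needs more competitive cable rates (B - 78% agreed)</span> whereas group A (755 members)
  prioritized a fairness ordinance.</p>
```

Note the citation sits immediately after the clause it supports, and the side note carries only the comment id, text, and group vote, per the global rules.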

# SENTENCE 4: Consensus Areas
1. Pick comments which are substantive. Comments that are more general or less specific should be avoided. Comments with heft and weight which gain consensus should be prioritized in your selection.
2. When summarizing consensus, you MUST NOT generalize to an entire area or topic, but stay specifically with the comments that had consensus across the groups. The comment text itself is of absolute importance, because that is what the participants agreed upon. DO NOT suggest that there is more consensus than there actually is by generalizing. DO NOT GENERALIZE. YOU MUST BE SPECIFIC.
3. When suggesting a consensus comment, you MUST NOT select a comment which differentiates a group. The definition of "group informed consensus" in polis is not "majority", it is "majority across all groups". So, if a comment is divisive, it cannot be a consensus comment.

{
    consensus_template: "Despite this division, strong consensus emerged around key issues, with [consensus_points]",
    required_data: {
        consensus_points: [{
            topic: string,
            percentage: number,
            statement_id: number,
            sentiment: string  // e.g., "supporting", "opposing"
        }],
        threshold: number,  // e.g., 75
        connector_phrases: [string]  // e.g., "over X% of participants"
    }
}

# SENTENCE 5: Areas of Uncertainty
{
    uncertainty_template: "Areas of significant uncertainty included [uncertainty_list] [special case given conversation topic and context]",
    required_data: {
        uncertainty_areas: [{
            topic: string,
            uncertainty_type: {
                type: string,  // e.g., "pass", "split", "contested"
                percentage: number,
                statement_id: number
            }
        }],
        special_case: {
            topic: string,
            sentiment: string,
            percentage: number,
            statement_id: number
        }
    }
}

# VALIDATION RULES
1. Each sentence must:
   - Contain at least one quantifiable metric
   - Reference specific statement IDs
   - Connect logically to adjacent sentences
2. Maximum word counts per sentence:
   - Sentence 1: 30 words
   - Sentence 2: 55 words
   - Sentence 3: 55 words
   - Sentence 4: 35 words
   - Sentence 5: 35 words

# BEGIN DATA
jucor commented 4 days ago

Remarkably stable across multiple calls to Claude 3.5 Sonnet on any given dataset! GPT-4 and Gemini are not super great, but possibly because the prompt was iterated against Claude 3.5.

jucor commented 4 days ago

TLDR sanity check "evaluation":

Harder to evaluate but needed: evaluating coverage

jucor commented 4 days ago

Examples of stability across calls to Claude 3.5 Sonnet on the Bowling Green report:

[screenshots: four example outputs]

and on New Zealand report

[screenshot]

jucor commented 4 days ago

Example of Gemini Advanced output not meeting the expectation (with the caveat that the prompt was developed against Claude 3.5 Sonnet, so it is not entirely shocking that it doesn't port): [screenshot]

And an example from GPT-4, same caveat: bad formatting, and it has quotes, but they are hidden due to bad HTML (seen below). [screenshot] HTML from ChatGPT:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Pol.is Conversation Summary</title>
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/tufte-css/1.8.0/tufte.min.css">
    <style>
        article {
            margin: auto;
            max-width: 40em;
        }
    </style>
</head>
<body>
    <article>
        <h1>Pol.is Conversation Summary</h1>
        <p>Bowling Green Civic Assembly organized a conversation on improving Bowling Green and Warren County, involving 1,585 participants casting 225,608 votes.</p>
        <p>Topics included internet access, fairness ordinances, traffic flow, housing, and arts in education.</p>
        <p>The participants divided into two distinct groups: Group A (755 members), who <span class="sidenote" data-sidenote="20: It is embarrassing that our city is the largest in the state not to have a fairness ordinance (A - 80% agreed) [agreed].">prioritized fairness ordinances</span>, and Group B (830 members), who <span class="sidenote" data-sidenote="200: Bowling Green needs more competitive cable rates (B - 78% agreed) [agreed].">focused on improving cable rates</span>.</p>
        <p>Despite these divisions, strong consensus emerged around key issues, such as the importance of arts in education (<span class="sidenote" data-sidenote="21: The arts are an important component of K-12 education (all - 75% agreed) [agreed].">75% agreed</span>) and the need for improved internet access (<span class="sidenote" data-sidenote="64: More choices when it comes to internet. BGMU has been offering service to businesses for a while; they should expand to residents (all - 77% agreed) [agreed].">77% agreed</span>).</p>
        <p>Areas of significant uncertainty included city planning initiatives, such as the role of non-compete clauses in driving economic hardship (<span class="sidenote" data-sidenote="48: City should bar non-competes, similar to North Dakota/California, as driving destitution in non-tenure workforce (all - 74% passed) [passed].">74% passed</span>).</p>
    </article>
</body>
</html>
jucor commented 4 days ago

Feedback from @DZNarayanan on the stability of evaluations in the Bowling Green example: while it appears stable to a general eye, for a specialist who knows that conversation quite well, the several summaries actually show quite a lot of variability in what they put forward, and are not all aligned with what was actually put forward in the human report.

This points to the need for thorough evaluation, both qualitative and quantitative.

As discussed with @colinmegill and @DZNarayanan, we know that LLMs do not come with formal guarantees, unlike PCA and k-means, so we have to be deliberate about how we use them and about what empirical guarantees we require before using them.