compdemocracy / polis

:milky_way: Open Source AI for large scale open ended feedback
https://pol.is
GNU Affero General Public License v3.0
753 stars 173 forks

Integrate language models #507

Open colinmegill opened 4 years ago

colinmegill commented 4 years ago
patcon commented 4 years ago

More potential use-cases

patcon commented 4 years ago

Came across another blue-sky use-case:

colinmegill commented 4 years ago

I did a quick count on GPT-3's tagging of the comments in the Bowling Green dataset: https://gist.github.com/colinmegill/7714eb0962573346b210aa989e14dadf

Here are the human generated categories:

```
Traffic and Transportation
Zoning and Development
Cable and Broadband
Government Accountability
Jobs and Training
Community Services
Groceries
Trash and Littering
Liquor Laws
Homelessness
Metro Governance
Local Schools
WKU Budget Crisis
Policing
Drug Sentencing
Rent Laws
Wages
Immigration and Refugees
Marijuana Legalization
Fairness Ordinance (LGBTQ rights)
```

GPT-3 generated categories, and the number of times GPT-3 used the categories across all comments:

``` const bowlingGreenCommentTopicsGPT3 = { infrastructure: 96, "": 94, transportation: 51, education: 44, schools: 37, drugs: 29, publicsafety: 27, public: 27, environment: 24, traffic: 22, housing: 22, crime: 19, food: 18, animals: 16, health: 16, arts: 15, jobs: 15, parks: 15, zoning: 15, poverty: 15, government: 15, roads: 13, internet: 13, recreation: 12, homelessness: 12, taxes: 12, politics: 12, downtown: 12, parking: 11, opiods: 11, economy: 11, safety: 9, community: 9, lawenforcement: 8, business: 8, development: 8, tourism: 8, youth: 7, police: 7, recycling: 7, sports: 7, technology: 7, immigration: 6, marijuana: 6, landuse: 6, trees: 6, healthcare: 6, economicdevelopment: 6, restaurants: 5, lgbt: 5, neighborhoods: 5, elections: 4, biking: 4, city: 4, athletics: 4, minimumwage: 4, sidewalks: 4, equality: 4, growth: 4, localgovernment: 4, emergencyservices: 3, pets: 3, airports: 3, science: 3, urbansprawl: 3, swimming: 3, urban: 3, communications: 3, family: 3, publictransit: 3, culture: 3, smoking: 3, discrimination: 3, waste: 3, music: 3, voting: 3, litter: 3, entertainment: 3, airport: 3, smallbusiness: 3, wages: 3, children: 3, work: 3, economics: 3, businesses: 3, race: 3, roadways: 3, trafficcongestion: 3, nonprofits: 2, seniors: 2, workforce: 2, historicpreservation: 2, cable: 2, language: 2, trafficsafety: 2, religion: 2, realestate: 2, opioids: 2, county: 2, media: 2, artsandculture: 2, commerce: 2, broadband: 2, farmland: 2, policy: 2, land: 2, taxation: 2, housingaffordability: 2, groceries: 2, publicschools: 2, propertyrights: 2, libraries: 2, regulation: 2, transit: 2, pollution: 2, aging: 2, agriculture: 2, gambling: 2, doctors: 2, bicycles: 2, publicworks: 2, property: 2, "community-and-soc": 2, travel: 2, refugee: 2, immigrants: 2, policing: 2, communitydevelopment: 2, workers: 2, environmentalprotection: 2, law: 2, design: 2, homeless: 2, employment: 2, social: 2, hunger: 2, courts: 2, civilrights: 2, shopping: 2, retirement: 2, labor: 2, 
art: 2, nutrition: 2, wastemanagement: 2, civil: 2, budget: 2, railroads: 2, charities: 1, enviroment: 1, civicengagement: 1, youthcenters: 1, "after-": 1, stateregulations: 1, square: 1, hospitality: 1, telecommunications: 1, engineering: 1, trafficandparkingarts: 1, welfare: 1, animalwelfare: 1, cornerstones: 1, unitedschooldistrict: 1, drugpolicy: 1, votersguide: 1, flooding: 1, pumping: 1, "land-use": 1, garbage: 1, san: 1, animal: 1, ecology: 1, responsible: 1, lgbtqai: 1, physicaleducation: 1, rentalproperties: 1, railroad: 1, veterans: 1, growthdevelopment: 1, medicene: 1, gvm: 1, unions: 1, universities: 1, dining: 1, heroin: 1, op: 1, seating: 1, meetings: 1, mentoring: 1, weather: 1, sexual: 1, healthyfood: 1, license: 1, driving: 1, press: 1, publictransportation: 1, urbandecay: 1, genderrelations: 1, fratern: 1, codeenforcement: 1, after: 1, landlordandtenant: 1, rural: 1, congresman: 1, federalgovernment: 1, pools: 1, louisvillemetrocouncildistrict16: 1, identification: 1, solarpower: 1, fluoridation: 1, dvc: 1, farmersmarket: 1, green: 1, signage: 1, americanliberalartscolleges: 1, parksrecreation: 1, mlive: 1, seniorcitizens: 1, sexuality: 1, sexeducation: 1, termlimits: 1, birthcontrol: 1, mentalhealth: 1, psychology: 1, r16traffictownship: 1, construction: 1, farming: 1, wirelesstechnology: 1, drugepidemic: 1, greenspace: 1, assistance: 1, programs: 1, communityhealth: 1, studentlife: 1, nature: 1, pollutants: 1, buildings: 1, walmarts: 1, separationofchurch: 1, performingarts: 1, publicwork: 1, roadculture: 1, publiceducation: 1, publicparks: 1, affordablefood: 1, fooddeserts: 1, riverfront: 1, fiber: 1, airquality: 1, cityschools: 1, hollystreet: 1, carstrucks: 1, hivaids: 1, longtermplanning: 1, businesssignage: 1, localbusiness: 1, elders: 1, stds: 1, gender: 1, lgbtq: 1, rail: 1, planning: 1, pestcontrol: 1, cityofhouston: 1, animalcontrol: 1, campus: 1, fuel: 1, gasoline: 1, "infrastructure-design": 1, sexcrimes: 1, fraud: 1, mis: 1, college: 
1, anxiety: 1, university: 1, cooperation: 1, exercise: 1, "community-development": 1, casinos: 1, theater: 1, cleanliness: 1, election: 1, ballot: 1, landlordten: 1, noise: 1, ordinances: 1, rentcontrol: 1, animalrights: 1, bettercity: 1, recyclingscript: 1, buses: 1, holidays: 1, halloween: 1, suburb: 1, econom: 1, airline: 1, greenspaces: 1, history: 1, bigotry: 1, buildinginspection: 1, warren: 1, winterrecreation: 1, criminaljustice: 1, publichealth: 1, barbaranat: 1, unaffiliatedindividual: 1, incentives: 1, museums: 1, publicservice: 1, trans: 1, racialequality: 1, workplace: 1, governmentreform: 1, kansasassetverificationprogram: 1, salaries: 1, populationgrowth: 1, jobgrowth: 1, redistricting: 1, redist: 1, taxrevenues: 1, salud: 1, alcohol: 1, drinkingage: 1, publicofficials: 1, tspl: 1, highereducation: 1, tuition: 1, newsmedia: 1, foodsecurity: 1, medicine: 1, humanservices: 1, hou: 1, addiction: 1, faith: 1, current: 1, internetservice: 1, computers: 1, waterfront: 1, environmental: 1, climate: 1, senior: 1, rentalhousing: 1, wcl: 1, teachers: 1, consumerissues: 1, localgoverment: 1, fairnessordinance: 1, civildiscourse: 1, socialmedia: 1, chain: 1, wholefoods: 1, theft: 1, schoolfunding: 1, tech: 1, innovation: 1, climatechange: 1, bulling: 1, opportunity: 1, benefits: 1, energy: 1, alternative: 1, oncampus: 1, teens: 1, research: 1, unemployment: 1, massmedia: 1, access: 1, theeconomy: 1, broadcasting: 1, mediadiversity: 1, "ex-offenders": 1, "city-council": 1, suburbs: 1, springfield: 1, "public-schools": 1, reproductivehealth: 1, women: 1, traffictickets: 1, connectivity: 1, homeimprovement: 1, revitalization: 1, tax: 1, care: 1, ambulance: 1, cte: 1, fireworks: 1, publichousing: 1, earlychildhood: 1, regulations: 1, tennis: 1, outdoor: 1, thoughtsonsub: 1, disability: 1, bgpd: 1, planningand: 1, utilitiesann: 1, corruption: 1, "community-organizing": 1, fitness: 1, pedestrians: 1, jobcreation: 1, greenenergy: 1, "new-park": 1, urbanites: 1, 
obesity: 1, sugar: 1, neighborhoodcenters: 1, kids: 1, water: 1, k: 1, foodaccess: 1, nursing: 1, adjustable: 1, privacy: 1, trafficlights: 1, blight: 1, fees: 1, immediatehelp: 1, justicesystem: 1, plano: 1, utilities: 1, abuse: 1, sanitation: 1, curriculum: 1, greenfield: 1, drugaddiction: 1, speedlimit: 1, heritagepreservation: 1, homes: 1, goodoleboys: 1, consumerprotection: 1, c: 1, leisure: 1, refugees: 1, wireless: 1, diversity: 1, charity: 1, citygovernment: 1, assistanacetoneedy: 1, play: 1, bicy: 1, cities: 1, sp: 1, smallcities: 1, "courts-and": 1, workforcedevelopment: 1, crimeandcriminals: 1, localcontrol: 1, walking: 1, disabilities: 1, citycouncil: 1, lexington: 1, rentcontrols: 1, bicycling: 1, projectlist: 1, consumer: 1, socialactivities: 1, prisons: 1, eco: 1, corporations: 1, livablecities: 1, racial: 1, governmentpolicy: 1, renting: 1, planningzoning: 1, executivebranch: 1, humanrights: 1, fairness: 1, inequality: 1, rehabilitation: 1, release: 1, rapid: 1, events: 1, taxincentives: 1, uns: 1, governmentcorruption: 1, density: 1, bowlinggreen: 1, advertising: 1, parksandrec: 1, new: 1, fooddesert: 1, clothing: 1, textbooks: 1, sewerandwater: 1, medicalmarijuana: 1, renters: 1, civ: 1, emergencymanagement: 1, propertyvalues: 1, recreational: 1, stpetersburg: 1, future: 1, townhomes: 1, }; ```
colinmegill commented 4 years ago

Related:

colinmegill commented 3 years ago

Narrative summary target cc @micahstubbs

https://twitter.com/bengglover/status/1364915200248446980

colinmegill commented 3 years ago

Same statement already submitted in different words

NewJerseyStyle commented 3 years ago

I saw this model (similar to GPT-2) claiming it can do unsupervised clustering of texts (for Simplified Chinese):

https://github.com/TsinghuaAI/CPM

For Traditional Chinese, it is possible to convert the text to Simplified Chinese characters for CPM to process: https://github.com/BYVoid/OpenCC

I wonder if it would work. But I don't know whether I can download the CPM model; will they say we are a "foreign power" and restrict our access...?

Oh... maybe CKIP is a better option then. https://ckip.iis.sinica.edu.tw/

NewJerseyStyle commented 3 years ago

I did a quick count on GPT-3's tagging of the comments in the Bowling Green dataset: https://gist.github.com/colinmegill/7714eb0962573346b210aa989e14dadf

Here are the human generated categories:

...

GPT-3 generated categories, and the number of times GPT-3 used the categories across all comments:

...
...

Hi @colinmegill, I wonder... were those GPT-3 tags generated automatically, without any training/fine-tuning?

colinmegill commented 2 years ago

https://twitter.com/gdb/status/1486049720237629440

https://openai.com/blog/introducing-text-and-code-embeddings/

colinmegill commented 2 years ago

https://towardsdatascience.com/use-cases-of-googles-universal-sentence-encoder-in-production-dd5aaab4fc15#:~:text=The%20Universal%20Sentence%20Encoder%20encodes,and%20other%20natural%20language%20tasks.&text=It%20comes%20with%20two%20variations,Deep%20Averaging%20Network%20(DAN).

colinmegill commented 2 years ago

https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder

The Universal Sentence Encoder makes getting sentence level embeddings as easy as it has historically been to lookup the embeddings for individual words. The sentence embeddings can then be trivially used to compute sentence level meaning similarity as well as to enable better performance on downstream classification tasks using less supervised training data.
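To make the quoted idea concrete, here is a minimal sketch of turning sentence embeddings into pairwise similarity scores. The random vectors below are stand-ins for real USE output (the comments, model URL, and shapes are assumptions, not code from the tutorial):

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row-vector sentence embeddings."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T

# Stand-in for real USE output: in practice `embeddings` would come from
# hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")(comments)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 512))  # 4 comments, 512-dim (USE's output width)

sim = cosine_similarity_matrix(embeddings)  # sim[i, j] ~ semantic similarity
```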


colinmegill commented 2 years ago

https://twitter.com/Nils_Reimers/status/1487014195568775173

NewJerseyStyle commented 2 years ago

https://twitter.com/Nils_Reimers/status/1487014195568775173

I wonder why GPT-3 is used for encoding things; isn't it the "decoder" part of a transformer? I would use BERT-family models for encoding, since they are the "encoder" part of a transformer. (If I understand correctly, both are trained with a full transformer but keep a different part as the outcome.)

colinmegill commented 2 years ago

https://www.kaggle.com/kirankunapuli/universalsentenceencoderlarge5

because of its license, performance, and size

metasoarous commented 2 years ago

What I've been thinking about in relation to this: we currently weigh 100 comments about a particular topic within a conversation the same as 100 comments on totally different (sub)topics of that conversation. This doesn't seem quite right, and poses a target for manipulating/biasing the results towards a particular set of dimensions. If we were taking semantic similarity between comments into account, we could potentially correct for this.

(Digression: There's possibly an argument that number of comments on a topic is signal, so perhaps this correction could be quadratic, or more nuanced, by specifically looking for coordinated behavior, though this could require quite a bit of sophistication, and be difficult to verify the effectiveness of.)
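A minimal sketch of the correction being described, assuming each comment has already been given a hypothetical topic label (e.g. from semantic clustering). The exponent parameter gestures at the "quadratic or more nuanced" option from the digression:

```python
from collections import Counter

def topic_weights(topic_labels, exponent=1.0):
    """Down-weight comments from heavily represented (sub)topics.

    exponent=1.0 equalizes topics entirely; something like exponent=0.5
    is a softer correction that still treats comment volume as partial signal.
    """
    counts = Counter(topic_labels)
    return [1.0 / (counts[label] ** exponent) for label in topic_labels]

# Three traffic comments collectively carry the same weight as one housing comment.
labels = ["traffic", "traffic", "traffic", "housing"]
weights = topic_weights(labels)  # [1/3, 1/3, 1/3, 1.0]
```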

It's been a real strength of the system thus far that it has relied 0% on NLP or language models, which has kept us robust and accessible for other languages. This would be a departure from that for us, and mean more work differentiating our processing between languages for which we have models and those for which we don't. But this may be a price worth paying.

patcon commented 2 years ago

It's been a real strength of the system thus far that it has relied 0% on NLP or language models

+1. Also, less room for skepticism. Democracy and deliberative assembly work in many ways because they are [or seem] mostly simple/comprehensible (rightly or wrongly) -- it's counting or talking. Even the simple dimensional reduction of PCA has come up many times in casual discussion (with regular people, about Polis) as something they worry is too complex to depend on. I imagine that more complex roles for the black box of machine learning models could create a significant wedge of mistrust if not introduced thoughtfully/optionally.

colinmegill commented 2 years ago

I expect that polis will evaluate every major category of machine learning advancement as the field unfolds to consider how each might be applied toward making meaning of voices. But as the platform's core values include consistency, audit-ability and interpretability (much more work to do on interpretability, and I expect that to remain the case), I would expect anything deep learning related, like large language models, to be secondary or tertiary systems — not in the core flow of analysis, but as an assist, audited by human analysts. I think the best candidate for this is where the model is likely to be invisible, which is the case if we perform very well at selecting a set of higher-order categories and tags, something that clearly has to be done as a manual step after many conversations.

NewJerseyStyle commented 2 years ago

From a technical perspective, Universal Sentence Encoder (USE) and BERT are great; both include the encoder part of a transformer, which I think explains why they are better at encoding strings with a smaller model size compared to GPT.

I have read through a paper comparing different sentence-encoding models: http://ceur-ws.org/Vol-2431/paper2.pdf. USE seems to be about as good as BERT. There is a little trap in the paper's benchmark if we are going to give it a trial: simply copying and pasting may not work for us. The BM25 algorithm would be helpful for classifying comments into categories, but positive/negative feelings and opinions about right/wrong may be ignored. Therefore, semantic analysis of the comments may require another model (i.e., two models are needed: one to categorize comments and one to identify positive/negative, right/wrong, agree/disagree).
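For reference, BM25 is simple enough to sketch in plain Python. This is the standard Okapi BM25 formula over toy, made-up comments, not code from the paper; it illustrates why the scores say nothing about stance:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against query_terms with Okapi BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    df = Counter(term for d in docs for term in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    "the bypass would ease traffic congestion".split(),
    "we need better public schools".split(),
    "traffic on the bypass is terrible".split(),
]
scores = bm25_scores(["traffic", "bypass"], docs)
# The two traffic-related comments outscore the schools comment, but nothing
# here captures whether each author supports or opposes the bypass.
```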




NewJerseyStyle commented 2 years ago

However, from a product point of view, it is true that people who are aware of this data collection may raise questions about privacy and robustness. Humans are afraid of things they do not understand.

I do understand, so I am open to it. But I am not sure about others.

As for R&D: I came across this machine learning package, and it seems interesting: https://microsoft.github.io/FLAML/ https://microsoft.github.io/FLAML/docs/Examples/AutoML-NLP



metasoarous commented 2 years ago

Even the simple dimensional reduction of PCA has come up many times in casual discussion (with regular people about polis) as something they worry is too complex to depend on

I suppose we should just go back to using Survey Monkey then? :wink:

Seriously though, we're well past the point of "let's not put things in the system that the average person isn't familiar with or doesn't understand". If people want deliberative bodies without tech, they can do traditional citizen assemblies. Polis' explicit goal is scalable deliberation, and there's no way to get there without computational intelligence or machine learning.

I expect that polis will evaluate every major category of machine learning advancement as the field unfolds to consider how each might be applied [toward] making meaning of voices, but as the platform's core values include consistency, audit-ability and interpretability (much more work to do on interpretability, and expect that to be the case ongoing), I would expect anything deep learning related, like large language models, to be secondary or tertiary systems — not in the core flow of analysis, but as an assist, audited by human analysts

This :point_up: is how we want to be thinking about these decisions. We have to balance the goal of scalable meaning making with the demands of "consistency, audit-ability and interpretability".

I agree with Colin that NLP is unlikely to make it into the core system initially (or for some time, really). However, as we explore and better understand techniques like this, we may find ourselves realizing that we can do a better job of scalable meaning making if we apply new techniques in the core platform. Keep in mind that our dimension reduction affects the way we route comments to participants; if we realize our dimension reduction is improved by (e.g.) considering semantic encoding, and are able to demonstrate that comment routing would be greatly improved as a result, we might decide that it's worth integrating these methods as part of the core infrastructure. But we wouldn't do that without carefully evaluating the consequences and carefully explaining our decisions and rationale to the public.
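As a rough illustration of the dimension-reduction point, here is a toy PCA over a participant-by-comment vote matrix, with a comment marking where hypothetical semantic features could be appended before reducing. This is a sketch only, not the actual Polis pipeline:

```python
import numpy as np

def pca_project(matrix, n_components=2):
    """Project rows of a participant x comment vote matrix onto the top
    principal components via SVD (votes: agree=1, disagree=-1, pass=0)."""
    centered = matrix - matrix.mean(axis=0)
    # Hypothetical extension point: per-comment semantic features could be
    # appended here, e.g. np.hstack([centered, semantic_features]), so the
    # reduction reflects meaning as well as voting behavior.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

votes = np.array([
    [ 1,  1, -1,  0],
    [ 1,  1, -1,  1],
    [-1, -1,  1,  0],
    [-1, -1,  1, -1],
], dtype=float)

coords = pca_project(votes)  # 2-D opinion-space coordinates per participant
```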

metasoarous commented 2 years ago

@NewJerseyStyle Thanks for your thoughts on this.

BM25 algorithm will be helpful in classifying comments into categories, but the positive/negative feelings and the opinion about right/wrong may be ignored.

That's good to know, and matches what I've seen in a lot of the topic modelling results. Thankfully though, we have participant votes which we can use to sort out valence ("side" of the topic) :slightly_smiling_face:

If you are already computing the covariance matrix to do PCA, you could use it to tell, for any given pair of semantically similar comments, whether the responses tended to concur or not. This shouldn't be a problem for post-conversation analysis; but since we don't compute the covariance matrix in the core system, it may or may not be feasible as a core part of the system (it will at least require additional thinking).
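A small sketch of that pairwise check, using a toy vote matrix (agree=1, disagree=-1, pass=0; the data and encoding are illustrative assumptions, not the actual Polis data model):

```python
import numpy as np

def vote_concurrence(votes, i, j):
    """Covariance between vote columns i and j (agree=1, disagree=-1, pass=0).
    Positive: participants tend to respond to the two (semantically similar)
    comments the same way; negative: they tend to respond oppositely."""
    return np.cov(votes[:, i], votes[:, j])[0, 1]

votes = np.array([
    [ 1,  1, -1],
    [ 1,  1, -1],
    [-1, -1,  1],
    [-1, -1,  1],
], dtype=float)

same = vote_concurrence(votes, 0, 1)      # > 0: comments 0 and 1 concur
opposite = vote_concurrence(votes, 0, 2)  # < 0: comments 0 and 2 oppose
```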

NewJerseyStyle commented 2 years ago

@metasoarous I have some thoughts about scalable deliberation.

I have been thinking about it for some time, since the social movements in Hong Kong. I found that many comments we made online only expressed our feelings. In other words, most of the comments were redundant.

My question is then: how can we distil information in a way that lets people see the meaningful proposals made in comments, while the solution can still show the original texts as evidence of support for/opposition to the proposal/main topic?

The solution I found in my case is:

1. Introduce a model to do clustering/summarization, so readers won't be overwhelmed by millions of texts just telling you how much they support/oppose the idea/topic.
2. Instead of showing raw texts, show the counts of supporting and opposing comments. Collapse the raw texts by default.
3. Show comments containing a proposal on their own, instead of grouping them into a support/oppose comment group, so people can see, discuss and evaluate each proposal (and make it a new topic in the future? make it the next move?).

Just some ideas how we can apply language models. Cheers =)
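The three steps above could be sketched roughly as follows, assuming hypothetical upstream models have already assigned each comment a cluster label, a stance, and an is_proposal flag (all names and sample comments here are made up):

```python
from collections import defaultdict

def collapse_comments(comments):
    """Collapse non-proposal comments into per-cluster support/oppose counts
    and surface proposal comments on their own.

    Each comment dict carries a cluster label, a stance, and an is_proposal
    flag, all assumed to come from hypothetical upstream models."""
    counts = defaultdict(lambda: {"support": 0, "oppose": 0})
    proposals = []
    for c in comments:
        if c["is_proposal"]:
            proposals.append(c["text"])  # shown alone for discussion
        else:
            counts[c["cluster"]][c["stance"]] += 1  # raw text collapsed
    return dict(counts), proposals

comments = [
    {"text": "I support this", "cluster": "bypass", "stance": "support", "is_proposal": False},
    {"text": "Strongly agree", "cluster": "bypass", "stance": "support", "is_proposal": False},
    {"text": "Terrible idea", "cluster": "bypass", "stance": "oppose", "is_proposal": False},
    {"text": "Build the bypass east of town", "cluster": "bypass", "stance": "support", "is_proposal": True},
]
counts, proposals = collapse_comments(comments)
```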



NewJerseyStyle commented 2 years ago

A language model demo has been run on the 'american-assembly.bowling-green' comments.

The model read through all the comments and created a summary: Warren County's chief minister has announced a major shake-up of the county. Here is the full list of issues facing the city.

If only comments with moderated > 0 are given to the model as input, the summary is: Nashville needs a new bypass to improve traffic flow.

I wonder if this may be helpful. Later this week I will create a repo to share the source code and test results on different openData comments.

I also plan to create a Q/A demo: it will read all the comments and answer questions (input) based on its reading. Some days later.

TODOs:

👋 On reflection, I am not close enough to the development... Maybe these features have already been developed? Asking before I reinvent the wheel.

colinmegill commented 2 years ago

@NewJerseyStyle Not under development; this issue is still up to date and we are tracking it here. We'll be interested to follow along!

NewJerseyStyle commented 2 years ago

@colinmegill Do you still have the link to my Colab? You can preview the test there before I finish all the tests and the repo =)

colinmegill commented 2 years ago

I do :)

NewJerseyStyle commented 2 years ago

I do :)

Tips: Section Summarization ;)

NewJerseyStyle commented 2 years ago

A language model demo has been run on the 'american-assembly.bowling-green' comments.

The model read through all the comments and created a summary: Warren County's chief minister has announced a major shake-up of the county. Here is the full list of issues facing the city.

If only comments with moderated > 0 are given to the model as input, the summary is: Nashville needs a new bypass to improve traffic flow.

The public repository sharing the Python source code of the summarization model and its outputs: https://github.com/NewJerseyStyle/Polis-openData-Summarization

NewJerseyStyle commented 2 years ago

Google Colab : Question Answer demo on Polis openData

I did not find a free hosting service capable of hosting such a model that runs well... so I just left it as a Google Colab... :angel:

NewJerseyStyle commented 6 months ago

@colinmegill I met Seb at Civic Tech Toronto and came up with a new idea for applying language models:

Assumption:

The idea is to integrate pandas-ai: https://github.com/Sinaptik-AI/pandas-ai

NewJerseyStyle commented 6 months ago

@patcon Seb (we met at Civic Tech Toronto) mentioned having an autocomplete feature for typing comments, like Google search autocomplete, to reduce redundant comments. Is this an existing task on the todo list?

I think it would be good to show similar comments using both a Hamming-distance metric and a semantic-similarity metric.
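A minimal sketch of the two metrics: a character-level Hamming distance (padded so unequal lengths are defined) for surface similarity, and a plain cosine similarity over toy vectors standing in for a real sentence encoder's embeddings (the example strings and vectors are made up):

```python
import numpy as np

def hamming_distance(a: str, b: str) -> int:
    """Character-level Hamming distance; the shorter string is padded so
    the metric is defined for unequal lengths."""
    length = max(len(a), len(b))
    return sum(x != y for x, y in zip(a.ljust(length), b.ljust(length)))

def cosine_similarity(u, v) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Surface similarity catches near-duplicate wording...
surface = hamming_distance("we need more parks", "we need more parts")  # 1

# ...while semantic similarity catches same-meaning, different-wording pairs.
# These 3-d vectors are toy stand-ins for a real sentence encoder's output.
semantic = cosine_similarity([0.2, 0.9, 0.1], [0.3, 0.8, 0.2])
```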