GPT Statement cleaning - Githubissues

Watts-Lab / commonsense-statements

GPT Statement cleaning #9

Open markwhiting opened 9 months ago

markwhiting commented 9 months ago

Run a multi-stage pipeline that keeps only very clean statements.

They should be clear and make sense.

Pipeline:

[ ] Ask GPT to filter generally
[ ] Filter for strange proper nouns
[ ] Filter for normal sentences
[x] Convert all names to gender-neutral names, e.g., Max, Alex, Sam
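
A minimal sketch of how the stages above could be chained, assuming each stage is a plain function that returns the (possibly rewritten) statement, or None to drop it. The stage functions, the `NEUTRAL_NAMES` mapping, and the word-count check are hypothetical placeholders rather than the pipeline actually used; the GPT-based filters are not shown and would slot in as further stages.

```python
import re
from typing import Callable, Iterable, List, Optional

# A stage returns the (possibly rewritten) statement, or None to drop it.
Stage = Callable[[str], Optional[str]]

# Hypothetical mapping for illustration; a real name list would be much longer.
NEUTRAL_NAMES = {"jake": "Alex", "john": "Sam", "mary": "Max"}


def is_normal_sentence(statement: str) -> Optional[str]:
    """Crude stand-in for the 'normal sentences' filter: keep statements of 3+ words."""
    return statement if len(statement.split()) >= 3 else None


def neutralize_names(statement: str) -> Optional[str]:
    """Replace known first names with gender-neutral ones (case-insensitive match)."""
    def swap(match):
        word = match.group(0)
        return NEUTRAL_NAMES.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, statement)


def run_pipeline(statements: Iterable[str], stages: List[Stage]) -> List[str]:
    """Pass each statement through every stage in order; stop once a stage returns None."""
    cleaned = []
    for statement in statements:
        current: Optional[str] = statement
        for stage in stages:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            cleaned.append(current)
    return cleaned
```

Keeping every stage behind the same signature means a stage can be added, removed, or reordered without touching the others, which should help with the low-maintenance goal.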

Meta tasks:

[ ] Do some testing to compare the commonsensicality and labels of statements before and after cleaning
[ ] Commit the pipeline as an action in the statements repo (any setup is fine, as long as it lets us update continuously and is low effort to maintain)
[ ] Update statements repo

amirrr commented 6 months ago

Let's have a table with the cleanest statements from the GPT pipeline, with these columns:

statement (text - id)
design point
quantile rank within design point (the statement's commonsensicality quantile among statements in the same design point)
source
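
A rough sketch of how such a table could be built, assuming each statement already has a numeric commonsensicality score and a design point label. The field names, and the definition of quantile rank as the share of statements in the same design point scoring at or below the statement, are assumptions for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StatementRow:
    statement_id: int
    statement: str
    design_point: str         # assumed to be a string label
    commonsensicality: float  # assumed numeric score, higher = more commonsensical
    source: str
    quantile_rank: float = 0.0  # filled in by add_quantile_ranks


def add_quantile_ranks(rows: List[StatementRow]) -> List[StatementRow]:
    """Set quantile_rank to the share of statements in the same design point
    whose score is less than or equal to this statement's score."""
    scores_by_point: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        scores_by_point[row.design_point].append(row.commonsensicality)
    for row in rows:
        scores = scores_by_point[row.design_point]
        at_or_below = sum(s <= row.commonsensicality for s in scores)
        row.quantile_rank = at_or_below / len(scores)
    return rows
```

With this definition the top-scoring statement in each design point gets a rank of 1.0, so picking the cleanest statements per design point amounts to filtering on a rank threshold.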

joshnguyen99 commented 5 months ago

Here's what I think will work. Proper nouns and names can be detected by spaCy, even when they're in lower case.

For example, take the statement with ID = 1361: "if jake considers john's example he would become the strongest and fittest person ever". If I run

import spacy
nlp = spacy.load("en_core_web_trf")
doc = nlp("if jake considers john's example he would become the strongest and fittest person ever")
for tok in doc:
    if tok.pos_ == "PROPN":
        print(tok)

The output would be

jake
john

From this we can do one of two things:

Apply a simple (heuristic) rule to capitalize these proper nouns.
Do it ourselves. I'm guessing the number of statements won't be too large for us to handle manually for this 4k corpus.
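
A minimal sketch of the first option, assuming the rule is simply "upper-case the first character of every token tagged PROPN". The helper names `proper_noun_offsets` and `capitalize_at` are made up for this example; `tok.idx` is the token's character offset in the original text.

```python
from typing import Iterable, List


def proper_noun_offsets(doc) -> List[int]:
    """Character offsets of the tokens that a spaCy Doc tags as proper nouns."""
    return [tok.idx for tok in doc if tok.pos_ == "PROPN"]


def capitalize_at(text: str, offsets: Iterable[int]) -> str:
    """Upper-case the character at each offset, leaving the rest of the text untouched."""
    chars = list(text)
    for i in offsets:
        chars[i] = chars[i].upper()
    return "".join(chars)
```

With the `nlp` and `doc` objects from the snippet above, `capitalize_at(doc.text, proper_noun_offsets(doc))` would turn jake and john into Jake and John, provided the tagger labels them as PROPN as in the output shown. Working on character offsets rather than re-joining tokens keeps the original spacing and punctuation intact.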

markwhiting commented 5 months ago

Cool. I think removing them with GPT is mostly OK. The exception is when it removes too much, which tends to make the statement meaningless, but in most of those cases the statement is one we should drop anyway. E.g., "Florida is a nice place" → "this state is a nice place", and neither of those is a particularly useful statement. So I think what Amir is doing is sufficient, but we may need to keep thinking about how to refine our filtering.