Watts-Lab / commonsense-statements

GPT Statement cleaning #9

Open markwhiting opened 9 months ago

markwhiting commented 9 months ago

Run a multi-stage pipeline to keep only very clean statements.

They should be clear, and make sense.

Pipeline:

Meta tasks:

amirrr commented 6 months ago

Let's have a table of the cleanest statements from the GPT pipeline, with these columns:

  1. statement (text - id)
  2. design point
  3. quantile rank within design point (the statement's commonsensicality quantile among the statements at that design point)
  4. source
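
One way to compute column 3 is sketched below. This is only an illustration: the field names `design_point` and `score`, the function name `add_quantile_rank`, and the toy rows are all assumptions, not names from the actual pipeline. The quantile rank is taken here to be the fraction of statements at the same design point whose score is less than or equal to the statement's own score.

```python
from bisect import bisect_right
from collections import defaultdict


def add_quantile_rank(rows, group_key="design_point", score_key="score"):
    """Return copies of rows with a 'quantile_rank' field computed per group.

    Field names are hypothetical; adapt them to the real table.
    """
    # Collect and sort the scores of each design point.
    scores_by_group = defaultdict(list)
    for row in rows:
        scores_by_group[row[group_key]].append(row[score_key])
    for scores in scores_by_group.values():
        scores.sort()

    ranked = []
    for row in rows:
        scores = scores_by_group[row[group_key]]
        # Fraction of scores in the same group that are <= this score.
        rank = bisect_right(scores, row[score_key]) / len(scores)
        ranked.append({**row, "quantile_rank": rank})
    return ranked


# Toy data, made up for illustration only.
example_rows = [
    {"id": 1, "design_point": "A", "score": 0.9},
    {"id": 2, "design_point": "A", "score": 0.5},
    {"id": 3, "design_point": "A", "score": 0.7},
    {"id": 4, "design_point": "B", "score": 0.2},
]
ranked_rows = add_quantile_rank(example_rows)
```

With this definition the top statement at each design point always gets rank 1.0, and ties share the same rank.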

joshnguyen99 commented 5 months ago

Here's what I think will work. Proper nouns and names can be detected by spaCy, even when they're in lower case.

For example, take the statement with ID = 1361: "if jake considers john's example he would become the strongest and fittest person ever". If I run

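import spacy

# Load the transformer-based English pipeline (the model must be installed separately).
nlp = spacy.load("en_core_web_trf")
doc = nlp("if jake considers john's example he would become the strongest and fittest person ever")

# Print every token tagged as a proper noun.
for tok in doc:
    if tok.pos_ == "PROPN":
        print(tok)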

the output is

jake
john

From this we can do one of two things:

  1. Apply a simple (heuristic) rule to capitalize these proper nouns.
  2. Capitalize them ourselves. I'm guessing the number of affected statements won't be too large for us to handle manually for this 4k corpus.
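
Option 1 could look roughly like the sketch below. It is a minimal illustration, not the pipeline's actual code: the function name `capitalize_proper_nouns` is made up, and it only relies on the `pos_` and `text_with_ws` attributes that spaCy tokens expose. To keep the example runnable without downloading a model, it uses hand-built stand-in tokens whose tags mirror the output above; in practice you would pass `nlp(statement)` instead.

```python
from types import SimpleNamespace


def capitalize_proper_nouns(tokens):
    """Rebuild the text, upper-casing the first letter of each PROPN token.

    `tokens` is any iterable of objects with `pos_` and `text_with_ws`
    attributes, e.g. a spaCy Doc.
    """
    parts = []
    for tok in tokens:
        text = tok.text_with_ws
        if tok.pos_ == "PROPN":
            text = text[:1].upper() + text[1:]
        parts.append(text)
    return "".join(parts)


# Stand-in tokens (an assumption for illustration); with spaCy installed,
# use `capitalize_proper_nouns(nlp("..."))` instead.
fake_doc = [
    SimpleNamespace(text_with_ws="jake ", pos_="PROPN"),
    SimpleNamespace(text_with_ws="considers ", pos_="VERB"),
    SimpleNamespace(text_with_ws="john", pos_="PROPN"),
    SimpleNamespace(text_with_ws="'s ", pos_="PART"),
    SimpleNamespace(text_with_ws="example", pos_="NOUN"),
]
fixed = capitalize_proper_nouns(fake_doc)
print(fixed)  # Jake considers John's example
```

Using `text_with_ws` preserves the original spacing, so possessives such as "john's" come back as "John's" rather than "John 's". The rule is only as good as the tagger, so a manual spot check would still be worthwhile.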

markwhiting commented 5 months ago

Cool, I think removing them with GPT is mostly OK. The exception is when it removes too much, which tends to make the statement meaningless, but in most cases those are statements we should drop anyway. E.g., "Florida is a nice place" becomes "this state is a nice place", and neither of those is a particularly useful statement. So I think what Amir is doing is sufficient, but we may need to keep thinking about how to refine our filtering.