Open markwhiting opened 9 months ago
Let's have a table with the cleanest statements from the GPT pipeline.
Here's what I think will work. Proper nouns and names can be detected by spaCy, even when they're in lower case.
For example, given the statement with ID = 1361, namely "if jake considers john's example he would become the strongest and fittest person ever", if I do
import spacy

# Load the transformer-based English pipeline
nlp = spacy.load("en_core_web_trf")
doc = nlp("if jake considers john's example he would become the strongest and fittest person ever")

# Print every token tagged as a proper noun
for tok in doc:
    if tok.pos_ == "PROPN":
        print(tok)
The output would be
jake
john
From this we can do 2 things:
Cool, I think removing them with GPT is mostly OK, except when it removes too much, in which case it tends to make the statement meaningless. In most cases, though, those are statements we should drop anyway, e.g., "Florida is a nice place" → "this state is a nice place"; neither of those is a particularly useful statement. So I think what Amir is doing is sufficient, but we may need to keep thinking about refining our filtering.
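One possible way to refine the filtering, as a rough sketch rather than what the current pipeline does: reuse the PROPN check from the spaCy snippet earlier in the thread to flag statements that still contain names, so they can be reviewed or dropped. The helper names (`proper_nouns`, `split_by_proper_nouns`) and the `toy_nlp` stand-in tagger are hypothetical and only there so the example runs without downloading a model.

```python
from types import SimpleNamespace


def proper_nouns(doc):
    """Return the text of every token tagged PROPN in a spaCy-style doc."""
    return [tok.text for tok in doc if tok.pos_ == "PROPN"]


def split_by_proper_nouns(statements, nlp):
    """Split statements into (clean, flagged).

    `nlp` is any callable mapping a string to an iterable of tokens that
    expose `.text` and `.pos_`, e.g. a loaded spaCy pipeline.
    """
    clean, flagged = [], []
    for statement in statements:
        names = proper_nouns(nlp(statement))
        if names:
            flagged.append((statement, names))
        else:
            clean.append(statement)
    return clean, flagged


def toy_nlp(text, known_names=("jake", "john", "florida")):
    """Hypothetical stand-in tagger: marks words from a fixed name list as PROPN."""
    tokens = []
    for word in text.split():
        base = word.split("'")[0]  # crude handling of possessives like "john's"
        pos = "PROPN" if base.lower() in known_names else "X"
        tokens.append(SimpleNamespace(text=base, pos_=pos))
    return tokens


statements = [
    "if jake considers john's example he would become the strongest and fittest person ever",
    "exercise makes people stronger",
]
clean, flagged = split_by_proper_nouns(statements, toy_nlp)
print(clean)          # ['exercise makes people stronger']
print(flagged[0][1])  # ['jake', 'john']
```

With the real model, `split_by_proper_nouns(statements, spacy.load("en_core_web_trf"))` should work the same way, since a spaCy pipeline is callable and its tokens carry `.text` and `.pos_`. Whether flagged statements get rewritten by GPT or dropped is left open here.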
Run a multi-stage pipeline to get only very clean statements.
They should be clear and make sense.
Pipeline:
Meta tasks: