SeanDaBlack / AbBOT


Telltale Signs of Output from GPT2 (Double Quotes in Particular) #46

Closed TheLandfill closed 3 years ago

TheLandfill commented 3 years ago

I want to note that GPT2 output, while decent, usually has a ton of telltale signs:

  1. The frequent use of double quotes, as if it's quoting from an interview. Almost every paragraph uses double quotes at least once, especially at the end of the first paragraph.
  2. Long quotations inside a sentence. Removing these requires more advanced filtering, but something along the lines of removing any sentence that contains both a quote and the word "said" would be a start.
  3. The use of advertisement text. Because of how GPT-2 was trained, it sometimes has words like "Advertisement" on its own line.
  4. Any use of colons for things like quotations in interviews.
  5. Characters that are not found on a standard keyboard.
  6. Stuff like "All photos © Michael C. Stough" on its own line.
  7. While you remove the stuff that comes after the final period, you don't remove any trailing newlines afterwards.

We should try to remove as many of these signs as possible from the output of GPT2. Some of these are easier to remove than others, but the ones that are difficult for us to remove are also difficult for them to filter out. For example, they could remove most of GPT2's submissions by removing any submission with double quotes.
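As a rough sketch of the easier cases (signs 3, 5, 6, and 7), something like the following could work. The function name and the patterns are hypothetical and not taken from the AbBOT code, and dropping every non-ASCII character is only one possible reading of "not found on a standard keyboard"; it also deletes curly quotes rather than converting them.

```python
import re

# Hypothetical pattern: a line that contains only the word "Advertisement".
AD_LINE = re.compile(r"^\s*advertisement\s*$", re.IGNORECASE)


def strip_line_artifacts(text: str) -> str:
    """Drop ad and photo-credit lines, non-ASCII characters, and trailing newlines."""
    kept = []
    for line in text.split("\n"):
        # Signs 3 and 6: skip ad markers and copyright/photo-credit lines.
        if AD_LINE.match(line) or "©" in line:
            continue
        # Sign 5 (assumption): remove anything outside printable ASCII.
        line = re.sub(r"[^\x20-\x7E]", "", line)
        kept.append(line)
    # Sign 7: remove trailing newlines and other trailing whitespace.
    return "\n".join(kept).rstrip()
```

The copyright check runs before the non-ASCII pass on purpose, since stripping the © first would hide the credit line from the filter.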

It might also be a good idea to leave these signs in the output every so often so that any filters they implement produce false positives. For example, if they implement a filter for quotation marks, they could remove some real submissions that have quotation marks.
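One way to leave the signs in "every so often" is to skip the cleanup step with some small probability. This is only a sketch: `maybe_clean`, `leave_rate`, and the 10% default are made-up names and values, and `cleaner` stands for whatever cleanup function ends up being used.

```python
import random


def maybe_clean(text, cleaner, leave_rate=0.1, rng=random):
    """Return text untouched with probability leave_rate, otherwise cleaner(text)."""
    # rng.random() is in [0.0, 1.0), so leave_rate=0 always cleans
    # and leave_rate=1 never cleans.
    if rng.random() < leave_rate:
        return text
    return cleaner(text)
```

For example, `maybe_clean(response, str.strip)` would leave roughly one response in ten with its trailing newlines intact.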

ghost commented 3 years ago

We could use regex to filter all that stuff out as long as we have a list of what to look for.

LakesideMiners commented 3 years ago

I'm somewhat decent with regex; if help is needed, I'm available.

rootwork commented 3 years ago

Relatedly, while injecting the names of state Republicans is clever, it also makes for a pretty easy way to filter out submissions coming from this form. I think sticking with the random names (and maybe increasing the number of them) would result in harder-to-filter noise.

TheLandfill commented 3 years ago

I would say that filtering out quotation marks should be the last step in general since you can use them as hints for other regular expressions, like long quotes in a sentence.
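Here is a small sketch of why the order matters. The helper names are hypothetical and the sentence splitter is deliberately naive (it splits on whitespace after `.`, `!`, or `?`), so treat it as an illustration rather than a finished filter.

```python
import re

# Naive sentence boundary: whitespace that follows ., !, or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")
# Straight and curly double quotes.
QUOTE_CHARS = '"\u201c\u201d'


def drop_quoted_said_sentences(text):
    """Remove sentences that contain both a double quote and the word 'said'."""
    kept = []
    for sentence in SENTENCE_SPLIT.split(text):
        has_quote = any(q in sentence for q in QUOTE_CHARS)
        has_said = re.search(r"\bsaid\b", sentence, re.IGNORECASE) is not None
        if not (has_quote and has_said):
            kept.append(sentence)
    return " ".join(kept)


def strip_quotes(text):
    """Delete every double-quote character."""
    return text.translate({ord(q): None for q in QUOTE_CHARS})


def clean(text):
    # Quote-dependent filters first, quote removal last.
    return strip_quotes(drop_quoted_said_sentences(text))
```

If `strip_quotes` ran first, `drop_quoted_said_sentences` would have no quotes left to key on and the interview-style sentence would survive.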

@ChrisBremseth The best way to generate such a list is to generate a bunch of responses from GPT2 and manually look for patterns, which is how I generated my list.
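To make the manual pass a bit less tedious, a script could count how many generated samples trip each candidate pattern. This is a sketch under assumptions: the `SIGNS` table and `tally_signs` are hypothetical, and the patterns only cover a few of the signs listed above.

```python
import re
from collections import Counter

# Hypothetical starting list; extend it as new patterns turn up.
SIGNS = {
    "double_quote": re.compile(r'["\u201c\u201d]'),
    "advertisement_line": re.compile(
        r"^\s*advertisement\s*$", re.IGNORECASE | re.MULTILINE
    ),
    "copyright": re.compile("©"),
    "non_ascii": re.compile(r"[^\x00-\x7F]"),
}


def tally_signs(samples):
    """Count how many samples contain each sign at least once."""
    counts = Counter()
    for sample in samples:
        for name, pattern in SIGNS.items():
            if pattern.search(sample):
                counts[name] += 1
    return counts
```

Running `tally_signs` over a few hundred generated responses and sorting with `.most_common()` would show which signs are worth filtering first.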

ramblingjordan commented 3 years ago

We have moved active development to these repos:

Could you recreate this issue in the appropriate new repo? I think AbBOT-python is the best bet.

Thanks!