Closed TheLandfill closed 3 years ago
We could use regex to filter all that stuff out as long as we have a list of what to look for.
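A minimal sketch of what a list-driven regex filter could look like. The rule list here is a hypothetical placeholder, not the actual list of telltale signs from this thread; `RULES` and `clean` are names made up for illustration. The only GPT2-specific detail assumed is the `<|endoftext|>` token.

```python
import re

# Hypothetical rules: (compiled pattern, replacement). The real list would
# come from inspecting actual GPT2 output for recurring patterns.
RULES = [
    (re.compile(r"<\|endoftext\|>"), ""),  # GPT2 end-of-text token
    (re.compile(r"[ \t]{2,}"), " "),       # collapse runs of spaces/tabs
]

def clean(text, rules=RULES):
    """Apply each rule in order and trim surrounding whitespace."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text.strip()
```

Since the rules are applied in order, new patterns can be added to the list without touching the function.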
I'm somewhat decent with regex; if help is needed, I'm available.
Relatedly, while injecting the names of state Republicans is clever, it also makes for a pretty easy way to filter out submissions coming from this form. I think sticking with the random names (and maybe increasing the number of them) would result in harder-to-filter noise.
I would say that filtering out quotation marks should be the last step in general since you can use them as hints for other regular expressions, like long quotes in a sentence.
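A rough sketch of that ordering, assuming a "long quote" means 40 or more characters between double quotes (the threshold and the names `LONG_QUOTE` and `clean_quotes` are made up for illustration). The quotation marks are first used as a hint to find long quoted passages, and only stripped at the very end.

```python
import re

# Assumption: 40+ characters between double quotes counts as a "long quote".
LONG_QUOTE = re.compile(r'"[^"]{40,}"')

def clean_quotes(text):
    # Step 1: use the quotation marks as a hint to find and drop long quotes.
    text = LONG_QUOTE.sub("", text)
    # Step 2 (last): remove any quotation marks that are still left.
    text = text.replace('"', "")
    # Tidy up whitespace left behind by the removals.
    return re.sub(r"\s{2,}", " ", text).strip()
```

If the quotation marks were stripped first, the long-quote pattern would have nothing left to match on.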
@ChrisBremseth The best way to generate such a list is to generate a bunch of responses from GPT2 and manually look for patterns, which is how I generated my list.
We have moved active development to these repos:
Could you recreate this issue in the appropriate new repo? I think AbBOT-python is the best bet.
Thanks!
I want to note that GPT2 output, while decent, usually has a ton of telltale signs:
We should try to remove as many of these signs as possible from the output of GPT2. Some of these are easier to remove than others, but the ones that are difficult for us to remove are also difficult for them to filter out. For example, they could remove most of GPT2's submissions by removing any submission with double quotes.
It might also be a good idea to leave these signs in the output every so often so that the filters they implement produce false positives. For example, if they implement a filter for quotation marks, they would also remove some real submissions that contain quotation marks.
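One possible way to leave a sign in "every so often", using quotation marks as the example. The function name `maybe_strip_quotes` and the default `keep_probability` of 0.1 are assumptions for illustration, not values anyone has tested.

```python
import random

def maybe_strip_quotes(text, keep_probability=0.1, rng=random):
    """Strip double quotes, except for a random fraction of outputs.

    keep_probability is a hypothetical tuning knob: the share of outputs
    that keep their quotation marks untouched.
    """
    if rng.random() < keep_probability:
        return text  # leave the telltale sign in this time
    return text.replace('"', "")
```

Passing a seeded `random.Random` instance as `rng` makes the behavior reproducible when testing.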