Open cdolfi opened 4 years ago
@cdolfi is there anything about the offensive language or hate speech that is specific to the Fedora mailing list? If not, you could probably use an external dataset (here is one possible example) that labels emails or tweets as hateful/offensive to train your model. Maybe I don't fully understand the goal of this project : ) but if the goal is to develop a hateful language detector that can then be applied to the Fedora mailing list, it's not clear to me that the Fedora mailing list would be the best source of training data, since it probably has a low occurrence of offensive language or hate speech (I'm guessing).
@MichaelClifford The only thing I found to be unique about the mailing list compared to many data sets online was the style of communication. People communicate very differently on a mailing list than they do on a Twitter feed. Another option I have been considering is combining a public data set like the one you linked above with a data set specific to communication in a semi-professional setting like the Fedora mailing list. My biggest concern with using a Twitter data set is that a model trained on it will not detect hateful or discriminatory language on the mailing list, because the way it's written is different.
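To make the "public data set plus mailing-list data" option concrete, here is a minimal sketch of one way it could be done: merge both sources and give the in-domain examples a higher weight when training a small Naive Bayes classifier. Everything in it is an assumption for illustration only: the function names (`combine`, `train`, `predict`), the weight of 3.0, and the toy sentences are all made up, and nothing here comes from the actual project code or data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes as tokens.
    return re.findall(r"[a-z']+", text.lower())


def combine(public, in_domain, in_domain_weight=3.0):
    # Merge two labeled sets into (text, label, weight) triples.
    # In-domain (mailing list style) examples get a higher weight so the
    # larger public set does not drown them out. The weight is a guess.
    merged = [(text, label, 1.0) for text, label in public]
    merged += [(text, label, in_domain_weight) for text, label in in_domain]
    return merged


def train(examples):
    # Weighted multinomial Naive Bayes: counts are sums of example weights.
    word_counts = defaultdict(Counter)
    class_weight = Counter()
    vocab = set()
    for text, label, weight in examples:
        class_weight[label] += weight
        for tok in tokenize(text):
            word_counts[label][tok] += weight
            vocab.add(tok)
    return word_counts, class_weight, vocab


def predict(model, text):
    word_counts, class_weight, vocab = model
    total = sum(class_weight.values())
    best_label, best_score = None, -math.inf
    for label in class_weight:
        score = math.log(class_weight[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing.
                score += math.log((word_counts[label][tok] + 1.0) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up examples (1 = offensive, 0 = not offensive).
public = [
    ("you are an idiot", 1),
    ("what a lovely day", 0),
    ("i hate you loser", 1),
    ("great game tonight", 0),
]
in_domain = [
    ("this patch is garbage and so are you", 1),
    ("thanks for the review, will rebase", 0),
    ("please file a bug for the rebase failure", 0),
]

model = train(combine(public, in_domain))
print(predict(model, "thanks for the patch review"))
print(predict(model, "you are garbage"))
```

A real setup would more likely use an existing library classifier and a proper held-out set of mailing-list emails for evaluation; the point here is only that the two sources can be mixed with the in-domain data weighted up.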
With the late discovery on the data, the cleaned data set is now very different from the labeled data from September. The benefit of this data is that it has over 1000 labeled emails from about 20 different contributors. I am looking for suggestions on different ways to handle the data issue. A few options could be:
Here is the old labeled data: https://docs.google.com/spreadsheets/d/1Th2I1tgG0ivvV-Ubs-7ehVMor__tgCMU3SUpHWfXQmY/edit?usp=sharing
Reach out to me at cdolfi@redhat.com for access to the new clean labeled data in my bucket.
Acceptance Criteria: