FUB-HCC / seminar_critical-social-media-analysis

Assignments for session 12 #41

Open mictebbe opened 3 years ago

mictebbe commented 3 years ago
1. Paper reading: Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, and N. Asokan. 2018. All You Need is "Love": Evading Hate Speech Detection. In Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security (AISec '18). Association for Computing Machinery, New York, NY, USA, 2–12. DOI: https://doi.org/10.1145/3270101.3270103

1.1 Discussion questions: Write down 3 questions that came up while reading the paper that would be interesting to discuss in the next session. Post your questions on GitHub as comments under the assignment.

Time slots project presentation: Find a slot to present your project in session 13 (11.02.) or 14 (18.02.): https://docs.google.com/spreadsheets/d/1DdkST3KZV4x9D5nGsHgevIASmu_rFkK0Bx2r4AeBGPE/edit#gid=1895482106

Rahaf66 commented 3 years ago

The paper mentions the problem of conflating hate speech and offensive speech, while the dataset T1 has three labels (hate speech, offensive, and ordinary). I think the rules for distinguishing these classes are subjective and sensitive to many factors that can also change over time. What are the criteria that distinguish hate speech from offensive speech? Are cross-cultural differences taken into consideration?

The paper also mentioned the influence of imbalanced classes on classifier performance. Could expanding the training data (in our case, by adding more hate speech comments) solve this problem in practice?
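A minimal sketch of what "expanding the training data" could mean in the simplest case, via naive oversampling of the under-represented hate class (the texts, labels, and counts below are invented for illustration; in practice one would rather collect genuinely new hate-speech examples):

```python
import random

# Hypothetical labelled comments; the hate class is heavily under-represented.
dataset = [("have a nice day", "ordinary")] * 90 + [("<hateful comment>", "hate")] * 10

hate = [s for s in dataset if s[1] == "hate"]
rest = [s for s in dataset if s[1] != "hate"]

# Naive rebalancing: repeat hate-class samples until the classes are even.
# This only reweights existing examples and adds no new linguistic variety,
# which is why collecting more real hate-speech comments is the harder question.
balanced = rest + [random.choice(hate) for _ in range(len(rest))]
random.shuffle(balanced)
```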

A very important point that was raised is the need to focus more on the datasets and on qualitative analysis rather than on the models. When we talk about YouTube, Facebook, or Twitter, how effective is the role of content moderators from this perspective?

yaozheng600 commented 3 years ago

Question 1: In section 2.1 the referenced paper reports that LSTM+GBDT gives a better result than LSTM alone, but the authors obtained the opposite result, i.e. LSTM outperformed LSTM+GBDT. Why does the more complex model perform worse? Is that a problem of the model itself, or does it depend on the datasets used? One possible reason I can think of is that the model is overfitting.

Question 2: In section 3.3 the authors use the word-appending method for adversarial training. It is certainly useful to add common words to hate speech and make the dataset more general. But on the other hand, if you add hate words to ordinary sentences, the whole utterance effectively turns into the "hate speech" class. So does that make sense? What is the point of adding hate words to normal speech? I personally don't think it helps to get a better result.
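A rough sketch of the word-appending idea being discussed, assuming a simple (text, label) training set; the word list and helper names are invented for this sketch and are not the authors' code:

```python
import random

BENIGN_WORDS = ["love", "peace", "thanks", "friends"]  # assumed example list

def append_benign_words(text, n=3):
    # Evasion attack: pad a hateful message with unrelated benign words so that
    # a word-based classifier's overall toxicity estimate is diluted.
    return text + " " + " ".join(random.choices(BENIGN_WORDS, k=n))

def adversarial_training_set(samples):
    # Adversarial training: add padded variants of the hate-class samples with
    # their original label, so the model learns to ignore the padding.
    # Note: the reverse direction (adding hate words to ordinary sentences)
    # would change the true label, which is exactly the concern raised above.
    augmented = list(samples)
    for text, label in samples:
        if label == "hate":
            augmented.append((append_benign_words(text), label))
    return augmented
```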

Question 3: In section 8 the authors argue that we should focus more on the datasets instead of the models. What exactly should we do about that? Are there any ideas?

iraari commented 3 years ago
1. Where should the boundary between true positives (hate speech) and false positives (offensive speech) lie? Is it even possible to set this boundary?
2. What other evasion attacks could be used, for example once the simple text-modification attacks described in the article have been addressed?
3. What could be done to prevent word-based models from being broken by a single word that correlates negatively with hate speech ("love" in the article)?
travela commented 3 years ago
  1. Shouldn't we, before training and evaluating models, clarify what kind of offensive speech should be censored? Can we justify false positives for fewer false negatives (i.e. what is more important, free speech or hate-free communities)?
  2. In addition to the above: Does the assumption hold that a distinction between offensive speech and hate speech is appropriate (given that this distinction is subjective)?
3. What happens if we combine all data sets for training and leave only one data set for testing? This should force the model to generalize more and be trained with more heterogeneous data (since the data sets were noted to be the main issue).
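A small sketch of the evaluation scheme proposed in point 3 (train on the union of all data sets but one, test on the held-out one); `train_model` and `evaluate` are placeholders for whichever classifier and metric are used:

```python
def leave_one_dataset_out(datasets, train_model, evaluate):
    # datasets: dict mapping a name (e.g. "T1") to a (texts, labels) pair.
    scores = {}
    for held_out in datasets:
        train_texts, train_labels = [], []
        for name, (texts, labels) in datasets.items():
            if name != held_out:
                train_texts.extend(texts)
                train_labels.extend(labels)
        model = train_model(train_texts, train_labels)           # fit on the union
        scores[held_out] = evaluate(model, *datasets[held_out])  # test on held-out set
    return scores
```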
raupy commented 3 years ago
1. From a philosophical point of view I find it beautiful that "love always wins", but from a technical point of view it is strange that the word "love" has such a big impact on reducing the toxicity score. It should have no impact on toxicity but rather on a "love score". So maybe, even though the researchers would focus on the training data, a different model architecture would be an idea, e.g. one with multiple output channels for hate and love (or hope, or joy, or whatever)? But of course I don't know much about model architectures, so maybe I don't know what I am talking about.
2. The researchers would rather focus on better training data sets. Are there any ideas about what the data should look like?
3. I would like to see more examples of hateful speech and offensive speech. It seems difficult to me to set a strict boundary between these two classes. And "offensive" does not sound respectful anyway, so maybe it should be avoided too?
satorus commented 3 years ago
1. Seeing as simple whitespace removal effectively kills word-based detection, could one say that word-based hate-speech detection is not really feasible, because it can easily be tricked by just about any average user? (See the sketch after these questions.)

2. In the case of whitespace removal, I don't think it is always that easy for the reader to reconstruct the original content, as removing all whitespace can make the text very hard to read. So I don't think it is as easy a technique for an adversary to use as the authors want us to believe.

3. Adding "love" to a sentence indeed makes it less toxic/hateful for an ML algorithm, because from a purely numerical/statistical standpoint "love" is an anti-hate word, so reducing the toxicity score of that sentence seems logical. Are there currently any ideas on how to resolve such cases? Context becomes very important here; can ML deal with that?
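To make the whitespace point above concrete, here is a toy illustration (invented strings, with plain `str.split` standing in for a real tokenizer) of why removing spaces blinds a word-based model while leaving the text readable enough for many humans:

```python
def remove_whitespace(text):
    # The attack: strip the spaces so that known tokens disappear.
    return text.replace(" ", "")

def word_tokens(text):
    # Stand-in for a word-level tokenizer / vocabulary lookup.
    return text.lower().split()

original = "I really hate you"
attacked = remove_whitespace(original)

print(word_tokens(original))  # ['i', 'really', 'hate', 'you'] -> 'hate' is matched
print(word_tokens(attacked))  # ['ireallyhateyou'] -> one unknown token, nothing matched
```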

mrtobie commented 3 years ago
1. The threat of false positives is very high, considering that these algorithms will sooner or later be used unsupervised in social networks (such as Facebook). The German NetzDG (Network Enforcement Act) forces these social networks to delete fake news and hate speech. If these algorithms are quite easy to attack, and the deletion of false positives could violate freedom of speech, is it really desirable to use algorithms for this at all?

2. Since char-based algorithms are much more resistant to attacks, why are word-based ones used at all? Is their performance so much better? (There is a small feature-level sketch after these questions.)

3. A lot of this paper revolves around the datasets. It was mentioned that the data is labelled manually by different people. The authors believe that there are great differences between the datasets depending on whether the data was labelled into 2 classes or 3. Wouldn't this assumption imply that all researchers label data the same way once it is divided into 2 or 3 groups? From my perspective, the decisions made by the individual labellers (due to not having a strong definition of hate speech) should have a much bigger impact on the datasets than the fact that some datasets subdivide the non-hate speech further.
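One hedged way to see the word-based vs. char-based trade-off from question 2 is to compare the two feature extractors directly, e.g. with scikit-learn (a sketch, not the paper's setup; the example strings are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["I really hate you", "Ireallyhateyou"]  # original vs. whitespace-removal attack

word_vec = TfidfVectorizer(analyzer="word")
char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))

W = word_vec.fit_transform(texts)
C = char_vec.fit_transform(texts)

# Word features: the attacked string shares no tokens with the original,
# so a word-based model sees something completely new.
# Character n-grams: most n-grams ('ha', 'hat', 'ate', ...) survive the attack,
# which is one reason char-based models resist it better.
print((W[0].multiply(W[1])).nnz)  # 0 shared word features
print((C[0].multiply(C[1])).nnz)  # many shared character n-gram features
```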

Aylin-00 commented 3 years ago
  1. Assuming that hate speech often reflects real life events, could one manually find trends in hate speech, so that the models can pay attention to that?
2. Why was syntax not considered? Is it because syntax-based detection gets confused easily? Adding words like "love", or using a curse word before a non-hateful verb, could maybe be detected that way.
3. Why is metadata not used to classify hate speech? In a video about a terrorist attack on mosques one would expect anti-Muslim hate speech, but not vaccination-related hate speech. Maybe the classifier could pay special attention to hateful words in combination with words related to Muslims or Islam.
Francosinus commented 3 years ago
1. Since words like "love" or the "F" word can strongly affect the prediction for a text, how can one train the classifier to differentiate, for example, between positively and negatively connotated swear words? Or rather, how can a model be taught to understand the connections between different words?

2. Training and testing the classifiers on the original datasets resulted in good predictions, but on different test sets the F1 scores deteriorated drastically. Language models usually have to be fine-tuned for specific data sets, or maybe even per platform. Would it make sense to train on positive or negative words first and then adjust the model later?

3. Adversarial attacks can have a huge impact on a model's performance. I once tried this with an image classifier by adding some noise to the pictures: to the human eye the images were still clearly recognizable, but the machine failed completely. For a language model, should errors such as spelling mistakes already be taken into account in the model planning phase? How can one prevent such attacks?
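Regarding the spelling-mistake question, one possible (purely illustrative) way to "include errors in the planning phase" is to augment the training data with typo-perturbed copies of each text, analogous to the adversarial training discussed in the paper; the helper names below are made up:

```python
import random

def add_typo(text):
    # Swap two adjacent characters to simulate a simple spelling mistake.
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def augment_with_typos(samples, copies=1):
    # samples: iterable of (text, label); the perturbed copies keep their label,
    # so the classifier already sees typo-style variants during training.
    augmented = list(samples)
    for text, label in samples:
        for _ in range(copies):
            augmented.append((add_typo(text), label))
    return augmented
```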

ChristyLau commented 3 years ago
  1. What are the criteria to differentiate hate speech and offensive speech?
2. Is it possible to add a cultural component to the hate speech detector, i.e. religious and cultural taboos? And how should we realize it?
3. The detector is unable to detect homophobic hate speech, which has been observed on many social media platforms. How did the researchers deal with this kind of speech?
adrigru commented 3 years ago

Question 1: What characterises offensive speech? What are the boundaries between offensive and hate speech?

Question 2: How can we construct datasets comprising a large variety of hate speech variants from diverse sources? Also, how do we label the data, given that offensive/hate speech is a subjective matter?

Question 3: In the introduction the authors claim that "hate speech detection is largely independent of model architecture." However, in section 4 they say that model selection influences performance in terms of attack resilience. How can these two statements both be true?

yuxin16 commented 3 years ago

1. How do we set clear boundaries between false positives and false negatives?

2. They mentioned that, given the asymmetry of hate speech classification, the solution might be to reintroduce more traditional keyword-based approaches. Won't a keyword-based approach be too simple to detect hate speech efficiently? If not, how would it work? Should the keyword-based approach be combined with other techniques?

3. They mentioned that hate speech is highly context-dependent. What is meant by context here: the surrounding discussion, or the discussed topic?
budmil commented 3 years ago
DanielKirchner commented 3 years ago
1. Where are the different hate speech detection models used? Are they used by the big social media companies? For every comment that is posted, or only for those that are reported by other users?
2. Speaking of reporting: human users would still be able to report comments that use some of the presented attacks to avoid detection. So how big is the impact of such attacks anyway?
3. A few more examples would be nice.
Moritzw commented 3 years ago
milanbargiel commented 3 years ago

1. How were the performances of the models finally evaluated? By human beings cross-checking results?

2. One result of this study is that all models are somehow equally "good" at classifying hate speech. Therefore, according to the authors, the focus of future research should be on the datasets instead of the models. How would that work? If all models have difficulties classifying new content, how can you improve hate speech classification by improving the data set?

3. To what extent could transfer learning help to make the models more efficient?

SabinaP17 commented 3 years ago

1. The paper raises the problem of properly distinguishing between the concepts of hate speech and offensive speech. Although a definition of the former is offered, a clear delimitation between the two concepts is not presented in the paper. It was also mentioned that, when tested on offensive but ordinary speech, each two-class model proved susceptible to false positives. What criteria are applied to distinguish between these two concepts? Can these criteria be considered cross-cultural? More examples of words that fall into these two groups would be helpful.

2. Regarding the classification of data into hate and offensive speech, it seems to me that this is still a rather subjective matter, judging by the different researchers' models used in the paper. How can data belonging to either of these two groups be labeled more efficiently?

3. The paper showed that appending words like "love" or the "F" word to a text directly affects its toxicity score, leading to false positives or false predictions. How can a classifier be trained to correctly differentiate between the contexts in which such words ("love", "F", etc.) are used, so that, for example, non-hateful sentences containing an "F" word are not predicted to be hateful?

jtembrockhaus commented 3 years ago

1. The authors showed that adversarial attack strategies are very effective against hate speech classification models. They further discussed the positive impact of adversarial training in preventing the misclassification of altered samples. Since one can only add adversarial training samples for known attack strategies, can you imagine adversarial attacks other than those described in the paper?

2. It was said that all models performed roughly equally well when tested on data similar to what they were trained on. However, when the models were trained on one data set (e.g. T1) and then tested on another (e.g. T2), the performance dropped massively. Imagine combining the models (all trained on different data) by applying them separately to an arbitrary data set and taking a majority vote for the final classification (see the sketch after these questions). Do you think this would result in better classification accuracy?

3. Word-based and character-based approaches differ completely in their structure. Which of the two do you think is the more promising strategy for the future, and why?
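A minimal sketch of the majority-vote idea from question 2, assuming each model exposes some `predict(text) -> label` method (a placeholder interface, not the paper's code):

```python
from collections import Counter

def majority_vote(models, text):
    # Each model was trained on a different data set; the ensemble label is
    # simply the most common individual prediction (ties resolved arbitrarily).
    votes = [model.predict(text) for model in models]
    return Counter(votes).most_common(1)[0][0]

def classify_corpus(models, texts):
    return [majority_vote(models, t) for t in texts]
```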

anastasiia-todoshchuk commented 3 years ago

1) Could the performance drop on the different data types be the consequence of overfitting?

2) Are there any ways to deal with the described adversaries?

3) What's the main difference between the researchers’ solution and Google’s one?

alexkhrustalev commented 3 years ago

1. Taking into account the design of a model, what could be the reason for character-level features outperforming word-level ones?
2. How were these seven models chosen?
3. What contributes more to the results: the data type or the labeling criteria?

Cloudz333 commented 3 years ago
1. As mentioned in the paper, "hate speech is not a universal concept". So it would be interesting to know how these models can deal with the bias introduced during training, or how humans can deal with it once the models are deployed. I can imagine that this could represent a threat to freedom of speech, e.g. if the word "Trump" is associated with a certain degree of toxicity while the word "Obama" is not.
2. Related to the previous question, it would also be good to know by which criteria the data is labeled. Can data labeling help solve this problem? If so, to what extent?
3. A distinction is made between hateful and offensive speech. Just from reading the article it is not very clear to me what the differences between these two classes are. But I suppose this distinction increases the complexity of the labeling process, which is probably already quite complex. Why make this distinction?