Closed shemmings6 closed 1 year ago
Hello!
I'm not sure what you are doing in RustrictFilter::is_censored but, if I had to guess, you are checking if the uncensored text equals the censored text to determine validity. If true, this is problematic since the intended way to tell if text is "valid" is to check the Type output of Censor::analyze or Censor::censor_and_analyze using Type::is. The String output of Censor::censor or Censor::censor_and_analyze is only intended to be a censored subset of the original input, and differences from the original input do not necessarily imply any particular reason for censoring.
As you discovered in the documentation, a limitation of the current algorithm is removing diacritics while censoring. A possible workaround is to use the original text if the Type is appropriate:
use rustrict::{Censor, Type};

let input = String::from("Ernésto Jose Durán Lar");
let (censored, analysis) = Censor::from_str(&input).censor_and_analyze();
let output = if analysis.is(Type::INAPPROPRIATE) {
    // A bad word was detected: only pass through the censored version
    // (you could also reject the input and ask the user to try again).
    censored
} else {
    // No bad words detected: allow the original input, accents included.
    input
};
To be clear, there is a tradeoff. The above may allow some malicious inputs that exploit diacritics as part of the profanity.
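To illustrate why comparing strings gives false positives while checking the analysis does not, here is a minimal, self-contained sketch. It does not use rustrict: strip_accents, toy_censor_and_analyze, and the banned word "darn" are hypothetical stand-ins for a filter that strips accents in a pre-processing step, and a plain bool stands in for the Type analysis.

```rust
/// Toy stand-in for an accent-stripping pre-processing step.
/// Only handles the two accented letters used in this example.
fn strip_accents(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{e1}' => 'a', // a with acute accent
            '\u{e9}' => 'e', // e with acute accent
            other => other,
        })
        .collect()
}

/// Toy censor (hypothetical, not rustrict's algorithm): strips accents,
/// then masks one placeholder banned word. Returns the censored text and
/// a flag saying whether the banned word was actually found.
fn toy_censor_and_analyze(input: &str) -> (String, bool) {
    const BANNED: &str = "darn"; // placeholder word for illustration
    let normalized = strip_accents(input);
    let flagged = normalized.contains(BANNED);
    let censored = normalized.replace(BANNED, "****");
    (censored, flagged)
}

fn main() {
    // "Ernésto Jose Durán Lar", written with escapes for the accented letters.
    let input = "Ern\u{e9}sto Jose Dur\u{e1}n Lar";
    let (censored, flagged) = toy_censor_and_analyze(input);

    // The output differs from the input only because accents were stripped.
    let differs = censored != input;
    println!("differs from input: {differs}, flagged: {flagged}");
}
```

In this toy model the name differs from its censored form but is not flagged, which is the same situation the analysis-based workaround above is meant to handle.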
Yes, you're right; it turns out we are doing a string comparison.
Are there any plans to not remove accent marks? Instead of returning sanitized input (with accents stripped), could it return only the profanity modifications?
We will be moving forward with using the analysis approach for now though, thank you for your quick response!
Are there any plans to not remove accent marks?
I haven't found a way to preserve accents but, if and when I do, I'll make this change. The challenging part is that accents are removed in a pre-processing step, so the rest of the filter doesn't have to handle them. Either innocent accents need to be added back or not removed in the first place. Not removing accents seems preferable, but would mean the more complex parts of the filter would have to handle them.
The approach I ended up going with, to try to avoid some of the malicious inputs you mentioned, was to make is_censored function like this:
fn is_censored(&self, input: &str, severity: Severity) -> bool {
    !self
        .filter(input, severity)
        .eq(&diacritics::remove_diacritics(input))
}
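The idea behind this version can be sketched without either crate. In the sketch below, the generic is_censored helper and both closures are hypothetical: normalize plays the role of diacritics::remove_diacritics, and filter plays the role of self.filter(input, severity) with "darn" as a placeholder banned word. It assumes the filter's only benign change is the same accent removal that normalize performs; if the filter normalized anything else, that difference would still be reported as censored.

```rust
/// Reports text as censored only when `filter` changed it beyond what
/// `normalize` alone would have changed.
fn is_censored<F, N>(input: &str, filter: F, normalize: N) -> bool
where
    F: Fn(&str) -> String,
    N: Fn(&str) -> String,
{
    filter(input) != normalize(input)
}

fn main() {
    // Hypothetical stand-in for accent removal (handles two letters only).
    let normalize = |s: &str| s.replace('\u{e9}', "e").replace('\u{e1}', "a");
    // Hypothetical stand-in for the filter: strip accents, then mask a word.
    let filter = |s: &str| normalize(s).replace("darn", "****");

    // An accented name is no longer reported as censored.
    let name = "Ern\u{e9}sto Dur\u{e1}n";
    println!("name censored: {}", is_censored(name, &filter, &normalize));

    // A banned word disguised with an accent is still reported.
    let disguised = "d\u{e1}rn";
    println!("disguised censored: {}", is_censored(disguised, &filter, &normalize));
}
```

Compared with passing the original input through whenever the analysis is clean, this comparison keeps the decision tied to the accent-stripped text, so an accent inserted into a banned word does not hide it.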
Motivation
It is causing valid names to be marked as censored
Summary
While investigating why words with accented characters were marked as censored, I noticed this comment on the censor function
I made a test case to see how the text differs once run through censor and what it was marked as. This outputs a failed test on the assert that the name is not censored:
Could a change be made so that at minimum, this is not considered censored text?
Alternatives
Accented characters are not removed at all and they are not marked as censored
Context
I am using rustrict version 0.4.0