finnbear / rustrict

rustrict is a profanity filter for Rust
https://crates.io/crates/rustrict
MIT License

Filter falsely detects characters at the end of swears as part of the swear #24

callmeclover closed 8 months ago

callmeclover commented 8 months ago

(This was taken from this bartender issue.)

If we try to run censor on the text "fuck", it returns "f***" as expected. The issue arises, however, when we attempt to combine it with a markdown parser like Earmark. This puts us in a predicament: if we run censor before we run as_html, the censored output is parsed as markdown. If we run censor after we run as_html, self-censoring in the input is parsed as markdown in some cases, sabotaging the filter. Running it after also makes "<p>fuck</p>" return "<p>f****/p>", with the < of the closing tag censored as part of the swear. We could try to iterate over the HTML and censor it piece by piece, but then we would run into the same issue with self-censoring.

I will probably just remove these tags that Earmark/Pulldown adds. I am just posting this issue so that it's known.

EDIT: Version is 0.7.24

finnbear commented 8 months ago

Profanity filtering only makes sense for plaintext, not markdown or HTML. If your HTML contains unbroken sentences, I would suggest using the visitor pattern to censor them all. E.g. <div><h1>Hello</h1><div><p>Hi</p><p>Crap</p></div></div> would produce 3 calls to censor, one per text node. This is not trivial if you have rich text e.g. <p>Hi <b>as</b>shole</p> which breaks up profanities or false positives.
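
The per-text-node approach described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not rustrict's API or that of any HTML library: the `Node` enum and `censor_text_nodes` are hypothetical names, and the `censor` parameter stands in for whatever plaintext censoring call you use.

```rust
/// Hypothetical, minimal DOM. A real HTML library would supply its own tree type.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// Visits every text node in place and replaces its content with `censor(text)`.
/// The element hierarchy is left untouched. Returns the number of text nodes visited.
fn censor_text_nodes(node: &mut Node, censor: &dyn Fn(&str) -> String) -> usize {
    match node {
        Node::Text(text) => {
            // One censor call per text node.
            *text = censor(text.as_str());
            1
        }
        Node::Element { children, .. } => children
            .iter_mut()
            .map(|child| censor_text_nodes(child, censor))
            .sum(),
    }
}
```

For the `<div><h1>Hello</h1><div><p>Hi</p><p>Crap</p></div></div>` example, a tree built from this type would be visited with three censor calls. Because each text node is censored on its own, a word split across nodes is not seen as a whole.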

it would parse self censoring as markdown in some cases

A possible workaround for this one specific issue is to change the replacement character:

use rustrict::{Censor, Type};

let (censored, analysis) = Censor::from_str("fuck")
    .with_censor_replacement('%') // some non-markdown character
    .censor_and_analyze();

assert_eq!(censored, "f%%%");

You could also choose a character like '\u{FFFD}' and then replace it back to * after markdown runs.
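
A rough sketch of that placeholder round trip, using only the standard library. The function names are hypothetical, and the `censor` and `render_markdown` parameters are stand-ins for the real censoring and Markdown steps (neither is an actual rustrict or Earmark signature).

```rust
/// Temporary censor mark. U+FFFD is an assumption: any character that the
/// Markdown renderer does not treat as syntax would serve the same purpose.
const PLACEHOLDER: char = '\u{FFFD}';

/// Swaps the temporary placeholder back to `*` once Markdown rendering is done.
fn restore_censor_marks(rendered_html: &str) -> String {
    rendered_html.replace(PLACEHOLDER, "*")
}

/// Censors with the placeholder, renders Markdown, then restores `*`.
/// `censor` receives the input and the replacement character to use.
fn censor_then_render(
    input: &str,
    censor: impl Fn(&str, char) -> String,
    render_markdown: impl Fn(&str) -> String,
) -> String {
    let censored = censor(input, PLACEHOLDER);
    let html = render_markdown(&censored);
    restore_censor_marks(&html)
}
```

One caveat of this sketch: any U+FFFD already present in the input would also come back as `*`.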

Filter falsely detects characters at the end of swears as part of the swear #24

To address your issue title, the filter is working as intended, since < looks like another c. You can change the replacements to get around this instance or stop using replacements entirely with an empty Replacements.

callmeclover commented 8 months ago

Yes, this was one of my original ideas, but on closer inspection it will still censor characters like '%' or '('. I think a good idea is to implement a separate parser for markdown (on my end, in Bartender). I will close this issue once it is implemented.

callmeclover commented 8 months ago

Quick question, though. How do I run censor over all the text nodes? The CensorIter trait only supports chars, and I don't know if it's possible to put the vector from scraper of the text nodes into chars, censor it with the iterator, and then put it back. Unless you're referring to a for loop, but wouldn't that just result in the censoring being cut off?

finnbear commented 8 months ago

How do I run censor over all the text nodes?

One at a time. You can call censor once per text node. As for how you visit all the text nodes individually, that will depend on your HTML data structure. If your markdown parser returns HTML as a string, I guess you would first need to parse it into a data structure, do the censoring (recursively), and then convert it back to HTML.

callmeclover commented 8 months ago

Would it be a smart idea to run a for in loop on the html string after parsing using something like html5ever, and just replace text? Is there a better way to implement something? Would this still work for things that cut off profanities, e.g. <p>as<em>shole</em></p>?

finnbear commented 8 months ago

Would it be a smart idea to run a for in loop on the html string after parsing using something like html5ever, and just replace text?

Seems tricky to synchronize the input and output of the censor iterator to reassemble the HTML after censoring. E.g. if censor iterates 'a', 's', 's', 'h', 'o', 'l', and 'e' and outputs 'a', '*', '*', '*', '*', '*', and '*', how do you know which of the '*' go inside the <em>? It seems like you could just count the characters but this is not the case, as censoring may remove characters (diacritics, aka accents, and others).
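
The length mismatch can be illustrated with a toy stand-in. This is not rustrict's algorithm: `toy_censor` is a hypothetical function that only mimics the property that the output may contain fewer characters than the input, here by dropping combining diacritical marks.

```rust
/// Toy stand-in for a censoring iterator: drops combining diacritical marks
/// (U+0300..=U+036F) and passes every other character through unchanged.
fn toy_censor(input: impl Iterator<Item = char>) -> impl Iterator<Item = char> {
    input.filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
}

/// Returns (input character count, output character count) for `text`.
fn char_counts(text: &str) -> (usize, usize) {
    (text.chars().count(), toy_censor(text.chars()).count())
}
```

With the input "fo\u{0308}o" (an "o" followed by a combining diaeresis), the input has four characters and the toy output has three, so splitting the output back across tags by input position would misalign from that point on.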

Would this still work for things that cut off profanities

My suggested approach would not: there are two text nodes in <p>as<em>shole</em></p> and neither is intrinsically profane. Yours could handle this, assuming you got it to work.

By the way, either way you go, you will need to unescape HTML (unless your parser supports this) and re-escape HTML (unless your formatter supports this).
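
As a rough sketch of that unescape/censor/re-escape step, assuming your parser and formatter do not already handle it. All three function names are hypothetical, only a handful of common entities are covered (numeric and other named entities are not), and `censor` stands in for the real plaintext censoring call.

```rust
/// Decodes a few common entities found in HTML text.
/// `&amp;` is handled last so that `&amp;lt;` decodes to `&lt;`, not `<`.
fn unescape_text(escaped: &str) -> String {
    escaped
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&amp;", "&")
}

/// Re-encodes characters that are unsafe inside an HTML text node.
/// `&` is handled first so the entities produced afterwards are not double-escaped.
fn escape_text(plain: &str) -> String {
    plain
        .replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

/// Unescapes one text node, censors the plaintext, and escapes the result again.
fn censor_escaped_text(escaped: &str, censor: impl Fn(&str) -> String) -> String {
    escape_text(&censor(&unescape_text(escaped)))
}
```

Censoring the unescaped text means the filter sees `<` rather than `&lt;`, and re-escaping afterwards keeps the output valid as HTML text.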

callmeclover commented 8 months ago

I do have a way. It goes from an element (<p>as<em>shole</em></p>) to a Vec (["as", "shole"]). A copy of that is flattened into a list of chars, which gets censored (['a', '*', '*', '*', '*', '*', '*']), while the Vec itself remains the same. Then they are run through this while loop:

while index < nodes.len() {
    // Count characters (not bytes), and do so before the node is overwritten.
    let len = nodes[index].chars().count();
    // Move the next `len` censored characters into this node.
    nodes[index] = chars.drain(..len).collect();
    index += 1;
}

There's probably an easier way to do this, but I digress. What I don't know how to do yet is convert this back into HTML.

finnbear commented 8 months ago

What I don't know how to do yet is convert this back into HTML.

Yeah, that's the tricky part. Your vec is a lossy representation of the HTML. My recommendation was to censor in place, without changing the hierarchy (even if it means not censoring over rich text boundaries like <em>).

callmeclover commented 8 months ago

There is probably some way to do it, maybe storing the tag value? Or, perhaps, the index of the text nodes in the DOM tree? I think the latter is a better option; I'm going to try it when I get back to my workstation.

finnbear commented 8 months ago

Good luck with your use-case! Considering I won't be attempting to make rustrict ignore/preserve Markdown/HTML structures, this issue is now out of scope.

You would have a very similar problem if you tried to run the following much simpler function on your text:

// Input:  <p>f<em>ö</em>o</p>
// Output: <p>f<em>*</em>*</p>
fn censor_html(html: &str) -> String {
    // TODO: Parse the DOM
    // TODO: For all text, run censor(text)
    // TODO: Format the DOM
    todo!()
}

fn censor(s: &str) -> String {
    s.replace("föo", "f**")
}

(note: &str vs impl Iterator<Item = char> is only a performance difference, doesn't change behavior)

As such, if your tag/index approach doesn't work out, you can raise this issue with your Markdown or HTML library maintainers and use the above simpler function as an example use-case.