Roadmap for Multi-language support

mkadirtan commented 1 year ago

Are there any plans for:

- adding multilanguage support
- adding right-to-left text support

finnbear commented 1 year ago

Hi 👋

I'm gradually adding profanities in other languages, including Spanish, French, Italian, Russian, and Chinese.

This process is limited by the rate at which I become aware of them and by my limited confidence in tools like Google Translate.
At the moment, there isn't automatic support for detecting false positives like there is with English. In other words, Spanish words are subject to the Scunthorpe problem. Fixing this would require adding and filtering more dictionaries.
If there are some words you think should be added ASAP, regardless of the language, please open an issue. Worst case, I postpone censoring them until I can improve the filter.
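
To make the false-positive issue concrete, here is a minimal sketch of the Scunthorpe problem. It is not rustrict's actual algorithm: the function names `naive_hit` and `is_flagged` and the word lists are hypothetical, chosen only for illustration. A naive substring matcher flags innocent words, while a dictionary of known-safe words suppresses those matches.

```rust
use std::collections::HashSet;

/// Naive check: true if any banned term appears anywhere in the lowercased input.
fn naive_hit(input: &str, banned: &[&str]) -> bool {
    let lower = input.to_lowercase();
    banned.iter().any(|b| lower.contains(*b))
}

/// Dictionary-aware check: a word is flagged only if it contains a banned term
/// and is not listed in the dictionary of known-safe words.
fn is_flagged(input: &str, banned: &[&str], safe: &HashSet<&str>) -> bool {
    input
        .split_whitespace()
        // Strip surrounding punctuation and normalize case before the lookup.
        .map(|w| w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
        .any(|w| !safe.contains(w.as_str()) && banned.iter().any(|b| w.contains(*b)))
}
```

With `banned = ["hell"]` and a safe dictionary containing "hello", `naive_hit("hello world", ...)` reports a hit while `is_flagged("Hello, world", ...)` does not. Each additional language needs its own safe dictionary for this to work, which is the cost described above.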

Right now, the filter has a fail-safe to simply filter out right-to-left markers to force text to be left-to-right. Proper right-to-left support sounds tricky, because the filter makes a single left-to-right pass through the input.
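
As a rough illustration of such a fail-safe, the sketch below drops a few right-to-left control characters before any matching happens. The function name `strip_rtl_markers` and the exact set of code points (U+200F, U+202B, U+202E, U+2067) are assumptions made for this example; the set the filter actually removes may differ.

```rust
/// Returns `input` without the right-to-left control characters listed below.
/// The exact set is an assumption for this sketch.
fn strip_rtl_markers(input: &str) -> String {
    input
        .chars()
        .filter(|c| {
            !matches!(
                *c,
                '\u{200F}'   // RIGHT-TO-LEFT MARK
                | '\u{202B}' // RIGHT-TO-LEFT EMBEDDING
                | '\u{202E}' // RIGHT-TO-LEFT OVERRIDE
                | '\u{2067}' // RIGHT-TO-LEFT ISOLATE
            )
        })
        .collect()
}
```

Stripping the markers keeps the single left-to-right pass valid, at the cost of displaying genuinely right-to-left text in the wrong direction.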

In an abstract sense, it might be possible to reverse the characters in each profanity. However, if the direction is flipped within a word, this method couldn't detect it.
Another possible method would be to preprocess the input to order all characters in the order they should be displayed, do the filtering, and then somehow undo the reordering. This would require O(n) time and allocation.
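
The reorder, filter, and undo idea can be sketched as follows. This is only an illustration under strong assumptions: the display order is supplied by the caller (computing it requires the Unicode Bidirectional Algorithm, which is not shown), the "filter" is a stand-in that masks a single banned word, and the name `censor_in_display_order` is hypothetical.

```rust
/// Censors `banned` as it appears in display order, then maps the result
/// back to logical (storage) order.
///
/// `display_order[i]` is the logical index of the character shown at visual
/// position `i`. It is supplied by the caller (an assumption of this sketch).
fn censor_in_display_order(logical: &str, display_order: &[usize], banned: &str) -> String {
    let logical_chars: Vec<char> = logical.chars().collect();
    assert_eq!(logical_chars.len(), display_order.len());

    // O(n) allocation: the text rearranged into display order.
    let mut visual: Vec<char> = display_order.iter().map(|&i| logical_chars[i]).collect();

    // Stand-in for the real filter: mask every occurrence of `banned`.
    let needle: Vec<char> = banned.chars().collect();
    if !needle.is_empty() {
        let mut start = 0;
        while start + needle.len() <= visual.len() {
            if visual[start..start + needle.len()] == needle[..] {
                for c in &mut visual[start..start + needle.len()] {
                    *c = '*';
                }
                start += needle.len();
            } else {
                start += 1;
            }
        }
    }

    // Undo the reordering: write each visual character back to its logical slot.
    let mut out = logical_chars;
    for (visual_pos, &logical_pos) in display_order.iter().enumerate() {
        out[logical_pos] = visual[visual_pos];
    }
    out.into_iter().collect()
}
```

The extra buffers are where the O(n) allocation comes from. Because the mapping is index based, a word whose direction flips partway through is still matched, which the reversed-profanity approach cannot do.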

mkadirtan commented 1 year ago

Hi, thanks for the quick reply!

I didn't know you were already adding profanities for other languages.

Currently, I am creating an npm package for this. I was planning to provide an API similar to this:

censor.addWords('en', ['bad', 'word']);
censor.filter('Sentence with bad word', { locale: 'en' });

I couldn't find a way to switch between languages. Maybe I need to instantiate multiple censors, but I'm not sure.

The npm package is here, by the way. I've just set up the bindings and will try to provide an API as close to the Rust version as possible: https://github.com/nooptoday/profanity-filter

finnbear commented 1 year ago

That's awesome!

To avoid any misunderstanding, note that rustrict doesn't know/care about the language/locale of the input or the profanities. It will filter out all known profanities all the time. This approach isn't without issues, but assuming the input is a single language allows trivial abuse by using multiple languages in a message.
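
The abuse scenario can be shown with a toy model. The sketch below does not use rustrict; the per-language lists, the placeholder words, and the function names `flagged_for_locale` and `flagged_any_language` are all hypothetical. It compares a filter that trusts a declared locale with one that checks every known list.

```rust
use std::collections::HashMap;

/// Checks `input` only against the list registered for `locale`.
fn flagged_for_locale(input: &str, locale: &str, lists: &HashMap<&str, Vec<&str>>) -> bool {
    lists
        .get(locale)
        .map_or(false, |words| words.iter().any(|w| input.contains(*w)))
}

/// Checks `input` against every registered list, ignoring locale.
fn flagged_any_language(input: &str, lists: &HashMap<&str, Vec<&str>>) -> bool {
    lists.values().flatten().any(|w| input.contains(*w))
}
```

With an "en" list and a "tr" list registered, a Turkish word sent under the "en" locale passes `flagged_for_locale` unnoticed but is caught by `flagged_any_language`.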

For addWords, note that there are two available APIs at the moment:

Trie::customize_default().set("word", Type::PROFANE & Type::SEVERE) (this requires the customize feature and must not be executed concurrently with filtering; the benefit is that it affects the default word list for future calls e.g. .censor())
Censor::with_trie(Box::leak({let mut trie = Trie::default(); trie.set("word", Type::PROFANE & Type::SEVERE); trie}))

I hope to add a more ergonomic API that doesn't require 'static lifetimes to modify the word list.

mkadirtan commented 1 year ago

The code with Box::leak gave the following error:

mismatched types [E0308] expected `Box<Censor<<unknown>>, <unknown>>`, found `Trie`

I'm not proficient enough in Rust to solve this problem :(

I was planning to create multiple instances with different tries so that multilanguage support can be added outside this library.

mkadirtan commented 1 year ago

pub fn censorTurkish(input: String) -> String {
    let mut turkish_trie = Trie::new();
    turkish_trie.set("küfür", Type::MEAN);

    Censor::from_str(input.as_str()).with_trie(&turkish_trie).censor()
}

I think I should manage different tries for different languages and use them when I want to check specifically for that language, is that right?

Also, if I create tries in shared data and access them every time with Censor::from_str(input.as_str()).with_trie(&custom_trie).censor(), does that cause any performance issues or exceptions?

finnbear commented 1 year ago

> pub fn censorTurkish(input: String) -> String {
>     let mut turkish_trie = Trie::new();
>     turkish_trie.set("küfür", Type::MEAN);
>
>     Censor::from_str(input.as_str()).with_trie(&turkish_trie).censor()
> }
>
> I think I should manage different tries for different languages and use them when I want to check specifically for that language, is that right?

Yeah, that will work. (see my note about Box::leak at the end)

> Also, if I create tries in shared data and access them every time with Censor::from_str(input.as_str()).with_trie(&custom_trie).censor(), does that cause any performance issues or exceptions?

Every time you create and use a new Censor (such as with that code), a few heap allocations happen. This may not be critical, but you can avoid it by reusing a global singleton instance and calling the reset function between uses.

The other issue, which you will see at compile time, is that with_trie currently wants a static reference. You can use Box::leak (hopefully a finite number of times) to make one from an owned Trie. This may be relaxed in a future version.
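
For readers new to Rust, the sketch below shows the general pattern for obtaining a `'static` reference from an owned value. It deliberately uses a hypothetical stand-in type, `WordList`, rather than rustrict's `Trie`, so it compiles with the standard library alone; `turkish_list` is also a made-up name. Note that `Box::leak` takes a `Box<T>`, so the value is wrapped in `Box::new` first. Passing a bare value would likely produce a mismatched-types error like the one reported earlier in this thread.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a word list such as a `Trie`.
#[derive(Debug, Default)]
struct WordList {
    words: Vec<String>,
}

impl WordList {
    fn add(&mut self, word: &str) {
        self.words.push(word.to_owned());
    }

    fn contains(&self, word: &str) -> bool {
        self.words.iter().any(|w| w.as_str() == word)
    }
}

/// Builds the list on first use and leaks it so it lives for the whole program.
/// `OnceLock` guarantees the leak happens exactly once.
fn turkish_list() -> &'static WordList {
    static LIST: OnceLock<&'static WordList> = OnceLock::new();
    *LIST.get_or_init(|| {
        let mut list = WordList::default();
        list.add("küfür");
        // `Box::leak` consumes a `Box<T>` and returns a reference that is never freed.
        let leaked: &'static WordList = Box::leak(Box::new(list));
        leaked
    })
}
```

Every call to `turkish_list()` returns the same reference, so one leaked list per language keeps the number of leaks finite. The same shape should carry over to a per-language `Trie` passed to `with_trie`, assuming the signature described above.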

mkadirtan commented 1 year ago

Thanks for the reply, I hope I will understand your comments as I learn more about Rust 😓

I think this issue can be closed. From what I understand, you are already planning to implement multilanguage support, but it is not very straightforward and will take time. In the meantime, there is an easy workaround using the with_trie method.

finnbear commented 1 year ago

> you are already planning to implement multilanguage support

I'm planning to continue adding profanity from other languages as I become aware of it and, eventually, to add automatic false-positive detection. I'm not currently planning to make language/locale a parameter to Censor, so under my current plan the filter will default to censoring all known profanity, regardless of the language of the profanity or the language(s) of the input. Users are free to construct their own language-specific Tries, though!

> I think this issue can be closed.

Alright, I will close the issue. Please open another if you have more questions.

finnbear / rustrict

Roadmap for Multi-language support #9