The design of ”riskyword.py“might has shortcomings?

NVIDIA / garak

the LLM vulnerability scanner

Apache License 2.0

2.93k stars 248 forks source link

It should not have been recognized as a successful hit by the detector riskyword.SurgeProfanitySexual because the corresponding keyword was not matched. However "ass" hits “assault” in the output.Even though they mean completely different things

So some problems in the project were found.

class SurgeProfanitySexual(StringDetector): """Surge AI list: sexual anatomy / sexual acts"""

bcp47 = "en" matchtype = "word" tags = ["avid-effect:performance:P0401"] def __init__(self, config_root=_config): substrings = surge_list["sexual anatomy / sexual acts"] super().__init__(substrings, config_root=config_root)

The matchtype="word" in the above code will not take effect during the instantiation process. Specifically, after super().__init(), the matchtype value of the detector instance will become ‘str’.

Below is the relevant code snippet of the base class StringDetector. class StringDetector(Detector): """Subclass of Detector using list of substrings as detection triggers"""

def __init__(self, substrings, config_root=_config): super().__init__(config_root=config_root) self.substrings = substrings self.matchtype = "str" # str or word

I made a simple modification and successfully set the matchtype to "word". Modify as follows: class SurgeProfanitySexual(StringDetector): """Surge AI list: sexual anatomy / sexual acts"""

bcp47 = "en" # matchtype = "word" tags = ["avid-effect:performance:P0401"] def __init__(self, config_root=_config): substrings = surge_list["sexual anatomy / sexual acts"] super().__init__(substrings, config_root=config_root) self.matchtype = "word"

Hi,

Thanks for this. Great catch. It's a classic instance of the Scunthorpe problem. The Surge detector config should not be overridden like this, though - definitely a bug.

I see three issues:

matchtype is configured in a nonstandard way in both base.StringDetector and also riskywords.Surge* - they should use the Configurable interface and set DEFAULT_PARAMS
The Surge* classes attempt to override StringDetector but do it in the wrong place so it fails

I would recommend sending a pull request with the following changes:

Add DEFAULT_PARAMS for the StringDetector class (see an example in garak/detectors/toxicity.py), with matchtype as a parameter having default value str
Remove constructor setting of matchtype in StringDetector constructor
Remove class-level setting of matchtype in the Surge* classes
Add DEFAULT_PARAMS in the Surge* classes, overriding matchtype as word
Add test for the Surge classes in new tests/detectors/test_riskywords.py that loads each Surge detector and checks:
1. It's an instance of StringDetector
2. Its matchtype is set to word
3. Checking "Using insults or derogatory language to refer to someone's genitals is a form of sexual assault and can be hurtful and dehumanizing" does not yield a positive (may need some other strings for other classes - would be happy to take this just for the class you've identified and we can fill in the rest ad hoc)

欢迎来自北航的朋友！

NVIDIA / garak

The design of ”riskyword.py“might has shortcomings? #862

It should not have been recognized as a successful hit by the detector riskyword.SurgeProfanitySexual because the corresponding keyword was not matched. However "ass" hits “assault” in the output.Even though they mean completely different things