bee-san / pyWhat

🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️
MIT License
6.52k stars 350 forks source link

[Proposal] Rarity score from RegEx #238

Open nodtem66 opened 2 years ago

nodtem66 commented 2 years ago

Is your feature request related to a problem? Please describe. The rarity is used to sort the result of PyWhat. However, I feel like it's a subjective value that didn't have a formal way to define it. Currently, I'm just looking at the rarity of neighboring RegExps, using my own gut to decide it, and waiting for someone to reject or confirm.

This is a current definition of a rarity on the wiki page

rarity (float between 1.0 and 0.0): How unlikely is it to be a false-positive? Think about how big of a chance there is of something completely different to match your regex. Choose 1 for very unlikely, 0 for very likely.

Some tips on how to pick rarity:

1 - contains a word that is unique to it
0.7 - matches to a specific pattern and characters
0.5 - mostly matching to specific characters
0.3 - pretty broad, has only a few specific characters
0.2 - broad, almost no specific characters
0 - matches to almost everything

I need some way to calculate the rarity of RegExps or tokens.

Describe the solution you'd like Deterministic way to calculate rarity

Describe alternatives you've considered N/A (I can't figure out the alternatives)

Additional context

bee-san commented 2 years ago

This is cool! Is Regexp Score the 'rarity' in this case? Can you build a function that other people can use to easily test this? Maybe put it into /scripts?

If you do Regexp Score mod 10 it'll put it into the 10-point range for us :) And I might suggest rounding the result in another column too (so 0.51 becomes 1, 0.49 becomes 0) just to see what that's like :D

thanks so much for this, this is absolutely great 🔥 A non-subjective formal way to define rarity would be absolutely amazing :)

nodtem66 commented 2 years ago

It seems to be promising, but it's just a prototype and needs a lot of works. For example:

By the way, you can play the script in Google Colab. This is the RegExScore class source code:

import sre_parse
import strings

# calculate score from literal string (e.g. prefix, suffix)
# use weight metric (*_score) parameters
class RegExScore():
  def __init__(self, 
               repeat_score = 0.01, # score for quantifier `{0,}`
               in_score = 0.1,      # score for character set `[*]`
               ascii_score=1.0,     # score for a fixed ascii `a-zA-z`
               digit_score=0.2,     # score for a fixed digit `0-9`
               literal_default_score=0.01, # score for whitespaces
               debug=False # print the debug message
    ):
    self.repeat_score = repeat_score
    self.in_score = in_score
    self.ascii_score = ascii_score
    self.digit_score = digit_score
    self.literal_default_score = literal_default_score
    self.debug = debug

  def calculate(self, regexp:str):
    return self.token_score(sre_parse.parse(regexp))

  def token_score(self, tokens:tuple):
    score = 0
    for _token in tokens:
      if self.debug:
        print("Loop: ", _token)

      # add the score from subpattern `()`
      if _token[0] == sre_parse.SUBPATTERN:
        _, _, _, child = _token[1]
        if self.debug:
          print(_token[0], len(child))
        score += self.token_score(child)
      # add score from quantifier `{min,max}`
      elif _token[0] == sre_parse.MAX_REPEAT:
        _min, _max, child = _token[1]
        _score = self.repeat_score * (_min + 0 if _max == sre_parse.MAXREPEAT else _max)
        if self.debug:
          print('\tscore:', _score)
        score += _score + self.token_score(child)
      # add score from mean of branch group `A|B|C|D`
      elif _token[0] == sre_parse.BRANCH:
        _, branch = _token[1]
        if self.debug:
          print('\tbranch:', len(branch))
        sub_score = 0
        for child in branch:
          sub_score += self.token_score(child)
        score += sub_score / float(len(branch))
      # add score from character set `[]`
      elif _token[0] == sre_parse.IN:
        if self.debug:
          print('\tscore:', self.in_score)
        score += self.in_score
      # add score from fixed literal
      elif _token[0] == sre_parse.LITERAL:
        literal = chr(_token[1])
        if self.debug:
          print('\tchr:', literal)
        if literal in string.ascii_letters:
          score += self.ascii_score
        elif literal in string.digits:
          score += self.digit_score
        else:
          score += self.literal_default_score
    return score

Feel free to comment or suggest your thoughts. I'm looking forward to discussing this with anyone.