Open nodtem66 opened 3 years ago
This is cool! Is Regexp Score
the 'rarity' in this case? Can you build a function that other people can use to easily test this? Maybe put it into /scripts
?
If you do Regexp Score mod 10
it'll put it into the 10-point range for us :) And I might suggest rounding the result in another column too (so 0.51 becomes 1, 0.49 becomes 0) just to see what that's like :D
thanks so much for this, this is absolutely great 🔥 A non-subjective formal way to define rarity would be absolutely amazing :)
It seems to be promising, but it's just a prototype and needs a lot of works. For example:
Google Cloud Platform API Key
Regexp = "(?i)^([0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
rarity = 0.8
score = 0.07
By the way, you can play the script in Google Colab. This is the RegExScore class source code:
import sre_parse
import strings
# calculate score from literal string (e.g. prefix, suffix)
# use weight metric (*_score) parameters
class RegExScore():
def __init__(self,
repeat_score = 0.01, # score for quantifier `{0,}`
in_score = 0.1, # score for character set `[*]`
ascii_score=1.0, # score for a fixed ascii `a-zA-z`
digit_score=0.2, # score for a fixed digit `0-9`
literal_default_score=0.01, # score for whitespaces
debug=False # print the debug message
):
self.repeat_score = repeat_score
self.in_score = in_score
self.ascii_score = ascii_score
self.digit_score = digit_score
self.literal_default_score = literal_default_score
self.debug = debug
def calculate(self, regexp:str):
return self.token_score(sre_parse.parse(regexp))
def token_score(self, tokens:tuple):
score = 0
for _token in tokens:
if self.debug:
print("Loop: ", _token)
# add the score from subpattern `()`
if _token[0] == sre_parse.SUBPATTERN:
_, _, _, child = _token[1]
if self.debug:
print(_token[0], len(child))
score += self.token_score(child)
# add score from quantifier `{min,max}`
elif _token[0] == sre_parse.MAX_REPEAT:
_min, _max, child = _token[1]
_score = self.repeat_score * (_min + 0 if _max == sre_parse.MAXREPEAT else _max)
if self.debug:
print('\tscore:', _score)
score += _score + self.token_score(child)
# add score from mean of branch group `A|B|C|D`
elif _token[0] == sre_parse.BRANCH:
_, branch = _token[1]
if self.debug:
print('\tbranch:', len(branch))
sub_score = 0
for child in branch:
sub_score += self.token_score(child)
score += sub_score / float(len(branch))
# add score from character set `[]`
elif _token[0] == sre_parse.IN:
if self.debug:
print('\tscore:', self.in_score)
score += self.in_score
# add score from fixed literal
elif _token[0] == sre_parse.LITERAL:
literal = chr(_token[1])
if self.debug:
print('\tchr:', literal)
if literal in string.ascii_letters:
score += self.ascii_score
elif literal in string.digits:
score += self.digit_score
else:
score += self.literal_default_score
return score
Feel free to comment or suggest your thoughts. I'm looking forward to discussing this with anyone.
Is your feature request related to a problem? Please describe. The rarity is used to sort the result of PyWhat. However, I feel like it's a subjective value that didn't have a formal way to define it. Currently, I'm just looking at the rarity of neighboring RegExps, using my own gut to decide it, and waiting for someone to reject or confirm.
This is a current definition of a rarity on the wiki page
I need some way to calculate the rarity of RegExps or tokens.
Describe the solution you'd like Deterministic way to calculate rarity
Describe alternatives you've considered N/A (I can't figure out the alternatives)
Additional context