Charcoal-SE / metasmoke

Web dashboard for SmokeDetector.
https://metasmoke.erwaysoftware.com
Creative Commons Zero v1.0 Universal
43 stars 34 forks source link

blacklist pull requests: Improve regex for Metasmoke TP/FP statistics #54

Closed tripleee closed 6 years ago

tripleee commented 7 years ago

For example, https://github.com/Charcoal-SE/SmokeDetector/pull/430#issuecomment-270069822 indicates a problem with word boundaries. The test should probably add something like (^|[^A-Za-z0-9_]) before the phrase and ($|[^A-Za-z0-9_]) after it.

Undo1 commented 7 years ago

For future reference, the relevant code is https://github.com/Charcoal-SE/metasmoke/blob/master/app/controllers/github_controller.rb#L65.

tripleee commented 7 years ago

Other things which work poorly are \W which is used a lot in blacklist requests now, lookaheads / lookbehinds (no support for these in MySQL) and the fact that we should probably search title and body at the same time. Should there be more than one link? Or should the search be extended to have a "blacklist request support" mode where both title and body are searched, and some regex extensions are hacked in? (We could replace \W with [^A-Za-z0-9_], for example.)

tripleee commented 7 years ago

I also note that Smoke Detector (somewhat whimsically IMHO) replaces a space with \s before adding something to the blacklist. I don't think that's the way to go, really.

Undo1 commented 7 years ago

Could you expand on that last comment @tripleee? IIRC we had a discussion about that a long time ago. Why isn't it good to generalize whitespace matching?

tripleee commented 7 years ago

I think it violates the rule of least surprise. Either we support full regex and accept exactly the original input, or we have our own formalism (escaping all dots by default would seem more useful than the space to \s replacement, for example; but I think no replacing makes more sense).

I'm also thinking that if we remap stuff in the first place, we should remap in the opposite direction when we link to Metasmoke. For example, we should remap \W to something like [^a-z0-9_] in the searches we generate in PRs from users who don't have code admin privileges (... I think we call them).

Bottom line, if we do this at all, we should have a disciplined rationale rather than random organically grown syntactic sugar crust.

tripleee commented 7 years ago

(Sorry for repeating myself; the mobile Github app I'm trying out is otherwise nice, but doesn't show me earlier messages in the thread when composing a new comment. Also, no editing of the comment I just submitted.)

angussidney commented 7 years ago

Note that the regex substitutions which I introduced in gitmanager a while ago were to maintain compatiability with MS's basic regex capabilities, etc. The way the blacklist command works has changed a lot since then so it could do with updating.

ArtOfCode- commented 6 years ago

Actually works now