Better unsafe regex detector

davisjam commented 6 years ago

Hi all,

I'm a systems/security researcher at Virginia Tech and have been studying the incidence of vulnerable regexes in the wild.

This plugin's unsafe regex detector relies on safe-regex, which uses star height (nested quantifiers) to identify unsafe regexes.

Pros:

safe-regex is fast.
safe-regex is an npm module which makes it easy to work with.
safe-regex has no non-JS dependencies.

As a result, safe-regex is great for CI use cases.

Cons:

safe-regex is incorrectly implemented and substack is not maintaining it.
safe-regex has lots of false positives (e.g. (ab+)+).
safe-regex will only identify one type of exponential-time vulnerability, and ignores all polynomial-time vulnerabilities. In my research I found that, in the wild, polynomial-time vulnerabilities are far more common than exp-time vulnerabilities.

There are some alternatives to safe-regex that report exploit strings so you can tell if they're correct or not.

Rathnayake's rxxr2. Like safe-regex, this only checks for star height-style vulnerabilities. But it doesn't have false positives as far as I can tell.
Wustholz's REXPLOITER. This tests star height and other exp-time vulnerabilities, plus poly-time vulnerabilities.
Weideman's RegexStaticAnalysis. Like Wustholz's REXPLOITER, but open-source and it works better.

Unfortunately:

These alternatives all have non-JS dependencies (e.g. OCaml or Java) and have inconsistent interfaces.
Some (especially Weideman) can take minutes to test a single regex.

My project vuln-regex-detector provides a convenient wrapper for these alternatives, and enforces time and memory limits to get results or fail relatively quickly.

However, I'd be surprised if developers were willing to wait even 30 seconds for linting. To address that, I'm nearly done implementing a server side so queries can be answered by hitting the server for a pre-computed answer instead of doing the expensive computation locally. The server processes not-seen-before queries in the background so subsequent queries will get a real answer.

Once that's done, would you folks be interested in hitting my server first and falling back to safe-regex if my server hasn't seen the query before? I've got a sample client that can be used with a one-line tweak for this use case.

davisjam commented 6 years ago

@evilpacket I'm happy to submit a PR with a demonstration if desired.

stephanschubert commented 6 years ago

@evilpacket Sounds good?

evilpacket commented 5 years ago

@davisjam You've updated safe-regex since submitting this and I plan to bring in that change.

For something that hits an external source I'd like it to be configurable or a different rule and clear as to what is sent to the remote server etc. A lot of people get nervous when code they are scanning starts to get sent to a remote destination, but having this capability would be nice. Would the server side be something that others could also run themselves?

lirantal commented 5 years ago

@evilpacket the gist of it is probably all at https://github.com/davisjam/vuln-regex-detector/

devinrhode2 commented 5 years ago

Yeah I think best bet here is to not be sending anything to a remote server, and instead implement the caching stuff locally.... At least ask the person running this tool for permission when a new regex is being sent out

davisjam commented 5 years ago

The server-side code is available, but would have to be run on the user's side.

Unfortunately the existing advanced analyses are written in Java and OCaml, not JS.

I have half a million labeled regexes, so I suppose another option is to ship this database in safe-regex in compressed form as a "cache" of sorts.

MatthewHerbst commented 5 years ago

@davisjam is there a technical reason preventing the tools from being written in JS, or has that work just not been done by anyone?

davisjam commented 5 years ago

The analysis could be written in JavaScript, and I would happily incorporate it into safe-regex. I can point anyone interested in this in the right direction.

On Wed, Sep 11, 2019, 9:51 PM Matthew Herbst notifications@github.com wrote:

@davisjam https://github.com/davisjam is there a technical reason preventing the tools from being written in JS, or has that work just not been done by anyone?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nodesecurity/eslint-plugin-security/issues/28?email_source=notifications&email_token=AFOD3L2QPZXEZ2NYGR73E3LQJGOCNA5CNFSM4EYDDY7KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6QMV4Q#issuecomment-530631410, or mute the thread https://github.com/notifications/unsubscribe-auth/AFOD3L2EY7V6EJKQCHBZAO3QJGOCNANCNFSM4EYDDY7A .

makenowjust commented 3 years ago

@davisjam I will introduce my research MakeNowJust-Labo/redos.

This is a hybrid ReDoS detection library. "Hybrid" means, it combines automaton based detector and fuzzing based detector. When the RegExp pattern is hard to be checked by the automaton based, it switches to use the fuzzing based. This allows checking many patterns in the real world.

This library is written in Scala, but it is compiled to JavaScript by Scala.js. So, you can use this from JavaScript easily. In addition, I published eslint-plugin-redos for easy use. Feel free to try it!

davisjam commented 3 years ago

@MakeNowJust Thanks for letting me know about it! Looks cool.

I looked over the documentation (not the implementation) but I'm not sure exactly what you've made. Are you building on the academic research on the topic (e.g. Shen for fuzzing, Weideman/Wustholz/Rathnayake for NFA analysis)? What guarantees does the library offer?

makenowjust commented 3 years ago

@davisjam I'm not sure the below may be an answer to you. Sorry that my English is poor.

Automaton based detector: It checks whether the pattern potentially causes catastrophic backtracking (exponential, and 2nd or higher degree polynomial matching time) or not. This detector is built on automata theory, like Weideman's RegexStaticAnalysis.
Fuzzing based detector: It finds a string in a limited size whose VM execution steps exceeds the specified limit by using the fuzzing technique. The VM is based on the QuickJS implementation, which is the simple and naive implementation of JS standard (ECMA-262).

More discussions are needed to conclude "what they detect are same.", but I believe both of them can find a vulnerability. And, each of them has pros and cons. For example, automaton based detector can find vulnerability precisely, however it is very slow on several patterns, and hard to support some syntax. On the other hand, fuzzing based detector can support all syntax, and we can control analyzing time easily, however it is hard to avoid false-negative. So, I think it is better to combine both.

Fuzzing based detector refers Shen's ReScue, but my implementation is very extended. In fact, ReScue cannot find a polynomial order case, but my implementation can in many cases.

I'm preparing a paper about this now. And, I'm going to talk about this at the 62nd programming symposium in Japan (I'll talk in Japanese, sorry). I'm planning to explain detailed ideas (detector selection algorithm, and fuzzing improvements) here.

davisjam commented 3 years ago

@MakeNowJust Happy to discuss more over email -- I sent you a note a few days ago.

thernstig commented 1 year ago

@nzakas would it make sense to replace https://www.npmjs.com/package/safe-regex with https://github.com/makenowjust-labs/recheck? ReCheck seems better maintained, but I am in no position to gauge if it is better or not nor if it is a smart choice to switch.

davisjam commented 1 year ago

(I own safe-regex. It's very simple, not much to maintain).

ReCheck is employing a more sophisticated algorithm that, based on the docs, should have more true positives and fewer false positives compared to safe-regex.

ReCheck will be more time consuming to run.

One potential check is how ReCheck would perform on a regex heavy codebase (say, the marked project for markdown). Based on my experience with similar tools, a local cache might be helpful for on-workstation use. I'm not sure if ReCheck has one.

Get Outlook for iOShttps://aka.ms/o0ukef

From: Tobias Hernstig @.> Sent: Monday, January 30, 2023 4:11:47 AM To: eslint-community/eslint-plugin-security @.> Cc: Davis, James C @.>; Mention @.> Subject: Re: [eslint-community/eslint-plugin-security] Better unsafe regex detector (#28)

---- External Email: Use caution with attachments, links, or sharing data ----

@nzakashttps://github.com/nzakas would it make sense to replace https://www.npmjs.com/package/safe-regex with https://github.com/makenowjust-labs/recheck? ReCheck seems better maintained, but I am in no position to gauge if it is better or not nor if it is a smart choice to switch.

— Reply to this email directly, view it on GitHubhttps://github.com/eslint-community/eslint-plugin-security/issues/28#issuecomment-1408232698, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AFOD3L5BIMOYH2KTYM7KN7TWU6AVFANCNFSM4EYDDY7A. You are receiving this because you were mentioned.Message ID: @.***>

thernstig commented 1 year ago

@davisjam your original post indicated that https://github.com/davisjam/vuln-regex-detector would be a better choice than https://github.com/davisjam/safe-regex, that you are both maintainer of.

Your last post in this thread indicated you reached out to @makenowjust but that seem to never have concluded something in this thread at least.

The question is then, are both projects worthy replacements and the option should be on the user, or what would be the best fit for eslint-plugin-security?

I think most code bases uses very few regexes that the extra time needed should be low enough so that the one with the most true positives and fewest false positives would be the best one to use, but that is my subjective opinion. We benchmarked safe-regex vs recheck and we are only talking milliseconds for some of the more complex regexes in our tests, but they were not exhaustive.

nzakas commented 1 year ago

The goal of the plugin is to try to catch as many problems as possible with as few false positives and false negatives as possible. Anything that moves us further in that direction is worthwhile to investigate.

That said, I'm not a security researcher, so I'd lean on the recommendations of folks like @davisjam to decide how to move forward here.

thernstig commented 1 year ago

@nzakas neither are we. We tested all of them and found the one by @makenowjust to have the least false positives and most true positives, but granted we just tested a few regexes. So that is very anecdotal and should not hold any ground for choosing one.

nzakas commented 1 year ago

If you could share your findings in this thread, maybe we can get some folks to review them and verify.

thernstig commented 1 year ago

@nzakas here are our simple results:

regex	Expected result	recheck async	recheck sync	safe-regex
`^test`	safe	safe (0.575 ms)	safe (30.778 ms)	safe (0.722 ms)
`^((a)+)+z$`	vulnerable	vulnerable (6.740 ms)	vulnerable (95.845 ms)	vulnerable (0.571 ms)
`^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$`	safe	safe (0.986 ms)	safe (4.159 ms)	vulnerable (1.422 ms)
`(\w+)\_(\w+)$`	vulnerable	vulnerable (2.950 ms)	vulnerable (43.759 ms)	safe (0.692 ms)
`^[0-9a-f]{4}$`	safe	safe (0.418 ms)	safe (1.655 ms)	safe (0.720 ms)
`^[0-9a-f]{4,30}$`	safe	safe (9.319 ms)	safe (228.656 ms)	safe (1.250 ms)
`^(?<foo>(?:Foo\\|Bar))$`	safe	safe (25.991 ms)	safe (51.447 ms)	safe (0.758 ms)
`^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$`	safe	safe (2.964 ms)	safe (3.086 ms)	safe (0.862 ms)

Using this script (not the prettiest):

import { performance } from 'node:perf_hooks';
import { check, checkSync } from 'recheck';
import safeRegex from 'safe-regex';

// See https://github.com/davisjam/vuln-regex-detector/issues/81
// 'vuln-regex-detector' is not tested as it does not work properly

const regexes = [
  { regex: /^test/, vulnerable: false },
  { regex: /^((a)+)+z$/, vulnerable: true },
  { regex: /^[0-9a-f]{8}-([0-9a-f]{4}-){3}[0-9a-f]{12}$/, vulnerable: false },
  { regex: /(\w+)\_(\w+)$/, vulnerable: true },
  { regex: /^[0-9a-f]{4}$/, vulnerable: false },
  { regex: /^[0-9a-f]{4,30}$/, vulnerable: false },
  { regex: /^(?<foo>(?:Foo|Bar))$/, vulnerable: false },
  {
    regex: /^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$/,
    vulnerable: false,
  },
];

const resultPrinter = (result) => (result ? 'safe' : 'vulnerable');

// Call each function once in case they have code just running once on first call
safeRegex('foo');
await check('foo', '');

for (const regex of regexes) {
  const pattern = regex.regex.source; // String (RegExp.prototype.source)
  let name; // npm package name
  let result;

  // recheck async
  name = 'recheck-async';
  performance.mark(`start-${name}`);
  result = await check(pattern, '');
  performance.mark(`end-${name}`);
  performance.measure(`${name} ${pattern}`, {
    detail: { name, pattern, result: result.status },
    start: `start-${name}`,
    end: `end-${name}`,
  });

  // recheck sync
  name = 'recheck-sync';
  performance.mark(`start-${name}`);
  result = checkSync(pattern, '');
  performance.mark(`end-${name}`);
  performance.measure(`${name} ${pattern}`, {
    detail: { name, pattern, result: result.status },
    start: `start-${name}`,
    end: `end-${name}`,
  });

  // safe-regex
  name = 'safe-regex';
  performance.mark(`start-${name}`);
  result = safeRegex(pattern);
  performance.mark(`end-${name}`);
  performance.measure(`${name} ${pattern}`, {
    detail: { name, pattern, result: resultPrinter(result) },
    start: `start-${name}`,
    end: `end-${name}`,
  });
}

const map = new Map();

performance.getEntriesByType('measure').forEach((measure) => {
  if (map.has(measure.detail.pattern)) {
    map.get(measure.detail.pattern).push(measure);
  } else {
    map.set(measure.detail.pattern, [measure]);
  }
});

map.forEach((measures, pattern) => {
  console.log(`regex: /${pattern}/`);
  console.log(
    `  expected result: ${
      regexes.filter((regex) => regex.regex.source === pattern)[0].vulnerable
        ? 'vulnerable'
        : 'safe'
    }`
  );
  measures.forEach((measure) => {
    console.log(
      `  ${measure.detail.name}: ${
        measure.detail.result
      } (${measure.duration.toFixed(3)} ms)`
    );
  });
  console.log();
});

nzakas commented 1 year ago

Thanks. For clarity, can you also add a column for expected result? (Useful for places where different checkers return different results.)

thernstig commented 1 year ago

Updated.

@davisjam and @makenowjust can you double-confirm the above results?

jgeurts commented 1 year ago

We've seen false positives with the unsafe regex rule and named capture groups. Would it be possible to add a row to that table for regular expressions with named capture group(s)? E.g. ^(?<foo>(?:Foo|Bar))$ or maybe ^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$

thernstig commented 1 year ago

@jgeurts I tried to update the table a few days ago, but ran into bugs with both of the packages when trying again:

See: https://github.com/makenowjust-labs/recheck/issues/758 https://github.com/davisjam/vuln-regex-detector/issues/81

I expect @makenowjust to fix the recheck problem, but I am not sure about vuln-regex-detector as the repository has very low activity.

Once I get some fixes, I plan to add a script in this issue that can be used to compare them. I will then add your regexes @jgeurts.

nzakas commented 1 year ago

Thanks @thernstig! We will wait to move forward for that update.

davisjam commented 1 year ago

Looks like @makenowjust has been able to resolve the issue in recheck 🥳 . I do not think I have time to resolve the issue in vuln-regex-detector.

FWIW The table of results ("expected result") in https://github.com/eslint-community/eslint-plugin-security/issues/28#issuecomment-1420883328 looks correct to me.

thernstig commented 1 year ago

@nzakas @makenowjust @jgeurts I have updated the table. Scrutinize the code used to test if you have time.

nzakas commented 1 year ago

Thanks so much! It does look like recheck is more accurate, but to get a true apples-to-apples comparison, I think we need to run checkSync instead of check for recheck. The check method is running asynchronously so the timing information isn't accurate.

makenowjust commented 1 year ago

check and checkSync are quite different: check forks the native recheck binary and uses it, but checkSync uses the pure JS implementation compiled by Scala.js. Thus, checkSync is often slower than check in the long term. However, since ESLint limitation, we cannot use check in ESLint plugin.

thernstig commented 1 year ago

@nzakas @makenowjust updated the table with checkSync. Definitely the results became worse. Great if ESLint in the future can allow for async calls.

So changing will suffer quite a bit of performance, but be more correct.

@nzakas I assume it is up to you to decide now what to do with this information. Let me know if there are other packages and/or regexes you want added.

ota-meshi commented 1 year ago

If asynchronous processing is a bottleneck for us, I think we might be able to use synckit. https://github.com/un-ts/synckit

thernstig commented 1 year ago

I assume that ESLint intends to allow for asynchronous processing sooner or later? Maybe better focus on fixing that?

nzakas commented 1 year ago

Async rules in ESLint are still in the future, so we can't wait for that.

Unfortunately, I don't think we can use recheck right now considering how much slower it is. If it were just a little slower, that would be one thing, but in some cases it's an order of magnitude slower, and I don't think the tradeoff is worth it.

thernstig commented 1 year ago

Let's see if @makenowjust can solve this as he is implementing synckit. If that improves the numbers using synckit is that acceptable @nzakas ?

nzakas commented 1 year ago

If that can bring the numbers closer to safe-regex, then definitely. Although it might be worth the effort to see if we can improve safe-regex to be more accurate.

kurtextrem commented 1 year ago

Looks like there is now also https://github.com/tjenkinson/eslint-plugin-redos-detector, using https://github.com/tjenkinson/redos-detector in case anyone is curious (or looking for alternatives). I haven't measured how redos-detector compares to recheck though.

eslint-community / eslint-plugin-security

Better unsafe regex detector #28