Cache: Design: Server replies early with UNKNOWN rather than running the detectors synchronously.

josdejong commented 6 years ago

I noticed that when using both vulnRegexDetector.test and vulnRegexDetector.testSync, testing a new regexp always returns UNKNOWN the first few times, and only after a minute or so it returns VULNERABLE or SAFE. It would be nice if the library simply waits until the server knows the result and only then return it, right now I cannot reliably use the library in automated unit tests.

davisjam commented 6 years ago

This behavior is by design. Testing a regex can take worst-case O(minutes)/regex * #regexes in line so I don't see a way to get a response truly synchronously without holding a ton of connections open or paying AWS to scale automatically.

Aside: This is a great fit for serverless if you know anyone giving that away for free.

If you really want to wait for a real response, here are the options I see:

Run code like this:

while (response === UNKNOWN) {
  response = test(...);
}

but this will make my server do extra work.

Install the "application" tools from this repo (src/detect, src/validate, access via bin/check-regex.pl) and run them locally. These tools are written in OCaml (Rathnayake) and Java (Wustholz, Weideman). The ./configure script will install and build everything so you can run your own queries.
Convert the detectors to JavaScript ;-)

I'm open to other suggestions but I don't see any good alternatives right now.

josdejong commented 6 years ago

I understand that testing can require a lot of CPU and that this can get costly.

I'm looking from a usability point of view though. People using the service do that to get an answer to the question whether their regexp is safe or not. If the service returns "UNKNOWN", this is not helpful at all and strongly hurts usability. So people will either have to (a) write a loop to regularly check again until the service returns VULNERABLE/SAFE, or (b) find an other service or application that can answer this question.

If you do want this service to be usable in practice I think it's better to solve this issue in the library itself (where you are in control), rather than leave it up to the users to write their own while-loops with unknown intervals (where you don't have control on the impact on your server). I guess you will have to do some testing on what's least expensive: keeping a request open until the test is finished, or regularly poll via a new request until the test is finished.

Ideal would indeed be if people can easily run the tests locally on their own computer. That would mean porting the logic from Java to node.js, or maybe embedding the Java application inside vuln-regex-detector and run it via node-java or something like that.

josdejong commented 6 years ago

As for hosting costs: you could simply approach Amazon or Heroku or others, explain your project (making the software world a bit more safe), and ask if they would be interested in giving you hosting for free for this service. Some companies are really positive about these sort of initiatives and eager to lend a helping hand.

davisjam commented 6 years ago

If the service returns "UNKNOWN", this is not helpful at all and strongly hurts usability.

The intended use case is CI (see my "pseudo eslint plugin") where the code base is being scanned repeatedly. With this approach an answer should be available relatively soon (e.g. within 1-2 hours in the worst case) so across two CI runs you'll get an answer. This isn't the same as immediately, but I wouldn't describe it as unusable nor unhelpful.

The client does save results in a local cache once it gets a solid answer, reducing the overall network traffic.

If you think this design is unacceptable, I'm happy to discuss a new design and review a PR. I can update the code on the current server as needed.

Ideal would indeed be if people can easily run the tests locally on their own computer.

I've made this as easy as I know how. Run ./configure and then look in bin/ for scripts to run queries against individual regexes, files, directory trees, and git URLs. These are all documented in the project READMEs.

The client/server setup is intended to remove the dependency costs and permit easy integration into existing CI setups.

you could simply approach Amazon or Heroku or others

At the moment I don't have the time to pursue this myself. Seems a bit beyond my purview as a researcher. If anyone wants to take the lead on this, I'd be delighted to join in the fun.

josdejong commented 6 years ago

Yes, sorry, this doesn't render the service completely unusable.

At my work, alarm bells go off when a CI build fails, you want to prevent that from happening needlessly. For unit tests, integration tests, and CI one aspect I find very important is that it runs reliability and consistently. But if the CI system "sometimes" fails, or fails on a next build for issues of some time ago rather than the changes at hand, that is really bad for the confidence in the CI system. I personally would implement a polling mechanism myself then.

I've made this as easy as I know how. Run ./configure and then look in bin/ for scripts to run queries against individual regexes, files, directory trees, and git URLs. These are all documented in the project READMEs.

Ah, I hadn't seen that yet in the docs (was looking at the docs on npm which differ from those on github). That sounds promising, thanks. Will give that a try.

davisjam commented 6 years ago

See here for whole repo docs and here for bin/ docs.

josdejong commented 6 years ago

Thanks!

davisjam commented 6 years ago

Do let me know your thoughts when you've had a chance to look it over.

josdejong commented 6 years ago

ok will do!

davisjam commented 6 years ago

It looks like a microservices approach like AWS Lambda might be a good fit for a "don't reply until you have an answer" design. The expected workload should fit in the free-or-very-cheap tier. Initial funding could come from e.g. the AWS Research Credits program.

If I understand the architecture right, each lambda could handle a single request, check a backend DB, and either run the detectors locally or return the DB's stored result.

josdejong commented 6 years ago

I tried to run the check-regex.pl script to test a single regex, no luck so far though I think I'm close. What would help me is:

A link from the README.md in the root of the project to the readme in the folder /bin (I only found it thanks to you linking it above)
In the /bin/README.md a concrete usage example on how to call the scripts.

In the end I tried the following, which gives "INTERNAL-ERROR" results:

jos@jostrix ~/tmp/vuln-regex-detector $ VULN_REGEX_DETECTOR_ROOT=. ./bin/check-regex.pl ./bin/test/check-regex/unsafe-1.json 
Config says to use the cache
Config says useCache 1
Query says I should not use the cache
Using default for detectVuln_memoryLimit: 8192
Using default for detectVuln_timeLimit: 60
Querying detectors
./src/detect/detect-vuln.pl /tmp/check-regex-6491.json 2>>/tmp/check-regex-6491-progress.log
Checking rathnayake-rxxr2 for timeout-triggering evil input
Checking weideman-RegexStaticAnalysis for timeout-triggering evil input
Checking wuestholz-RegexCheck for timeout-triggering evil input
{"isVulnerable":0,"pattern":"(a+)+$","detectReport":{"detectorOpinions":[{"secToDecide":"0.0160","hasOpinion":0,"name":"rathnayake-rxxr2","opinion":"INTERNAL-ERROR"},{"secToDecide":"0.0847","opinion":"INTERNAL-ERROR","name":"weideman-RegexStaticAnalysis","hasOpinion":0},{"hasOpinion":0,"opinion":"INTERNAL-ERROR","name":"wuestholz-RegexCheck","secToDecide":"0.0409"}],"memoryLimit":"8192","timeLimit":"60","pattern":"(a+)+$"}}

davisjam / vuln-regex-detector

Cache: Design: Server replies early with UNKNOWN rather than running the detectors synchronously. #39