Shorten string before applying regex

JackuB commented 8 years ago

I did some hard science: http://jsperf.com/deckardcain-string-shortening/3

I think we should change flow to this:

run with short string (maybe not all tests)
when no match, continue tests, but with original string

What do you think? @hj @freaz

freaz commented 8 years ago

I am just thinking if it is necessary to have it here? Basically it's optimisation that can be applied in place where detection is done.

Anyway it is good carch that running rexexp against huge codes is unnecessary, because in most cases we are able to detect type from couple of chars from beginning.

So lets do it. :+1:

honzajavorek commented 8 years ago

:+1:

However, it has scenarios with wrong results:

Input is JSON/YAML Swagger, which also contains some string values indicating API Blueprint/legacy Apiary Blueprint. (test)
The input has the "swagger": key, but somewhere at the bottom as order of properties is undefined in JSON or YAML and shouldn't matter.
In this case, Deckard will shorten the input and may identify it as API Blueprint/legacy Apiary Blueprint.

honzajavorek commented 8 years ago

You may apply another optimization: Looking for swagger word in the input string using just .indexOf, which, I believe, is the fastest way to do simple lookup (I have no prove for this). If there is a match, you may proceed to the heavyweight RegExp matching. Also, .indexOf will give you position, so you have information about how much of the input you may afford to cut off.

freaz commented 8 years ago

@honzajavorek To you concern about wrong identification. I think we shouldn't do this https://github.com/apiaryio/deckardcain/blob/master/src/deckardcain.js#L60 at all.

And to yours optimization: Do you know it will have significant impact? Also this approach can be very simply tricked:

FORMAT: 1A
# swagger

My final proposal is to add simple optimizations for each type e.g.:

API Blueprint As we know format must be at the beginning.

if (source.substring(0, SAMPLE_LENGTH).match(API_BLUEPRINT_HEADER)) {
// ...
}

For swagger in yaml we can use sample as well, because we know that most of the swagger documents have it specified at the beginning. So in most cases it will speed up.

With this way we avoid running all tests twice and also will have targeted optimization for each type.

@JackuB Did you run into problem with speed or is it just pre optimization?

JackuB commented 8 years ago

@freaz yep, only on really big blueprints (+10k LOC). Typing can get laggy, but I didn't test how much is it affected by ACE itself etc.

It's not big issue right now - works fine for most blueprints. Anyway, I have idea for PR, will do it once I'll have some time.

honzajavorek commented 8 years ago

To you concern about wrong identification. I think we shouldn't do this https://github.com/apiaryio/deckardcain/blob/master/src/deckardcain.js#L60 at all.

This https://github.com/apiaryio/deckardcain/blob/master/src/deckardcain.js#L55 carries the same vulnerability and it's the core of Apiary Blueprint identification, so my concern is valid even if we do not check for + Response.

Do you know it will have significant impact?

I did no measurements. My intention was to offload "expensive" regexp matching only to input indicating the regexp has some probability to match anything, because this way we can optimize and at the same time we do not put us into the danger of wrong identification. Shortening the input based on wishful thinking that the swagger keyword is usually at the top so it will be always at the top is just wrong, IMHO.

Also this approach can be very simply tricked

My approach would take this input as:

Deckard would use .indexOf with swagger. Result: It is present, thus it's worth to try RegExp. The .indexOf provides position, so we can possibly guess how much we can cut off without losing the interesting part.
Deckard would cut off the input as much as it can (the length being based on the position given by previous step) and use RegExp to search for Swagger JSON/YAML. Result: It's not valid Swagger.
Deckard would continue with other heuristics, possibly cutting off most of the input as other RegExps work just with the beginning of the input. Result: The input would be correctly identified.

My final proposal is to add simple optimizations for each type

I agree, this is reasonable. I do not agree we can use simple fixed-length sample for either Swagger JSON or YAML.

honzajavorek commented 8 years ago

My algorithm breaks into pieces, because when it comes to whether indexOf is faster and simpler than RegExp, the answer is: not necessarily... :neutral_face:

http://jsperf.com/deckard-indexof-vs-regexp

But still, I think searching for just swagger at first (either by indexOf or by match) could be still faster than checking the full RegExp, but the question is whether it's worth it.

apiaryio / deckardcain

Shorten string before applying regex #20