docopt / docopt.cpp

C++11 port of docopt
Boost Software License 1.0
1.05k stars 146 forks source link

Regex expressions cause stack overflows in MSVC 2015 #49

Closed mfrischknecht closed 4 years ago

mfrischknecht commented 8 years ago

First of all, let me thank you for this awesome library!

The standard <regex> implementation for Visual C++ seems to have a rather severe design issue that seems to cause stack overflows for complicated expressions (I suspect the lookaheads you use, but I'm not really sure about that). A couple of google searches revealed that this seems to be a known issue. Some people seem to have been able to avoid the problem by increasing their stack size, but that didn't help in my case (I've added an example program below that reproduces the problem; the nested method calls in std::_Matcher are really quite insane).

I've already written a patch for parse_section that resolves my original problem, but I still get a similar error in parse_defaults with the demo program below. Also, I unfortunately don't have any git-tools at my disposal right now, so I can't start a pull request for this. For now, I'll post my version of parse_section in the comments and see if I can come up with a similar "fix" for the other expressions.

Here's my test program to reproduce the problem:

#include <docopt.h>
#include <iostream>
#include <string>

void printArguments(const std::map<std::string, docopt::value> &args)
{
    std::cout << "Arguments:\n\n";
    for (auto &entry : args)
    {
        const std::string &name = entry.first;
        const docopt::value &value = entry.second;

        std::cout << name << ": \n";

        if (value.isString())
            std::cout << "       (string)   " << value.asString() << "\n\n";
        else if (value.isLong())
            std::cout << "       (long)     " << value.asLong() << "\n\n";
        else if (value.isBool())
            std::cout << "       (bool)     " << value.asBool() << "\n\n";
        else if (value.isStringList())
        {
            const auto &list = value.asStringList();
            std::cout << "       (str_list) [\n";

            for (auto &str : list)
                std::cout << "                        " << str << ",\n";
            std::cout << "                  ]\n\n";
        }
        else if (!value) std::cout << "       (empty)\n\n";
        else std::cout << "       (unknown)\n\n";

    }
    std::cout << std::endl;
}

const auto USAGE =
R"(
Usage:
    foo [options] <ARG1> <ARG2> [--option1=<OPTION1>] [--option2=<OPTION2>]
    foo --command1 <ARG3> <ARG4> <ARG5> <ARG6>  [--option3=<OPTION3> [--option4=<OPTION4>]] [--option5=<OPTION5>]
    foo --command2 <ARG4>
    foo (-v | --version)
    foo (-h | --help)

Options:
    -o <OPTION6>, --option6=<OPTION6>      Some rather long description of option #6 which makes this line longer than it really should be...
)";

void main(int argc, const char** argv)
{
    try
    {
        auto arguments = docopt::docopt(USAGE, { argv + 1,argv + argc }, true, "1.0.0.0", true);
        printArguments(arguments);
    }
    catch (std::exception &e)
    {
        std::cerr << "Encountered exception of type "
            << typeid(e).name() << ": "
            << e.what();
    }
    std::cin.ignore();
}
mfrischknecht commented 8 years ago

This is the aforementioned patch:

static std::vector<std::string> parse_section(std::string const& name, std::string const& source) {
#ifndef _MSC_VER
    // ECMAScript regex only has "?=" for a non-matching lookahead. In order to make sure we always have
    // a newline to anchor our matching, we have to avoid matching the final newline of each grouping.
    // Therefore, our regex is adjusted from the docopt Python one to use ?= to match the newlines before
    // the following lines, rather than after.
    std::regex const re_section_pattern{
        "(?:^|\\n)"  // anchored at a linebreak (or start of string)
        "("
           "[^\\n]*" + name + "[^\\n]*(?=\\n?)" // a line that contains the name
           "(?:\\n[ \\t].*?(?=\\n|$))*"         // followed by any number of lines that are indented
        ")",
        std::regex::icase
    };

    std::vector<std::string> ret;
    std::for_each(std::sregex_iterator(source.begin(), source.end(), re_section_pattern),
        std::sregex_iterator(),
        [&](std::smatch const& match)
    {
        ret.push_back(trim(match[1].str()));
    });

    return ret;
#else
    //Parse the sections in a more old-fashioned way in order to avoid stack overflows in Microsoft's regex implementation
    std::regex const re_section_start_pattern{ "(?:^|\\n)[^\\n]*" + name, std::regex::icase};
    std::sregex_iterator const re_end;

    std::vector<std::string> ret;
    auto const source_end = source.end();
    std::sregex_iterator section_head(source.begin(), source_end, re_section_start_pattern);
    for (; section_head != re_end; ++section_head)
    {
        //We found a section header
        auto section_start = source.begin() + section_head->position();
        if (*section_start == '\n') ++section_start; //If it is positioned on a newline, move ahead one char

        //Add the complete header line to the section
        auto section_end = std::find(section_start, source_end, '\n');
        if (section_end != source_end) ++section_end; //skip '\n'

        while (section_end != source_end)
        { //Add subsequent lines to the section if they're indented
            if (*section_end != '\t' && *section_end != ' ') break; //Unindented lines end the section
            section_end = std::find(section_end, source_end, '\n');
            if (section_end != source_end) ++section_end; //skip '\n'
        }
        ret.push_back(trim({section_start,section_end}));
    }
    return ret;
#endif
}

static std::vector<Option> parse_defaults(std::string const& doc) {
#ifndef _MSC_VER
    // This pattern is a bit more complex than the python docopt one due to lack of
    // re.split. Effectively, it grabs any line with leading whitespace and then a
    // hyphen; it stops grabbing when it hits another line that also looks like that.
    static std::regex const pattern {
        "(?:^|\\n)[ \\t]*"  // a new line with leading whitespace
        "(-(.|\\n)*?)"      // a hyphen, and then grab everything it can...
        "(?=\\n[ \\t]*-|$)" //  .. until it hits another new line with space and a hyphen
    };

    std::vector<Option> defaults;

    for(auto s : parse_section("options:", doc)) {
        s.erase(s.begin(), s.begin()+static_cast<std::ptrdiff_t>(s.find(':'))+1); // get rid of "options:"

        std::for_each(std::sregex_iterator{ s.begin(), s.end(), pattern },
                  std::sregex_iterator{},
                  [&](std::smatch const& m)
        {
            std::string opt = m[1].str();

            if (starts_with(opt, "-")) {
                defaults.emplace_back(Option::parse(opt));
            }
        });
    }

    return defaults;
#else
    static std::regex const pattern {"(?:^|\\n)[\\t ]*-"};
    std::vector<Option> defaults;

    for (auto s : parse_section("options:", doc)) 
    {
        s.erase(s.begin(), s.begin() + static_cast<std::ptrdiff_t>(s.find(':')) + 1); // get rid of "options:"

        std::sregex_iterator option_line(s.begin(), s.end(), pattern);
        std::sregex_iterator re_end;
        if (option_line == re_end) continue;

        auto option_begin = std::find(s.begin() + option_line->position(), s.end(), '-');
        ++option_line;

        auto option_end = s.end();
        for (; option_line != re_end; ++option_line)
        {
            option_end = s.begin() + option_line->position();
            defaults.emplace_back(Option::parse({option_begin,option_end}));
            option_begin = std::find(option_end, s.end(), '-');
        }
        defaults.emplace_back(Option::parse({option_begin,s.end()}));
    }
    return defaults;
#endif
}
jaredgrubb commented 8 years ago

Can you file a bug on Microsoft about this? The one you link is marked "behaves correctly", and the regex in docopt is not all that odd. I've seen far worse and the fact that their compiler has trouble with one like this seems like a serious defect.

I'm ok with your patch in theory, but it kinda scares me to have to add #ifdef for every variant of compiler when the original code is perfectly fine and standards-conforming. (There's a bug in GCC 4.8 as well that I've resisted fixing for the same reason)

BurntSushi commented 8 years ago

I've seen far worse and the fact that their compiler has trouble with one like this seems like a serious defect.

Your regex has nested quantifiers, which is a typically one of the things that gives backtracking regex engines problems. It's not really a defect in the regex engine, but rather an intrinsic property of the regex engine you chose. You might consider experimenting with possessive quantifiers to limit backtracking or figure out a way to rewrite the regex without nested repetitions (if that is indeed the problem here).

mfrischknecht commented 8 years ago

Here's the bug report for VC++: https://connect.microsoft.com/VisualStudio/feedback/details/2774916

That said, I would suspect the answer to be the same as for the older bug report (for the reasons @BurntSushi noted). I'm no expert in regex implementations, but I can imagine the according automaton for the parse_section function in particular to be rather complicated.

BurntSushi commented 8 years ago

@mfrischknecht To be clear, it's not about complexity. A backtracking regex implementation probably doesn't use automata. The issue is its exponential worst case time complexity. For example, matching something as simple on (a*)*b on aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa will take a very long time. (At least, Python's regex implementation does. I haven't tried C++'s on that one.)

jaredgrubb commented 8 years ago

Let me start by saying I'm not claiming these regex's are ideal. I tried to write the simplest ones I could that would (1) pass the official docopt test suite and (2) are easy-ish to read/maintain. If someone wants to optimize them or improve them, and they still meet those two criteria, I'll happily take those patches.

But, come on, can you imagine if your text editor (or grep or Python or Perl) hitting a stack overflow if you type the wrong 50-character regex, and then the bug report came back as "you should optimize your regex"? There are 50 years of regex engines that can handle this and far worse and this library is working fine with libc++, libstdc++, and boost (except GCC 4.8's libstdc++).

As a side note: someone recently added a patch to docopt.cpp to allow you to use boost::regex over your stock STL implementation. boost::regex is a very mature implementation and will probably give you far less trouble (the fact that it fails here means you could be just hitting the tip of the iceburg to issues).

jaredgrubb commented 8 years ago

To move forward: @mfrischknecht would you mind adjusting your patch to also select this implementation due to some "#if DOCTOPT_USE_WHATEVER" define. Then modify the Travis test case to also run the unit tests under this variant. I want to make sure (1) it passes all test cases and (2) that it will get exercised regularly in the future. (Travis only supports GCC and clang, but thankfully your patch should compile under both of those)

BurntSushi commented 8 years ago

But, come on, can you imagine if your text editor (or grep or Python or Perl) hitting a stack overflow if you type the wrong 50-character regex, and then the bug report came back as "you should optimize your regex"?

It's either that, or the regex could take years to terminate. (GNU grep specifically doesn't have this problem because it uses finite automata.) Optimizing regexes is pretty standard for backtracking engines. An entire book was written about it!

To move forward

What about just removing the regex impl that triggers the bug and only using the "traditional" approach? That way, you don't need to maintain two distinct routines.

mfrischknecht commented 8 years ago

@BurntSushi: As for your (a*)*b example: In my tests on MSVC, it also triggered a stack overflow.

But, come on, can you imagine if your text editor (or grep or Python or Perl) hitting a stack overflow if you type the wrong 50-character regex, and then the bug report came back as "you should optimize your regex"?

@jaredgrubb: I see your point there :-). I'm also not happy with the issue; I just would find it unfortunate if the whole library (which, again, is immensely useful!) couldn't be used with a major compiler because of two regex-expressions. And I would guess that introducing a new regex implementation for the standard library won't happen unless it's in a major update...

I've played with the idea of writing the two functions with iterator-based code and factor out some common patterns. Of course, this results in somewhat more complicated code, but I'd argue it is still understandable. What do you think? (I still have to make sure all unit tests pass, but currently, I've got no python implementation available that I could just fire up).

//Some common code for splitting and analyzing lines / indentation
class Line
{
public:
    typedef std::string::const_iterator iterator;

    Line(const std::string &str) :
        Line(str.begin(),str.end())
    {}
    Line(iterator begin, iterator end) :
        _begin(begin),
        _end(end)
    {}

    const iterator &begin() const { return _begin; }
    const iterator &end()   const { return _end; }

    iterator skip_indentation() const {
        auto is_intendation = [](char c) { return c == ' ' || c == '\t'; };
        return std::find_if_not(begin(), end(), is_intendation);
    }

    bool is_indented() const { return begin() != skip_indentation(); }
    bool starts_with(const std::string &str) const {
        auto start = skip_indentation();
        return start == std::search(start, end(), str.begin(), str.end());
    }
    bool starts_with(char c) const {
        auto start = skip_indentation();
        return start != end() && *start == c;
    }

private:
    iterator _begin;
    iterator _end;
};

std::vector<Line> split_lines(std::string::const_iterator i, std::string::const_iterator end) {
    std::vector<Line> lines;
    while (i != end) {
        auto line_end = std::find(i, end, '\n');
        if (line_end != end) ++line_end; //Skip past the '\n'
        lines.emplace_back(i, line_end);
        i = line_end;
    }
    return lines;
}
std::vector<Line> split_lines(const std::string &str) { return split_lines(str.begin(), str.end()); }

#include <cctype>

/*
...
*/

static std::vector<std::string> parse_section(std::string name, std::string const& source) {
    static const auto tolower = [](char c) { return std::tolower(c); };
    static const auto is_indented = [](const Line &l) { return l.is_indented(); };

    std::transform(name.begin(), name.end(), name.begin(), tolower); //Transform `name` to lower case
    auto is_header_line = [&](const Line &line) { //Perform a case-insensitive search for `name`
        static const auto matches_name = [&](char a, char b) { return tolower(a) == b; };
        return line.end() != std::search(line.begin(), line.end(), name.begin(), name.end(), matches_name);
    };

    std::vector<std::string> sections;
    auto lines = split_lines(source);
    auto line = std::find_if(lines.begin(), lines.end(), is_header_line);
    while (line != lines.end()) {
        auto section_start = line;
        line = std::find_if_not(line+1, lines.end(), is_indented); //Find the section end
        sections.push_back(trim({section_start->begin(),(line - 1)->end()}));
        line = std::find_if(line, lines.end(), is_header_line); //Find the next section
    }
    return sections;
}

static std::vector<Option> parse_defaults(std::string const& doc) {
    static const auto is_option_line = [](const Line &l) { return l.starts_with('-'); };

    std::vector<Option> options;
    for (auto &section : parse_section("options:",doc)) {
        //Get rid of "options:"
        auto colon = std::find(section.begin(), section.end(), ':');
        assert(colon != section.end());

        auto lines = split_lines(colon+1,section.end());
        auto line = std::find_if(lines.begin(), lines.end(), is_option_line);
        while(line != lines.end()) { 
            auto option_start = line;
            line = std::find_if(line+1, lines.end(), is_option_line);
            options.push_back(Option::parse({option_start->begin(),(line-1)->end()}));
        }
    }
    return options;
}
mfrischknecht commented 8 years ago

I received an answer on the bug posted above:

If you're looking at a stack trace, you may see _Match_pat many times; that function is used for backtracking. Basically, anywhere the engine may need to backtrack to results in an additional stack frame to store the backtrack position. Note that regexes of the form R"(.example.)" may require the engine to store backtrack state linear in the size of the input.

We are aware that our engine isn't the greatest in a number of places; using the machine stack for backtracking is one of those; but fixing that is not a "bugfix" and essentially requires a rewrite of the regex engine.

As a workaround, consider using Boost.Regex instead.

Also, I tested the code above on my private machine and did get all test cases to pass (it did take some minor changes though). W.r.t #42: The only other places where std::regex is used seem to be Option::parse and Tokens::from_pattern. I could try to also implement them in a similar way and wrap the changes in something like #if DOCOPT_NO_REGEX, which would aid both issues and avoid additional dependencies on boost...

Thomas-Haase commented 8 years ago

I' m using Microsoft Visual Studio 2015 Update 2 (14.0.25123.00 Update 2), and I see also the stack overflow problem, currently only in debug build and only when solution platform is 64 bit. The release builds and debug 32 bit build are working fine in my use case. I have the impression this issue in debug build is heavily related to the Microsoft iterator debugging feature. This special debugging can be easily turned off by setting the preprocessor define _HAS_ITERATOR_DEBUGGING=0; If doing so, the stack overflow disappears in my case. Can anybody confirm this behavior?

bobert130 commented 8 years ago

I can confirm that using the define to bypass the iterator debugging does avoid the problem in debug builds. THANKS!

define _HAS_ITERATOR_DEBUGGING 0

Solved this problem for me.

jaredgrubb commented 8 years ago

Are you propsing to #define that in the .cpp file? Or in the CMake file or? I'm open to that; I dont think disabling "iterator debugging" would bother anyone as long as it's local.

Thomas-Haase commented 8 years ago

I understand using this define as temporary workaround, not as solution of this issue. In the meantime, I have added some more options to my USAGE string, and the problem did come back. Again I solved the issue with a temporary workaround by using shorter text to explain my options to let it work again. But this approach obviously has no future, I still need more options. So I'm not proposing to add this define to one of your files (in fact I have added it in my *.vcxproj file). A real solution for this issue would still be welcome, it's really a beautiful command line interface, but unfortunately this issue makes it unusable on an important platform. My application is portable, and compiled with MS VS 2015 for Windows, and with gcc 5.4.0 on Ubuntu 16.04 LTS and with gcc 5.4.0 on AIX 6.1. With both gcc builds I have never seen this stack overflow so far.

jaredgrubb commented 8 years ago

Personally, I find Microsoft's closing of their bug a bit disturbing. Yes, it's always possible to invent a regex that will blow up an implementation if you want to craft one. But, the fact that it happens on a short non-purposely-crafted regex run against a short usage string is really a marker of a really bad implementation. I'm not windows-bashing here; I intentionally use boost::regex over libc++'s regex in one project because it reduced my tool's run time by half.

Anyway, I'm open to doing something. mfrischknecht's patch on June 1 seems like a reasonable workaround -- as long as it passes the tests. If someone wants to make a Pull Request and verify that the tests all pass, then let's go for it. (Please remove the #if MSVC though; if the impl is good enough for MSVC then we should use it everywhere. No point maintaining two versions without a good reason.)

Does that sound ok? Or does someone want to propose something else?

mfrischknecht commented 8 years ago

I've got another option for you, but you'll have to judge if the code is worth maintaining: After I opened the issue, I hacked for a couple of hours on a regex-free implementation (mostly to see if it's feasible at all in a readable way). I didn't finish it completely, though; I've still got to sort out 6 tests (the handling of default values, to be specific).

The prototype basically relies on a "poor man's string_view" that I defined in docopt_util.h. If you have a look at this diff, you'll see what I came up with so far. The code is a bit iterator-heavy, so it might not be everyone's cup of tea.

Do you think this is a usable alternative? I realize it makes the according functions less expressive, but on the plus side it would remove the dependency on regex-Libraries as a whole. If there's any interest, I could finish the work on my fork and issue a PR.

jaredgrubb commented 8 years ago

This library already requires C++11 or later, which implies that you have so I don't see that removing the dependency is the right reason to do it. If the code is easier to maintain, then I'm open to it.

jdukat commented 7 years ago

@mfrischknecht I vote you finish it. Seems like many compilers have problems with regexps and Microsoft doesn't see the problem with their implementation (submitted defect closed). You may just post a fork of this neat lib with regex-free implementation. I would use it.

jaredgrubb commented 7 years ago

MS does acknowledge the problem in their implementation, but they're unwilling to fix it. No other regex implementation has this issue (none reported anyway). I'm open to a patch to workaround the issue. My requirement for a patch is that (1) it passes all the tests and (2) we use the same approach for all platforms so we only maintain one code path.

jdukat commented 7 years ago

maybe they don't know how? ;-)

grasmanek94 commented 7 years ago

The bug occurs in VS2017 too.

Alternatively, I defined DOCTOPT_USE_BOOST_REGEX as ( _DEBUG && _WIN64 && _HAS_ITERATOR_DEBUGGING != 0 )

Works for me.

giumas commented 7 years ago

It happens to me also in Release mode, not only in Debug. The compiler is VS2015 Update 3.

isometriq commented 7 years ago

I have a weird issue with crashes on regex (without boost)... I have about 18 options in my USAGE. The main synopsis is very simple as it uses the [options] token.

As soon has I have over specifically more than 14 options (individual lines), i get a crash. If I remove any options down 14 and less, then it works.

I tried different in case it was a syntax error, but most of them are boolean flags.

Could that problem be related to the one discussed here? Is there some kind of hard-limit related to regex, or maybe the R"(...)"; syntax ?

jdukat commented 7 years ago

Very likely related. I've seen it work/not work as well depending on number of rules. Definitely recommend moving to boost regex, both for stability and performance.

isometriq commented 7 years ago

yeah i just saw your closed issue, pretty similar.. well, i guess it's boost-time

jaredgrubb commented 4 years ago

I think moving to boost::regex is the right way to fix this. I'm going to close this as "fixed" for that reason.