Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.91k stars 542 forks source link

Possible regexp memory explosion in 5.20.0 #13984

Closed p5pRT closed 9 years ago

p5pRT commented 10 years ago

Migrated from rt.perl.org#122283 (status was 'resolved')

Searchable as RT122283$

p5pRT commented 10 years ago

From @hvds

Created by @hvds

I've been experimenting with an attempt to take a SQL grammar expressed in BNF and convert it (programmatically) into something that can parse SQL with it as a Regexp​::Grammars (v1.035) grammar.

The code below is (60%) cut down from an interim stage in that process; this reaches about 10MB process size under perl-5.16.3; under perl-5.20.0 it grows to over 1GB. Cutting down the grammar rule by rule does gradually reduce the memory use\, but it remains a high multiple of the memory use under perl-5.16.3\, and I've not yet found any smoking gun; I've included the full 200-odd lines here rather than risk eliding something important.

Damain and I are looking into it\, but he suggested I perlbug it as a heads-up of a possible problem in 5.20\, likely of interest to davem as potentially relating to regexp engine changes.

zen% ulimit -v # I've set a 1GB process-size limit 1000000 zen% /usr/bin/time /opt/perl-5.16.3/bin/perl ./t0 # top(1) shows peak 10MB VIRT ok 8.52user 0.01system 0​:08.54elapsed 99%CPU (0avgtext+0avgdata 34816maxresident)k 0inputs+0outputs (0major+2331minor)pagefaults 0swaps zen% /usr/bin/time /opt/perl-5.20.0/bin/perl ./t0 Out of memory! Command exited with non-zero status 1 41.59user 2.10system 0​:43.83elapsed 99%CPU (0avgtext+0avgdata 3641344maxresident)k 0inputs+0outputs (0major+228082minor)pagefaults 0swaps zen% cat t0 #!/opt/perl-5.20.0/bin/perl use strict; use warnings; use Regexp​::Grammars;

my $g = qr{ ^ \<query_specification> $

\<rule​: simple_Latin_letter> \<simple_Latin_upper_case_letter> | \<simple_Latin_lower_case_letter> \<token​: simple_Latin_upper_case_letter> A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z \<token​: simple_Latin_lower_case_letter> a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z \<token​: digit> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 \<token​: double_quote> \" \<token​: left_paren> \( \<token​: right_paren> \) \<token​: asterisk> \* \<token​: plus_sign> \+ \<token​: comma> \\, \<token​: minus_sign> \- \<token​: period> \. \<token​: solidus> \/ \<token​: less_than_operator> \\< \<token​: equals_operator> \= \<token​: greater_than_operator> \> \<token​: question_mark> \? \<token​: underscore> _ \<rule​: regular_identifier> \<identifier_body> \<rule​: identifier_body> \<identifier_start> (?​:\ | \<identifier_part>)* \<token​: identifier_start> \w \<rule​: identifier_part> \<identifier_start> | \ \<rule​: unsigned_integer> [0-9]+ \<rule​: sign> \<plus_sign> | \<minus_sign> \<rule​: introducer> \ \<rule​: character_set_specification> \<standard_character_repertoire_name> | \<implementation_dash_defined_character_repertoire_name> | \<user_dash_defined_character_repertoire_name> | \<standard_universal_character_form_dash_of_dash_use_name> | \<implementation_dash_defined_universal_character_form_dash_of_dash_use_name> \<rule​: standard_character_repertoire_name> \<character_set_name> \<rule​: character_set_name> (?​:\<schema_name> \)? \<SQL_language_identifier> \<rule​: schema_name> (?​:\<catalog_name> \)? \<unqualified_schema_name> \<rule​: catalog_name> \ \<rule​: identifier> (?​:\ \<character_set_specification>)? \<actual_identifier> \<rule​: actual_identifier> \<regular_identifier> \<token​: nondoublequote_character> [^"] \<rule​: doublequote_symbol> \<double_quote> \<double_quote> \<rule​: unqualified_schema_name> \ \<rule​: SQL_language_identifier> \<SQL_language_identifier_start> (?​:\ | \<SQL_language_identifier_part>)* \<rule​: SQL_language_identifier_start> \<simple_Latin_letter> \<rule​: SQL_language_identifier_part> \<simple_Latin_letter> | \ \<rule​: implementation_dash_defined_character_repertoire_name> \<character_set_name> \<rule​: user_dash_defined_character_repertoire_name> \<character_set_name> \<rule​: standard_universal_character_form_dash_of_dash_use_name> \<character_set_name> \<rule​: implementation_dash_defined_universal_character_form_dash_of_dash_use_name> \<character_set_name> \<token​: not_equals_operator> \\<\> \<token​: greater_than_or_equals_operator> \>\= \<token​: less_than_or_equals_operator> \\<\= \<token​: concatenation_operator> \|\| \<rule​: qualified_local_table_name> MODULE \ \<local_table_name> \<rule​: local_table_name> \<qualified_identifier> \<rule​: qualified_identifier> \ \<rule​: column_name> \ \<rule​: time_precision> \<time_fractional_seconds_precision> \<rule​: time_fractional_seconds_precision> \<unsigned_integer> \<rule​: timestamp_precision> \<time_fractional_seconds_precision> \<rule​: interval_qualifier> \<start_field> TO \<end_field> | \<single_datetime_field> \<rule​: start_field> \<non_dash_second_datetime_field> (?​:\<left_paren> \<interval_leading_field_precision> \<right_paren>)? \<token​: non_dash_second_datetime_field> YEAR | MONTH | DAY | HOUR | MINUTE \<rule​: interval_leading_field_precision> \<unsigned_integer> \<rule​: end_field> \<non_dash_second_datetime_field> | SECOND (?​:\<left_paren> \<interval_fractional_seconds_precision> \<right_paren>)? \<rule​: interval_fractional_seconds_precision> \<unsigned_integer> \<rule​: single_datetime_field> \<non_dash_second_datetime_field> (?​:\<left_paren> \<interval_leading_field_precision> \<right_paren>)? | SECOND (?​:\<left_paren> \<interval_leading_field_precision> (?​:\ \<left_paren> \<interval_fractional_seconds_precision>)? \<right_paren>)? \<rule​: qualified_name> (?​:\<schema_name> \)? \<qualified_identifier> \<rule​: datetime_value_function> \<current_date_value_function> | \<current_time_value_function> | \<current_timestamp_value_function> \<token​: current_date_value_function> CURRENT_DATE \<rule​: current_time_value_function> CURRENT_TIME (?​:\<left_paren> \<time_precision> \<right_paren>)? \<rule​: current_timestamp_value_function> CURRENT_TIMESTAMP (?​:\<left_paren> \<timestamp_precision> \<right_paren>)? \<rule​: table_name> \<qualified_name> | \<qualified_local_table_name> \<rule​: column_name_list> \<column_name> (?​:\ \<column_name>)* \<rule​: search_condition> \<in_predicate> \<rule​: boolean_term> \<boolean_factor> (?​:AND \<boolean_factor>)* \<rule​: boolean_factor> (?​:NOT)? \<boolean_test> \<rule​: boolean_test> \<boolean_primary> (?​:IS (?​:NOT)? \<truth_value>)? \<rule​: boolean_primary> \ | \<left_paren> \<search_condition> \<right_paren> \<rule​: predicate> \<comparison_predicate> | \<between_predicate> | \<in_predicate> | \<like_predicate> | \<null_predicate> | \<quantified_comparison_predicate> | \<exists_predicate> | \<match_predicate> | \<overlaps_predicate> \<rule​: comparison_predicate> \<row_value_constructor> \<comp_op> \<row_value_constructor> \<rule​: row_value_constructor> \<row_value_constructor_element> | \<left_paren> \<row_value_constructor_list> \<right_paren> | \<row_subquery> \<rule​: row_value_constructor_element> \<value_expression> | \<null_specification> | \<default_specification> \<rule​: value_expression> \<numeric_value_expression> | \<string_value_expression> | \<datetime_value_expression> | \<interval_value_expression> \<rule​: numeric_value_expression> \ (?​:\<plus_sign> \ | \<minus_sign> \)* \<rule​: term> \ (?​:\ \ | \ \)* \<rule​: factor> \? \<numeric_primary> \<rule​: numeric_primary> \<value_expression_primary> | \<numeric_value_function> \<rule​: value_expression_primary> \<question_mark> | \<column_reference> \<rule​: column_reference> (?​:\ \)? \<column_name> \<rule​: qualifier> \<table_name> | \<correlation_name> \<rule​: correlation_name> \ \<token​: set_quantifier> DISTINCT | ALL \<rule​: subquery> \<left_paren> \<query_expression> \<right_paren> \<rule​: query_expression> \<non_dash_join_query_expression> | \<joined_table> \<rule​: non_dash_join_query_expression> (?​:\<non_dash_join_query_term> | \<joined_table> UNION (?​:ALL)? \<corresponding_spec>? \<query_term> | \<joined_table> EXCEPT (?​:ALL)? \<corresponding_spec>? \<query_term>) (?​:UNION (?​:ALL)? \<corresponding_spec>? \<query_term> | EXCEPT (?​:ALL)? \<corresponding_spec>? \<query_term>)* \<rule​: non_dash_join_query_term> (?​:\<non_dash_join_query_primary> | \<joined_table> INTERSECT (?​:ALL)? \<corresponding_spec>? \<query_primary>) (?​:INTERSECT (?​:ALL)? \<corresponding_spec>? \<query_primary>)* \<rule​: non_dash_join_query_primary> \<simple_table> | \<left_paren> \<non_dash_join_query_expression> \<right_paren> \<rule​: simple_table> \<query_specification> | \<table_value_constructor> | \<explicit_table> \<rule​: query_specification> SELECT \<set_quantifier>? \<select_list> \<table_expression> \<rule​: select_list> \ | \<select_sublist> (?​:\ \<select_sublist>)* \<rule​: select_sublist> \<derived_column> | \ \ \ \<rule​: derived_column> \<value_expression> \<as_clause>? \<rule​: as_clause> (?​:AS)? \<column_name> \<rule​: table_expression> \<from_clause> \<where_clause>? \<rule​: from_clause> FROM \<table_reference> (?​:\ \<table_reference>)* \<rule​: table_reference> \<table_name> \<rule​: table_subquery> \ \<rule​: joined_table> \<cross_join> | \<qualified_join> | \<left_paren> \<joined_table> \<right_paren> \<rule​: cross_join> \<table_reference> CROSS JOIN \<table_reference> \<rule​: qualified_join> \<table_reference> (?​:NATURAL)? \<join_type>? JOIN \<table_reference> \<join_specification>? \<rule​: join_type> INNER | \<outer_join_type> (?​:OUTER)? | UNION \<token​: outer_join_type> LEFT | RIGHT | FULL \<rule​: join_specification> \<join_condition> | \<named_columns_join> \<rule​: join_condition> ON \<search_condition> \<rule​: named_columns_join> USING \<left_paren> \<join_column_list> \<right_paren> \<rule​: join_column_list> \<column_name_list> \<rule​: where_clause> WHERE \<search_condition> \<rule​: collate_clause> COLLATE \<collation_name> \<rule​: collation_name> \<qualified_name> \<rule​: table_value_constructor> VALUES \<table_value_constructor_list> \<rule​: table_value_constructor_list> \<row_value_constructor> (?​:\ \<row_value_constructor>)* \<rule​: explicit_table> TABLE \<table_name> \<rule​: query_term> \<non_dash_join_query_term> | \<joined_table> \<rule​: corresponding_spec> CORRESPONDING (?​:BY \<left_paren> \<corresponding_column_list> \<right_paren>)? \<rule​: corresponding_column_list> \<column_name_list> \<rule​: query_primary> \<non_dash_join_query_primary> | \<joined_table> \<rule​: numeric_value_function> \<position_expression> | \<extract_expression> | \<length_expression> \<rule​: position_expression> POSITION \<left_paren> \<character_value_expression> IN \<character_value_expression> \<right_paren> \<rule​: character_value_expression> \<character_factor> (?​:\<concatenation_operator> \<character_factor>)* \<rule​: concatenation> \<character_value_expression> \<concatenation_operator> \<character_factor> \<rule​: character_factor> \<character_primary> \<collate_clause>? \<rule​: character_primary> \<value_expression_primary> | \<string_value_function> \<rule​: string_value_function> \<character_value_function> | \<bit_value_function> \<rule​: character_value_function> \<character_substring_function> | \ | \<form_dash_of_dash_use_conversion> | \<character_translation> | \<trim_function> \<rule​: character_substring_function> SUBSTRING \<left_paren> \<character_value_expression> FROM \<start_position> (?​:FOR \<string_length>)? \<right_paren> \<rule​: start_position> \<numeric_value_expression> \<rule​: string_length> \<numeric_value_expression> \<rule​: fold> (?​:UPPER | LOWER) \<left_paren> \<character_value_expression> \<right_paren> \<rule​: form_dash_of_dash_use_conversion> CONVERT \<left_paren> \<character_value_expression> USING \<form_dash_of_dash_use_conversion_name> \<right_paren> \<rule​: form_dash_of_dash_use_conversion_name> \<qualified_name> \<rule​: character_translation> TRANSLATE \<left_paren> \<character_value_expression> USING \<translation_name> \<right_paren> \<rule​: translation_name> \<qualified_name> \<rule​: trim_function> TRIM \<left_paren> \<trim_operands> \<right_paren> \<rule​: trim_operands> (?​:\<trim_specification>? \<trim_character>? FROM)? \<trim_source> \<token​: trim_specification> LEADING | TRAILING | BOTH \<rule​: trim_character> \<character_value_expression> \<rule​: trim_source> \<character_value_expression> \<rule​: bit_value_function> \<bit_substring_function> \<rule​: bit_substring_function> SUBSTRING \<left_paren> \<bit_value_expression> FROM \<start_position> (?​:FOR \<string_length>)? \<right_paren> \<rule​: bit_value_expression> \<bit_factor> (?​:\<concatenation_operator> \<bit_factor>)* \<rule​: bit_concatenation> \<bit_value_expression> \<concatenation_operator> \<bit_factor> \<rule​: bit_factor> \<bit_primary> \<rule​: bit_primary> \<value_expression_primary> | \<string_value_function> \<rule​: extract_expression> EXTRACT \<left_paren> \<extract_field> FROM \<extract_source> \<right_paren> \<rule​: extract_field> \<datetime_field> | \<time_zone_field> \<rule​: datetime_field> \<non_dash_second_datetime_field> | SECOND \<token​: time_zone_field> TIMEZONE_HOUR | TIMEZONE_MINUTE \<rule​: extract_source> \<datetime_value_expression> | \<interval_value_expression> \<rule​: datetime_value_expression> (?​:\<datetime_term> | \<interval_value_expression> \<plus_sign> \<datetime_term>) (?​:\<plus_sign> \<interval_term> | \<minus_sign> \<interval_term>)* \<rule​: interval_term> (?​:\<interval_factor> | \ \ \<interval_factor>) (?​:\ \ | \ \)* \<rule​: interval_factor> \? \<interval_primary> \<rule​: interval_primary> \<value_expression_primary> \<interval_qualifier>? \<rule​: interval_term_2> \<interval_term> \<rule​: interval_value_expression> (?​:\<interval_term> | \<left_paren> \<datetime_value_expression> \<minus_sign> \<datetime_term> \<right_paren> \<interval_qualifier>) (?​:\<plus_sign> \<interval_term_1> | \<minus_sign> \<interval_term_1>)* \<rule​: interval_value_expression_1> \<interval_value_expression> \<rule​: interval_term_1> \<interval_term> \<rule​: datetime_term> \<datetime_factor> \<rule​: datetime_factor> \<datetime_primary> \<time_zone>? \<rule​: datetime_primary> \<value_expression_primary> | \<datetime_value_function> \<rule​: time_zone> AT \<time_zone_specifier> \<rule​: time_zone_specifier> LOCAL | TIME ZONE \<interval_value_expression> \<rule​: length_expression> \<char_length_expression> | \<octet_length_expression> | \<bit_length_expression> \<rule​: char_length_expression> (?​:CHAR_LENGTH | CHARACTER_LENGTH) \<left_paren> \<string_value_expression> \<right_paren> \<rule​: string_value_expression> \<character_value_expression> | \<bit_value_expression> \<rule​: octet_length_expression> OCTET_LENGTH \<left_paren> \<string_value_expression> \<right_paren> \<rule​: bit_length_expression> BIT_LENGTH \<left_paren> \<string_value_expression> \<right_paren> \<token​: null_specification> NULL \<token​: default_specification> DEFAULT \<rule​: row_value_constructor_list> \<row_value_constructor_element> (?​:\ \<row_value_constructor_element>)* \<rule​: row_subquery> \ \<rule​: comp_op> \<equals_operator> | \<not_equals_operator> | \<less_than_operator> | \<greater_than_operator> | \<less_than_or_equals_operator> | \<greater_than_or_equals_operator> \<rule​: between_predicate> \<row_value_constructor> (?​:NOT)? BETWEEN \<row_value_constructor> AND \<row_value_constructor> \<rule​: in_predicate> \<row_value_constructor> (?​:NOT)? IN \<in_predicate_value> \<rule​: in_predicate_value> \<left_paren> \<in_value_list> \<right_paren> \<rule​: in_value_list> \<question_mark> (?​:\ \<question_mark>)+ \<rule​: like_predicate> \<match_value> (?​:NOT)? LIKE \ (?​:ESCAPE \<escape_character>)? \<rule​: match_value> \<character_value_expression> \<rule​: pattern> \<character_value_expression> \<rule​: escape_character> \<character_value_expression> \<token​: null_predicate> IS (?​:NOT)? NULL \<rule​: quantified_comparison_predicate> \<row_value_constructor> \<comp_op> \ \<table_subquery> \<rule​: quantifier> \ | \ \<token​: all> ALL \<token​: some> SOME | ANY \<rule​: exists_predicate> EXISTS \<table_subquery> \<rule​: match_predicate> \<row_value_constructor> MATCH (?​:UNIQUE)? (?​:PARTIAL | FULL)? \<table_subquery> \<rule​: overlaps_predicate> \<row_value_constructor_1> OVERLAPS \<row_value_constructor_2> \<rule​: row_value_constructor_1> \<row_value_constructor> \<rule​: row_value_constructor_2> \<row_value_constructor> \<token​: truth_value> TRUE | FALSE | UNKNOWN

}x; print "ok\n"; zen%

Perl Info ``` Flags: category=core severity=medium Site configuration information for perl 5.20.0: Configured by hv at Thu Jun 26 22:38:47 BST 2014. Summary of my perl5 (revision 5 version 20 subversion 0) configuration: Platform: osname=linux, osvers=2.6.32-46-generic, archname=i686-linux uname='linux shad 2.6.32-46-generic #108-ubuntu smp thu apr 11 15:55:01 utc 2013 i686 gnulinux ' config_args='-des -Dprefix=/opt/perl-5.20.0 -Doptimize=-g -O6 -Dusedevel -Uversiononly' hint=recommended, useposix=true, d_sigaction=define useithreads=undef, usemultiplicity=undef use64bitint=undef, use64bitall=undef, uselongdouble=undef usemymalloc=n, bincompat5005=undef Compiler: cc='cc', ccflags ='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64', optimize='-g -O6', cppflags='-fwrapv -fno-strict-aliasing -pipe -fstack-protector -I/usr/local/include' ccversion='', gccversion='4.4.3', gccosandvers='' intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234 d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12 ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8 alignbytes=4, prototype=define Linker and Libraries: ld='cc', ldflags =' -fstack-protector -L/usr/local/lib' libpth=/usr/local/lib /usr/lib/gcc/i486-linux-gnu/4.4.3/include-fixed /usr/include/i486-linux-gnu /usr/lib /lib/../lib /usr/lib/../lib /lib /usr/lib/i486-linux-gnu /usr/lib64 libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lc perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc libc=libc-2.11.1.so, so=so, useshrplib=false, libperl=libperl.a gnulibc_version='2.11.1' Dynamic Linking: dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E' cccdlflags='-fPIC', lddlflags='-shared -g -O6 -L/usr/local/lib -fstack-protector' @INC for perl 5.20.0: /opt/perl-5.20.0/lib/site_perl/5.20.0/i686-linux /opt/perl-5.20.0/lib/site_perl/5.20.0 /opt/perl-5.20.0/lib/5.20.0/i686-linux /opt/perl-5.20.0/lib/5.20.0 . Environment for perl 5.20.0: HOME=/home/hv LANG=C LANGUAGE (unset) LD_LIBRARY_PATH (unset) LOGDIR (unset) PATH=/home/hv/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games PERL_BADLANG (unset) SHELL=/bin/bash ```
p5pRT commented 10 years ago

From DCONWAY@cpan.org

Further to Hugo's report...

I have now had the opportunity to investigate the problem\, and have concluded that this has nothing to do with Regexp​::Grammars per se\, except that R​::G is generating the enormous regex that 5.20 is failing to compile.

The attached example (constructed by removing all the R​::G syntactic sugar from Hugo's original) does not make use of Regex​::Grammars at all\, and still leaks endlessly under 5.20...whereas it compiles repidly and without complaint under all of​:

  5.10.1   5.12.5   5.14.4   5.16.3   5.18.2

I hope this additional information may be of help in tracking down the regression.

Damian

p5pRT commented 10 years ago

From DCONWAY@cpan.org

Perl5.20_regex_compilation_problem.pl

p5pRT commented 10 years ago

The RT System itself - Status changed from 'new' to 'open'

p5pRT commented 10 years ago

From @arc

Damian Conway via RT \perlbug\-followup@&#8203;perl\.org wrote​:

I have now had the opportunity to investigate the problem\, and have concluded that this has nothing to do with Regexp​::Grammars per se\, except that R​::G is generating the enormous regex that 5.20 is failing to compile.

Yes\, it looks that way to me too. Thanks for supplying that reduction.

The file attached cuts down this regex further still\, by removing all the embedded code blocks\, and the various DEFINEs whose names begin "______0_88".

The symptoms I observed seem to be the same\, though I also get a "panic​: memory wrap" error (apparently when passing 4 GiB of allocated memory).

In blead\, it looks like the immediate culprit is study_chunk() — it starts 185 ms after the start of the sizing pass\, and I haven't yet had the patience to let it run long enough to finish.

-- Aaron Crane ** http​://aaroncrane.co.uk/

p5pRT commented 10 years ago

From @arc

large_rx.pl

p5pRT commented 10 years ago

From @iabyn

On Mon\, Jul 14\, 2014 at 12​:34​:06AM +0100\, Aaron Crane wrote​:

Damian Conway via RT \perlbug\-followup@&#8203;perl\.org wrote​:

I have now had the opportunity to investigate the problem\, and have concluded that this has nothing to do with Regexp​::Grammars per se\, except that R​::G is generating the enormous regex that 5.20 is failing to compile.

Yes\, it looks that way to me too. Thanks for supplying that reduction.

The file attached cuts down this regex further still\, by removing all the embedded code blocks\, and the various DEFINEs whose names begin "______0_88".

The symptoms I observed seem to be the same\, though I also get a "panic​: memory wrap" error (apparently when passing 4 GiB of allocated memory).

In blead\, it looks like the immediate culprit is study_chunk() — it starts 185 ms after the start of the sizing pass\, and I haven't yet had the patience to let it run long enough to finish.

It bisects to the following. Yves...?

commit 099ec7dcf9e085a650e6d9010c12ad9649209bf4 Author​: Yves Orton \demerphq@&#8203;gmail\.com Date​: Fri Nov 22 01​:08​:39 2013 +0100

  Fix RT #120600​: Variable length lookbehind is not variable  
  Inside of study_chunk() we have to guard against infinite   recursion with recursive subpatterns. The existing logic   sort of worked\, but didn't address all cases properly.  
  qr/   (?\a)   (?\   (?=(?&W))(?\<=(?&W))   )   (?&BB)   /x;  
  The pattern in the test would fail when the optimizer   was expanding (&BB). When it recursed\, it creates a bitmap   for the recursion it performs\, it then jumps back to   the BB node and then eventually does the first (&W) call.   At this point the bit for (&W) would be set in the bitmask.   When the recursion for the (&W) exited (fake exit through   the study frame logic) the bit was not /unset/. When the parser   then entered the (&W) again it was treated as a nested and   potentially infinite length pattern.  
  The fake-recursion in study-chunk made it little less obvious   what was going on in the debug output.  
  By reorganizing the code and adding logic to unset the bitmap   when exiting this bug was fixed. Unfortunately this also revealed   another little issue with patterns like this​:  
  qr/x|(?0)/   qr/(x|(?1))/  
  which forced the creation of a new bitmask for each branch.   Effectively study_chunk treats each branch as an independent   pattern\, so when we are expanding (?1) via the 'x' branch   we dont want that to prevent us from detecting the infinite recursion   in the (?1) branch. If you were to think of trips through study_chunk   as paths\, and [] as recursive processing you would get something like​:  
  BRANCH 'x' END   BRANCH (?0) [ 'x' END ]   BRANCH (?0) [ (?0) [ 'x' END ] ]   ...  
  When we want something like​:  
  BRANCH 'x' END   BRANCH (?0) [ 'x' END ]   BRANCH (?0) [ (?0) INFINITE_RECURSION ]  
  So when we deal with a branch we need to make a new recursion bitmask.

-- "Foul and greedy Dwarf - you have eaten the last candle."   -- "Hordes of the Things"\, BBC Radio.

p5pRT commented 10 years ago

From @demerphq

The only useful thing I have to add /right now/ is that I am glad I wrote a decent commit message. :-)

On 14 July 2014 12​:13\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

On Mon\, Jul 14\, 2014 at 12​:34​:06AM +0100\, Aaron Crane wrote​:

Damian Conway via RT \perlbug\-followup@&#8203;perl\.org wrote​:

I have now had the opportunity to investigate the problem\, and have concluded that this has nothing to do with Regexp​::Grammars per se\, except that R​::G is generating the enormous regex that 5.20 is failing to compile.

Yes\, it looks that way to me too. Thanks for supplying that reduction.

The file attached cuts down this regex further still\, by removing all the embedded code blocks\, and the various DEFINEs whose names begin "______0_88".

The symptoms I observed seem to be the same\, though I also get a "panic​: memory wrap" error (apparently when passing 4 GiB of allocated memory).

In blead\, it looks like the immediate culprit is study_chunk() — it starts 185 ms after the start of the sizing pass\, and I haven't yet had the patience to let it run long enough to finish.

It bisects to the following. Yves...?

commit 099ec7dcf9e085a650e6d9010c12ad9649209bf4 Author​: Yves Orton \demerphq@&#8203;gmail\.com Date​: Fri Nov 22 01​:08​:39 2013 +0100

Fix RT \#120600&#8203;: Variable length lookbehind is not variable

Inside of study\_chunk\(\) we have to guard against infinite
recursion with recursive subpatterns\. The existing logic
sort of worked\, but didn't address all cases properly\.

  qr/
    \(?\<W>a\)
    \(?\<BB>
      \(?=\(?&W\)\)\(?\<=\(?&W\)\)
    \)
    \(?&BB\)
  /x;

The pattern in the test would fail when the optimizer
was expanding \(&BB\)\. When it recursed\, it creates a bitmap
for the recursion it performs\, it then jumps back to
the BB node and then eventually does the first \(&W\) call\.
At this point the bit for \(&W\) would be set in the bitmask\.
When the recursion for the \(&W\) exited \(fake exit through
the study frame logic\) the bit was not /unset/\. When the parser
then entered the \(&W\) again it was treated as a nested and
potentially infinite length pattern\.

The fake\-recursion in study\-chunk made it little less obvious
what was going on in the debug output\.

By reorganizing the code and adding logic to unset the bitmap
when exiting this bug was fixed\. Unfortunately this also revealed
another little issue with patterns like this&#8203;:

  qr/x|\(?0\)/
  qr/\(x|\(?1\)\)/

which forced the creation of a new bitmask for each branch\.
Effectively study\_chunk treats each branch as an independent
pattern\, so when we are expanding \(?1\) via the 'x' branch
we dont want that to prevent us from detecting the infinite recursion
in the \(?1\) branch\. If you were to think of trips through study\_chunk
as paths\, and \[\] as recursive processing you would get something like&#8203;:

  BRANCH 'x' END
  BRANCH \(?0\) \[ 'x' END \]
  BRANCH \(?0\) \[ \(?0\) \[ 'x' END \] \]
  \.\.\.

When we want something like&#8203;:

  BRANCH 'x' END
  BRANCH \(?0\) \[ 'x' END \]
  BRANCH \(?0\) \[ \(?0\) INFINITE\_RECURSION \]

So when we deal with a branch we need to make a new recursion bitmask\.

-- "Foul and greedy Dwarf - you have eaten the last candle." -- "Hordes of the Things"\, BBC Radio.

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 10 years ago

From @khwilliamson

On 07/14/2014 04​:13 AM\, Dave Mitchell wrote​:

It bisects to the following

I'm curious as to how you bisected this. When I tried running Aaron's script on my machine\, it quickly ate up all the memory available. What I was planning to do to bisect it was to add a call to setrlimit() to perlmain.c to cause it to die when it used up a much smaller amount of memory\, long before my machine starts thrashing. But perhaps you have a better way that would be educational for me and others to hear about.

p5pRT commented 10 years ago

From @iabyn

On Mon\, Jul 14\, 2014 at 12​:15​:46PM -0600\, Karl Williamson wrote​:

On 07/14/2014 04​:13 AM\, Dave Mitchell wrote​:

It bisects to the following

I'm curious as to how you bisected this. When I tried running Aaron's script on my machine\, it quickly ate up all the memory available. What I was planning to do to bisect it was to add a call to setrlimit() to perlmain.c to cause it to die when it used up a much smaller amount of memory\, long before my machine starts thrashing. But perhaps you have a better way that would be educational for me and others to hear about.

I just started a new shell and did

  $ ulimit -v 500000

then ran the bisect.

(I had to experiment for a minute or so to find a suitable limit that ran ok on 5.18.0 and died quickly on 5.2.0.)

-- My get-up-and-go just got up and went.

p5pRT commented 10 years ago

From dana@acm.org

A tool I found useful for this is massif from the valgrind suite\, e.g.​:

valgrind --tool=massif --stacks=yes --alloc-fn=Perl_safesysmalloc --alloc-fn=Perl_safesyscalloc --alloc-fn=Perl_safesysrealloc --alloc-fn=Perl_Slab_Alloc --time-unit=B --max-snapshots=1000 perl -MMath​::Prime​::XS=​:all -E 'say 1'

(in this case\, load up a module and do basically nothing else)\, then​:

massif-visualizer massif.out.#### [#### depending on the file]

to see the graphical results.

This shows\, for example\, a memory spike from Perl__invlist_union_maybe_complement_2nd that shows up in 5.20.0 and 5.21.2 that is not in 5.19.7 when processing constant.pm's​:

  my $normal_constant_name = qr/^_?[^\W_0-9]\w*\z/;

which means it hits lots of modules. Tracking down the Perl source that causes given memory behavior is not very straightforward\, but I think the tool is pretty valuable for seeing how memory is being used\, and what is causing the use\, over time.

p5pRT commented 10 years ago

From @khwilliamson

On 07/22/2014 12​:29 PM\, Dana Jacobsen via RT wrote​:

A tool I found useful for this is massif from the valgrind suite\, e.g.​:

valgrind --tool=massif --stacks=yes --alloc-fn=Perl_safesysmalloc --alloc-fn=Perl_safesyscalloc --alloc-fn=Perl_safesysrealloc --alloc-fn=Perl_Slab_Alloc --time-unit=B --max-snapshots=1000 perl -MMath​::Prime​::XS=​:all -E 'say 1'

(in this case\, load up a module and do basically nothing else)\, then​:

massif-visualizer massif.out.#### [#### depending on the file]

to see the graphical results.

This shows\, for example\, a memory spike from Perl__invlist_union_maybe_complement_2nd that shows up in 5.20.0 and 5.21.2 that is not in 5.19.7 when processing constant.pm's​:

my $normal_constant_name = qr/^_?[^\W_0-9]\w*\z/;

which means it hits lots of modules. Tracking down the Perl source that causes given memory behavior is not very straightforward\, but I think the tool is pretty valuable for seeing how memory is being used\, and what is causing the use\, over time.

--- via perlbug​: queue​: perl5 status​: open https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122283

The memory spike occurs when taking the union of two lists. At the beginning\, it allocates enough memory for the worst case scenario\, in which the lists are completely disjoint\, so the memory required is the sum of the memory required by each list. At the end\, the resultant list is trimmed to the actual amount used.

This is done to avoid having to ask for extra memory in the middle of the operation and potentially have to do extra copies.

p5pRT commented 10 years ago

From @demerphq

On 14 July 2014 12​:13\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

On Mon\, Jul 14\, 2014 at 12​:34​:06AM +0100\, Aaron Crane wrote​:

Damian Conway via RT \perlbug\-followup@&#8203;perl\.org wrote​:

I have now had the opportunity to investigate the problem\, and have concluded that this has nothing to do with Regexp​::Grammars per se\, except that R​::G is generating the enormous regex that 5.20 is failing to compile.

Yes\, it looks that way to me too. Thanks for supplying that reduction.

The file attached cuts down this regex further still\, by removing all the embedded code blocks\, and the various DEFINEs whose names begin "______0_88".

The symptoms I observed seem to be the same\, though I also get a "panic​: memory wrap" error (apparently when passing 4 GiB of allocated memory).

In blead\, it looks like the immediate culprit is study_chunk() — it starts 185 ms after the start of the sizing pass\, and I haven't yet had the patience to let it run long enough to finish.

It bisects to the following. Yves...?

commit 099ec7dcf9e085a650e6d9010c12ad9649209bf4 Author​: Yves Orton \demerphq@&#8203;gmail\.com Date​: Fri Nov 22 01​:08​:39 2013 +0100

Fix RT \#120600&#8203;: Variable length lookbehind is not variable

Inside of study\_chunk\(\) we have to guard against infinite
recursion with recursive subpatterns\. The existing logic
sort of worked\, but didn't address all cases properly\.

  qr/
    \(?\<W>a\)
    \(?\<BB>
      \(?=\(?&W\)\)\(?\<=\(?&W\)\)
    \)
    \(?&BB\)
  /x;

The pattern in the test would fail when the optimizer
was expanding \(&BB\)\. When it recursed\, it creates a bitmap
for the recursion it performs\, it then jumps back to
the BB node and then eventually does the first \(&W\) call\.
At this point the bit for \(&W\) would be set in the bitmask\.
When the recursion for the \(&W\) exited \(fake exit through
the study frame logic\) the bit was not /unset/\. When the parser
then entered the \(&W\) again it was treated as a nested and
potentially infinite length pattern\.

The fake\-recursion in study\-chunk made it little less obvious
what was going on in the debug output\.

By reorganizing the code and adding logic to unset the bitmap
when exiting this bug was fixed\. Unfortunately this also revealed
another little issue with patterns like this&#8203;:

  qr/x|\(?0\)/
  qr/\(x|\(?1\)\)/

which forced the creation of a new bitmask for each branch\.
Effectively study\_chunk treats each branch as an independent
pattern\, so when we are expanding \(?1\) via the 'x' branch
we dont want that to prevent us from detecting the infinite recursion
in the \(?1\) branch\. If you were to think of trips through study\_chunk
as paths\, and \[\] as recursive processing you would get something like&#8203;:

  BRANCH 'x' END
  BRANCH \(?0\) \[ 'x' END \]
  BRANCH \(?0\) \[ \(?0\) \[ 'x' END \] \]
  \.\.\.

When we want something like&#8203;:

  BRANCH 'x' END
  BRANCH \(?0\) \[ 'x' END \]
  BRANCH \(?0\) \[ \(?0\) INFINITE\_RECURSION \]

So when we deal with a branch we need to make a new recursion bitmask\.

I will try to find some time for this.

Yves

p5pRT commented 10 years ago

From @demerphq

On 14 July 2014 12​:13\, Dave Mitchell \davem@&#8203;iabyn\.com wrote​:

On Mon\, Jul 14\, 2014 at 12​:34​:06AM +0100\, Aaron Crane wrote​:

Damian Conway via RT \perlbug\-followup@&#8203;perl\.org wrote​:

I have now had the opportunity to investigate the problem\, and have concluded that this has nothing to do with Regexp​::Grammars per se\, except that R​::G is generating the enormous regex that 5.20 is failing to compile.

Yes\, it looks that way to me too. Thanks for supplying that reduction.

The file attached cuts down this regex further still\, by removing all the embedded code blocks\, and the various DEFINEs whose names begin "______0_88".

The symptoms I observed seem to be the same\, though I also get a "panic​: memory wrap" error (apparently when passing 4 GiB of allocated memory).

In blead\, it looks like the immediate culprit is study_chunk() — it starts 185 ms after the start of the sizing pass\, and I haven't yet had the patience to let it run long enough to finish.

It bisects to the following. Yves...?

commit 099ec7dcf9e085a650e6d9010c12ad9649209bf4 Author​: Yves Orton \demerphq@&#8203;gmail\.com Date​: Fri Nov 22 01​:08​:39 2013 +0100

Fix RT \#120600&#8203;: Variable length lookbehind is not variable

Inside of study\_chunk\(\) we have to guard against infinite
recursion with recursive subpatterns\. The existing logic
sort of worked\, but didn't address all cases properly\.

  qr/
    \(?\<W>a\)
    \(?\<BB>
      \(?=\(?&W\)\)\(?\<=\(?&W\)\)
    \)
    \(?&BB\)
  /x;

The pattern in the test would fail when the optimizer
was expanding \(&BB\)\. When it recursed\, it creates a bitmap
for the recursion it performs\, it then jumps back to
the BB node and then eventually does the first \(&W\) call\.
At this point the bit for \(&W\) would be set in the bitmask\.
When the recursion for the \(&W\) exited \(fake exit through
the study frame logic\) the bit was not /unset/\. When the parser
then entered the \(&W\) again it was treated as a nested and
potentially infinite length pattern\.

The fake\-recursion in study\-chunk made it little less obvious
what was going on in the debug output\.

By reorganizing the code and adding logic to unset the bitmap
when exiting this bug was fixed\. Unfortunately this also revealed
another little issue with patterns like this&#8203;:

  qr/x|\(?0\)/
  qr/\(x|\(?1\)\)/

which forced the creation of a new bitmask for each branch\.
Effectively study\_chunk treats each branch as an independent
pattern\, so when we are expanding \(?1\) via the 'x' branch
we dont want that to prevent us from detecting the infinite recursion
in the \(?1\) branch\. If you were to think of trips through study\_chunk
as paths\, and \[\] as recursive processing you would get something like&#8203;:

  BRANCH 'x' END
  BRANCH \(?0\) \[ 'x' END \]
  BRANCH \(?0\) \[ \(?0\) \[ 'x' END \] \]
  \.\.\.

When we want something like&#8203;:

  BRANCH 'x' END
  BRANCH \(?0\) \[ 'x' END \]
  BRANCH \(?0\) \[ \(?0\) INFINITE\_RECURSION \]

So when we deal with a branch we need to make a new recursion bitmask\.

Some basic details on this issue​:

In order to detect infinite recursion\, and to expand certain constructs to make match more efficient\, we make study_chunk walk every possible pathway through the graph formed by the grammar.

So for instance if we have node C which uses D which uses E\, and we can reach C from both A and B then we will walk the full C-D-E twice.

One could probably argue this is wrong and that we should somehow store that C-D-E is "safe" when first hit it\, and then avoid walking it a second time.

This is coupled with the naive bitmask strategy for marking which nodes we have seen.

More to come later.

Yves

p5pRT commented 9 years ago

From @demerphq

On 13 July 2014 16​:27\, Hugo van der Sanden \perlbug\-followup@&#8203;perl\.org wrote​:

# New Ticket Created by Hugo van der Sanden # Please include the string​: [perl #122283] # in the subject line of all future correspondence about this issue. # \<URL​: https://rt-archive.perl.org/perl5/Ticket/Display.html?id=122283 >

This is a bug report for perl from hv@​crypt.org\, generated with the help of perlbug 1.40 running under perl 5.20.0.

----------------------------------------------------------------- [Please describe your issue here]

I've been experimenting with an attempt to take a SQL grammar expressed in BNF and convert it (programmatically) into something that can parse SQL with it as a Regexp​::Grammars (v1.035) grammar.

The code below is (60%) cut down from an interim stage in that process; this reaches about 10MB process size under perl-5.16.3; under perl-5.20.0 it grows to over 1GB. Cutting down the grammar rule by rule does gradually reduce the memory use\, but it remains a high multiple of the memory use under perl-5.16.3\, and I've not yet found any smoking gun; I've included the full 200-odd lines here rather than risk eliding something important.

Damain and I are looking into it\, but he suggested I perlbug it as a heads-up of a possible problem in 5.20\, likely of interest to davem as potentially relating to regexp engine changes.

zen% ulimit -v # I've set a 1GB process-size limit 1000000 zen% /usr/bin/time /opt/perl-5.16.3/bin/perl ./t0 # top(1) shows peak 10MB VIRT ok 8.52user 0.01system 0​:08.54elapsed 99%CPU (0avgtext+0avgdata 34816maxresident)k 0inputs+0outputs (0major+2331minor)pagefaults 0swaps zen% /usr/bin/time /opt/perl-5.20.0/bin/perl ./t0 Out of memory! Command exited with non-zero status 1 41.59user 2.10system 0​:43.83elapsed 99%CPU (0avgtext+0avgdata 3641344maxresident)k 0inputs+0outputs (0major+228082minor)pagefaults 0swaps zen% cat t0 #!/opt/perl-5.20.0/bin/perl use strict; use warnings; use Regexp​::Grammars;

my $g = qr{ ^ \<query_specification> $

\<rule​: simple_Latin_letter> \<simple_Latin_upper_case_letter> | \<simple_Latin_lower_case_letter> \<token​: simple_Latin_upper_case_letter> A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z \<token​: simple_Latin_lower_case_letter> a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z \<token​: digit> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

You really shoud use character classes here\, and not use regex subs for insertable literals. IOW\, (?&digit) should be replaced with $digit which would be defined as​:

$digit= "[0-9]"

Similar for (?&ws) and similar patterns.

Anyway\, I have pushed the following commit which should fix this. Please test.

commit a51d618a82a7057c3aabb600a7a8691d27f44a34 Author​: Yves Orton \demerphq@&#8203;gmail\.com Date​: Fri Sep 19 19​:57​:34 2014 +0200

  rt 122283 - do not recurse into GOSUB/GOSTART when not SCF_DO_SUBSTR

  See also comments in patch. A complex regex "grammar" like that in   RT 122283 causes perl to take literally forever\, and exhaust all   memory during the pattern optimization phase.

  Unfortunately I could not track down exacty why this occured\, but   it was very clear that the excessive recursion was unnecessary and   excessive. By simply eliminating the unncessary recursion performance   goes back to being acceptable.

  I have not thought of a good way to test this change\, so this patch   does not include any tests. Perhaps we can test it using alarm\, but   I will follow up on that later.

Ticket closers​: please dont close the ticket until I have reported that I have applied tests for this.

cheers\, Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 9 years ago

From @hvds

demerphq \demerphq@&#8203;gmail\.com wrote​: :Anyway\, I have pushed the following commit which should fix this. Please :test. : :commit a51d618a82a7057c3aabb600a7a8691d27f44a34 :Author​: Yves Orton \demerphq@&#8203;gmail\.com :Date​: Fri Sep 19 19​:57​:34 2014 +0200 : : rt 122283 - do not recurse into GOSUB/GOSTART when not SCF_DO_SUBSTR

Thanks Yves\, I'll try this out over the weekend.

Hugo

p5pRT commented 9 years ago

From @cpansprout

On Thu Sep 25 00​:42​:31 2014\, demerphq wrote​:

Anyway\, I have pushed the following commit which should fix this. Please test.

commit a51d618a82a7057c3aabb600a7a8691d27f44a34 Author​: Yves Orton \demerphq@&#8203;gmail\.com Date​: Fri Sep 19 19​:57​:34 2014 +0200

Ticket closers​: please dont close the ticket until I have reported that I have applied tests for this.

You applied tests as d9a72fccda. Does that mean this can be closed?

(a51d618 is the cause of bug #122890\, but I don’t think this needs to stay open because of that.)

--

Father Chrysostomos

p5pRT commented 9 years ago

From @demerphq

On 5 October 2014 03​:55\, Father Chrysostomos via RT \< perlbug-followup@​perl.org> wrote​:

On Thu Sep 25 00​:42​:31 2014\, demerphq wrote​:

Anyway\, I have pushed the following commit which should fix this. Please test.

commit a51d618a82a7057c3aabb600a7a8691d27f44a34 Author​: Yves Orton \demerphq@&#8203;gmail\.com Date​: Fri Sep 19 19​:57​:34 2014 +0200

Ticket closers​: please dont close the ticket until I have reported that I have applied tests for this.

You applied tests as d9a72fccda. Does that mean this can be closed?

(a51d618 is the cause of bug #122890\, but I don’t think this needs to stay open because of that.

It is up to you. I would leave it open\, but if you think its better to close it then I am happy with that.

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

p5pRT commented 9 years ago

@cpansprout - Status changed from 'open' to 'resolved'