Closed AnFunctionArray closed 1 year ago
On further inspection, this value:
mgs_ix=-2147452408
which is used as a pointer (it is originally an int), may actually be getting truncated somewhere, I strongly suspect.
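To make the suspicion concrete, here is a minimal sketch of the arithmetic, assuming (and this is only an assumption, the thread does not confirm it) that the negative number is the result of a 32-bit signed wraparound. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $observed = -2147452408;    # value reported for mgs_ix

# Reinterpret the signed 32-bit value as an unsigned 32-bit value
# (assumption: the number went through a 32-bit signed integer).
my ($unsigned) = unpack 'L', pack 'l', $observed;

# How far the unsigned view lies beyond the largest signed 32-bit value.
my $int32_max = 2**31 - 1;
my $excess    = $unsigned - $int32_max;

# Converting back to signed 32-bit reproduces the reported value.
my ($wrapped) = unpack 'l', pack 'L', $unsigned;

print "unsigned view: $unsigned\n";
print "excess over INT32_MAX: $excess\n";
print "wrapped back: $wrapped\n";
```

If that assumption holds, the value is only slightly past the signed 32-bit limit, which would fit a counter or index that grew beyond that limit rather than a random corruption.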
I also made a reproducible file (that is not copyrighted). It basically consists of 123416 lines of extern char dummy[2009];
You can get it here if that's easier (you can also copy-paste the above line enough times): noncopyrightsample.pp.zip
You can run it with my https://github.com/AnFunctionArray/cperllexer (basically perl ./parse.pl noncopyrightsample.pp from the project dir).
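As an alternative to downloading the zip or copy-pasting, a short script along these lines should produce an equivalent input file. This is a sketch: it assumes each line ends in a plain newline and that nothing else is in the file, which may differ from the attached sample.

```perl
use strict;
use warnings;

my $file  = 'noncopyrightsample.pp';    # file name used in the report
my $lines = 123416;                     # line count given in the report

open my $fh, '>', $file or die "Cannot open $file: $!";

# Write the same declaration once per line.
print {$fh} "extern char dummy[2009];\n" x $lines;

close $fh or die "Cannot close $file: $!";
```

The resulting file can then be passed to parse.pl as described above.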
But I don't think this is the root cause of the problem: the save stack entries from save_magic(), and further entries generated during some of the (?{...}) code, aren't being cleaned up, so the save stack keeps getting larger and larger.
Normally this would be handled by wrapping an ENTER/LEAVE pair around the calling code, but I have vague memories of there being a reason this wasn't being done. Do you have any ideas @iabyn ?
Lets ge the optimization patch merged.
Nah it was the fix. I don't know if it was a mistake or not tbh. Because @tonycoz said he was going to investigate it further.
I don't complain - it did fix things.
@AnFunctionArray I just think we should get the optimization patch you wrote merged; my comment wasn't about this ticket. Just a reminder to @tonycoz and @khwilliamson and me to get your optimization patch reviewed and merged.
I'm a little bit tired and read it as something of the sort "Lets go the optimization patch merged." Sorry. Yeah, I don't mind.
On Tue, Nov 01, 2022 at 05:05:04PM -0700, Tony Cook wrote:
> Normally this would be handled by wrapping an ENTER/LEAVE pair around the calling code, but I have vague memories of there being a reason this wasn't being done. Do you have any ideas @iabyn ?
When code blocks were first added to regexes (before even my time!) it was decided that 'local' should accumulate across iterations (but be undone when backtracking) rather than being undone at the end of each code block.
I've always hated this, as it makes it harder internally (as if code blocks in patterns wasn't already complex enough...)
On Mon, 7 Nov 2022, 16:37 iabyn, @.***> wrote:
> On Tue, Nov 01, 2022 at 05:05:04PM -0700, Tony Cook wrote:
> > Normally this would be handled by wrapping an ENTER/LEAVE pair around the calling code, but I have vague memories of there being a reason this wasn't being done. Do you have any ideas @iabyn ?
> When code blocks were first added to regexes (before even my time!) it was decided that 'local' should accumulate across iterations (but be undone when backtracking) rather than being undone at the end of each code block.
> I've always hated this, as it makes it harder internally (as if code blocks in patterns wasn't already complex
It sounds like you think this should be changed, should we dig into it and see if we can change it?
Yves
On Tue, Nov 08, 2022 at 01:31:56AM -0800, Yves Orton wrote:
> On Mon, 7 Nov 2022, 16:37 iabyn, @.***> wrote:
> > When code blocks were first added to regexes (before even my time!) it was decided that 'local' should accumulate across iterations (but be undone when backtracking) rather than being undone at the end of each code block.
> > I've always hated this, as it makes it harder internally (as if code blocks in patterns wasn't already complex
> It sounds like you think this should be changed, should we dig into it and see if we can change it?
Well, it's behaviour that (IIRC) is documented in the Camel Book - it's certainly a feature, not a bug. So although it has made my life hard from time to time when messing in the internals, I've always accepted it and worked around it. I don't think we could change it without breaking stuff.
> I don't think we could change it without breaking stuff.
Changing it would certainly break a lot of my stuff, some of which is still in production. I consider it an essential feature for non-trivial recursive regexps such as grammars.
@hvds can you work out a simple example script to demonstrate what this provides? I don't want or intend to break anything, but I would like to understand the intent and background here (and maybe take the time to document it somewhere).
Maybe I misunderstand. My understanding is that in code like this:
/(?{ ... })/
we do not collect locals when the block ends, but we do on backtracking. But I can't quite picture in my mind what this enables exactly. @iabyn described what is supposed to happen and said this is documented in the Camel Book, but didn't mention where. If you can come up with a simple demo it would be helpful. I will review the Camel Book, but I suspect that since you care about this you can come up with an example fairly directly.
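For reference, a minimal sketch of the behaviour under discussion, adapted from the example in the perlre documentation (the strict/warnings preamble and the our declaration are additions so that it runs standalone). It is meant only to illustrate "local accumulates across iterations but is undone on backtracking", not the more elaborate grammar-style uses mentioned in the thread.

```perl
use strict;
use warnings;

# local() needs package variables, not lexicals.
our ($cnt, $res);

$_ = 'a' x 8;
m<
    (?{ $cnt = 0 })
    (
        a
        (?{ local $cnt = $cnt + 1; })
    )*
    aaaa
    (?{ $res = $cnt })
>x;

# The starred group first consumes all eight "a"s, localizing $cnt
# on every iteration. To let "aaaa" match, the engine backtracks over
# four iterations, and each localization it backs out of is undone.
print "iterations that survived backtracking: $res\n";
```

Because the localized values are not unwound at the end of each code block, $cnt still reflects the surviving iterations when the final block copies it into $res.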
@demerphq That's interesting - I personally don't think I use that. But if it works like this: does Perl have destructors? I could possibly use this as a way to catch backtracking. Currently I have this:
(?(?=(pattern))\g{-1}(?{code tru})|(?{code fals})(*F))
But maybe it could be more elegantly written - with this feature.
@AnFunctionArray What do you mean "catch backtracking"? In theory we could have a code block that executes only when traversed into via backtracking. Eg, something like this:
/PAT_A(?-{ print "this only prints if the thing following it fails to match"})PAT_B/
So the (?-{}) would be treated as a zero width always accepting assertion, but when backtracked into would execute its contents like a (?{ }) would. Put another way it would execute when entered "from the right", the opposite of (?{...}) which is a zero width always accepting assertion which executes when entered "from the left". FWIW, it can simplify things to think of what regops do when entered from the left or right, that is what they do when they are supposed to match (from the left), and what they do when they are backtracked into (from the right).
@AnFunctionArray if you are interested in this stuff maybe try reaching out to me on the #p5p irc channel.
@demerphq I'm definitely interested in this stuff, and maybe I'll join, but I have an issue with the fact that you must be constantly online to keep up with the news there.
@demerphq - maybe you'd be interested in joining our discord server? @tonycoz too, if you like to use discord.
> @hvds can you work out a simple example script to demonstrate what this provides?
I don't have access to the serious examples, those were all at work.
My crossword-helper program provides some less serious examples. Throughout, we may use $succeed = qr{(?=)}, $fail = qr{(?!)}.
Here's a pattern that matches words that are an anagram of ab...:
/^(?:(?=.*a)(?=.*b).{5,5})\z/oi
While this matches words that are an anagram of a subset of ab...:
/^(?{ @d = (3) })(?:(?:a(?!.*a)|b(?!.*b)|.(??{
local $d[0] = $d[0] - 1; $d[0] >= 0 ? $succeed : $fail
}))+)\z/oi
This matches an anagram of a[ab]...:
/^(?{
$d[0] = [ [ 1, 1 ], [ 3, 3 ], [ 1, 1 ] ];
$e[0] = [ (0) x 3 ];
})(?:(?:a(??{
local $e[0][0] = $e[0][0] + 1;
$e[0][0] > $d[0][0][1] ? $fail : $succeed
})|.(??{
local $e[0][1] = $e[0][1] + 1;
$e[0][1] > $d[0][1][1] ? $fail : $succeed
})|[ab](??{
local $e[0][2] = $e[0][2] + 1;
$e[0][2] > $d[0][2][1] ? $fail : $succeed
}))*(??{
(grep $e[0][$_] < $d[0][$_][0], 0 .. 2) ? $fail : $succeed
}))\z/oi
And this matches an anagram of a subset of a[ab]...:
/^(?{ @d = (3, 3) })(?:(?:b(?!.*b)|a(?!.*a)|.(??{
local $d[0] = $d[0] - 1; $d[0] >= 0 ? $succeed : $fail
}))+|(?:a(?!.*a.*a)|.(??{
local $d[1] = $d[1] - 1; $d[1] >= 0 ? $succeed : $fail
}))+)\z/oi
@hvds I used to do this but I reckon it was slow so I switched to:
(?(?{checkident()})|(*F))
But I'm not sure how this relates to locals being kept until backtrack (if I understand the feature in question).
@AnFunctionArray we are having a near synchronous conversation in github ticket comments, IMO #p5p would make that process quite a bit more efficient.
@demerphq I've written there - It's MAGnet right?
I still haven't exactly figured out why but here is some debug info nevertheless:
The above is on commit:
Plus my optimisation patch (which btw still cuts around half of the execution time - just FYI)
But it was crashing without it as well (and on blead).
The regex is (the executed part at least):
perl -V:
It's not my RAM running out, because I have 23 GB.
Some more info (with -O0 build):