Perl / perl5

🐪 The Perl programming language
https://dev.perl.org/perl5/
Other
1.92k stars 551 forks source link

Collected issues related to @INC hooks. #20538

Open demerphq opened 1 year ago

demerphq commented 1 year ago

Description Perl supports "hooks" in the @INC array. The original idea was to allow for source filters, but because they were exposed as sub/method calls to the user the API opens up a number of troublesome issues which unfortunately the code does not account for. We have had bugs related to this previously, but we only fixed them superficially, we never addressed the fundamental issues involved.

The main problem is that @INC hooks can and do modify the @INC array, and the behavior when they do so is pretty much accidental as opposed to intended, and can even result in accesses freed memory.

So the first issue is that code like this:

perl -le'BEGIN { push @INC, sub { warn "overwriting INC array in glob"; *INC=[sub{warn "C"}]; }, sub{ warn "in B" } }; eval "require Frobnitz" or die $@'

will access the original @INC array after it is freed. This just happens to work out to just not do anything as it sees the freed array as having 0 elements and stops processing entirely. (It may be that the sub itself is also destroyed before it finishes executing, i havent checked) Arguably this code should never access freed memory and should output this:

$ ./perl -le'BEGIN { push @INC, sub { warn "overwriting INC array in glob"; *INC=[sub{warn "in C"}]; }, sub{ warn "in B" } }; eval "require Frobnitz" or die $@' 
overwriting INC array in glob at -e line 1.
in C at -e line 1.
Can't locate Frobnitz.pm in @INC (you may need to install the Frobnitz module) (@INC contains: CODE(0x556edbddb900)) at (eval 1) line 1.

Another interesting case is this:

perl -le'my $counter; BEGIN { push @INC, sub { warn "unshifting INC"; unshift @INC, sub{warn "in shift ",++$counter}; }, sub{ warn "in B" } }; eval "require Frobnitz" or die $@' 2>&1 | head

which outputs:

perl -le'my $counter; BEGIN { push @INC, sub { warn "unshifting INC"; unshift @INC, sub{warn "in shift ",++$counter}; }, sub{ warn "in B" } }; eval "require Frobnitz" or die $@' 2>&1 | head 
unshifting INC at -e line 1.
unshifting INC at -e line 1.
unshifting INC at -e line 1.
unshifting INC at -e line 1.
unshifting INC at -e line 1.
unshifting INC at -e line 1.
unshifting INC at -e line 1.
unshifting INC at -e line 1.
unshifting INC at -e line 1.
unshifting INC at -e line 1.
....

eg, it infinite loops because the array keeps growing and the callback keeps getting called over and over. IMO it would be saner for all if this output this:

$ ./perl -le'my $counter; BEGIN { push @INC, sub { warn "A unshifting INC"; unshift @INC, sub{warn "C in shift ",++$counter}; }, sub{ warn "in B" } }; eval "require Frobnitz" or die $@' 2>&1 | head 
A unshifting INC at -e line 1.
C in shift 1 at -e line 1.
in B at -e line 1.
Can't locate Frobnitz.pm in @INC (you may need to install the Frobnitz module) (@INC contains: CODE(0x559a13e94638) /home/yorton/perl5/perlbrew/perls/latest_blead/lib/site_perl/5.37.7/x86_64-linux-thread-multi /home/yorton/perl5/perlbrew/perls/latest_blead/lib/site_perl/5.37.7 /home/yorton/perl5/perlbrew/perls/latest_blead/lib/5.37.7/x86_64-linux-thread-multi /home/yorton/perl5/perlbrew/perls/latest_blead/lib/5.37.7 CODE(0x559a13e85910) CODE(0x559a13e857c0)) at (eval 1) line 1.

or with the reverse order of the "B" and "C" lines.

Another bug is this:

norole:yorton@oncidium:blead:~/git_tree/perl$ perl -le'my $counter; BEGIN { push @INC, bless(sub { warn "A"; }), sub{ warn "B" } }; eval "require Frobnitz" or die $@' 2>&1 | head 
Can't locate object method "INC" via package "main" at (eval 1) line 1.

why is it calling the INC method on an object that does not support it, and why doesnt the error message say "in @INC hook" or something to that effect? IMO this should output the following:

$ perl -le'my $counter; BEGIN { push @INC, sub { warn "A"; }, sub{ warn "B" } }; eval "require Frobnitz" or die $@' 2>&1 | head 
A at -e line 1.
B at -e line 1.
Can't locate Frobnitz.pm in @INC (you may need to install the Frobnitz module) (@INC contains: /home/yorton/perl5/perlbrew/perls/perl-5.34.1/lib/site_perl/5.34.1/x86_64-linux /home/yorton/perl5/perlbrew/perls/perl-5.34.1/lib/site_perl/5.34.1 /home/yorton/perl5/perlbrew/perls/perl-5.34.1/lib/5.34.1/x86_64-linux /home/yorton/perl5/perlbrew/perls/perl-5.34.1/lib/5.34.1 CODE(0x5649a78a39d8) CODE(0x5649a78a3a68)) at (eval 1) line 1.

Another issue is that @INC hooks and the errors that are generated from them are not "crazy proofed", in the sense that you can tie @INC, and then have it return a list of places to look, but then when the error message is constructed show a totally different set of places, that might not even have been inspected. I am not showing an example, but the way the code is structured it is entirely possible. It is somewhat debatable if they should be crazy proofed, but personally i think they should be. A simple example without tie is this:

perl -le'my $counter; BEGIN { push @INC, sub { @INC=()}, }; require Frobnitz;' 2>&1 | head 
Can't locate Frobnitz.pm in @INC (you may need to install the Frobnitz module) (@INC contains:) at -e line 1.

I would argue that the error message should show where we checked, not an empty list.

IMO the @INC hooks should ensure that any given ref is looked at only once during traversal of the @INC array, and that after an @INC returns we should either a) restart processing from the beginning, ignoring anything we have already seen, or b) continue and wrap around to do a full traversal back to the most recently executed @INC hook, assuming that index is still reachable. the former is easier to implement, but the latter might be more efficient. We also need to assume that the old @INC array no longer exists after any @INC hook callback.

Hopefully it is possibly to clean up these issues without imposing any noticeable costs on people doing lots of require statements.

Steps to Reproduce Repro steps are above.

Expected behavior

Perl configuration Not relevant. This ticket is correct to the best of knowledge at 5.37.7

haarg commented 1 year ago

There is currently code in core which relies on @INC traversal being essentially by index. It is a hook which will splice itself out of @INC and expects that the following entry (which now has the same index the hook did originally) to be skipped.

https://github.com/Perl/perl5/blob/blead/dist/base/lib/base.pm#L108

demerphq commented 1 year ago

@haarg: yeah, ill check up on that. but at a quick glance that is exactly the kind of hokey code that I am trying to make redundant. The INC logic is basically broken. People doing interesting things with it have had to work around the brokenness, and maybe in some cases exploited it. Ideally we can make it sane without too much fallout, but i suspect we will see a bit.

demerphq commented 1 year ago

@haarg, i need to play with it more (and im on the way out to dinner) but a closer read suggests that this code might even work unchanged if I were to do the "wrap around approach". Maybe im wrong, tho, my assumption is that any hook in @INC at the end of a failed require should have been called once, including those pushed into the start of the array.

But this is part of the problem. IMO if I have an @INC with only a single callback in it, which populates @INC its first run, removing itself in the process, the stuff it pushes in should be looked up, including the first element. Similarly, if I had a hook at the end, which cleaned up non-existent paths, and maybe added newly discovered ones, id say the newly added ones should be visible before the require finishes.

I think there will be some tension here between "useful behavior" and "current implementation warts".

Frankly im getting to the point when I think exposing our directory lookups as an array was fairly unfortunate. It opens up way too many cans of worms IMO. Ties. Hooks. (That arent even evaled correctly). Blah. Its ridiculously complex.

demerphq commented 1 year ago

An example of the tied case messing up the error message. It says we checked 11, but if did an strace you would see we checked 10.

perl -MTie::Scalar -wle'use strict; my $counter= 10; sub R::FETCH { warn "in R::FETCH"; return $counter++ }; @R::ISA=qw(Tie::StdScalar); tie $INC[-1], "R"; eval "require Frobnitz" or print $@;'
in R::FETCH at -e line 1.
in R::FETCH at -e line 1.
Can't locate Frobnitz.pm in @INC (you may need to install the Frobnitz module) (@INC contains: /home/yorton/perl5/perlbrew/perls/latest_blead/lib/site_perl/5.37.7/x86_64-linux-thread-multi /home/yorton/perl5/perlbrew/perls/latest_blead/lib/site_perl/5.37.7 /home/yorton/perl5/perlbrew/perls/latest_blead/lib/5.37.7/x86_64-linux-thread-multi 11) at (eval 5) line 1.
demerphq commented 1 year ago

Sigh. So far @haarg the code you pointed at defeated me. @ap has come up with a devilishly clever solution there, and I don't see how i can do what that code does with the semantics I wished were in place. My fallback is to provide a way to workaround the old malbehavior with a new var. So when an @INC hook is called the var $INC is set to its index. When the hook returns the value of the iterator is set to its value. Setting it to undef or -1 would trigger a new pass over the @INC array contents and setting it to X-1 would cause it to process item X next (or exit if X was larger than the array). That way we can use an @INC hook to replumb @INC if we wish (which I think you and I agree is not an unreasonable thing to want to do) without breaking existing code. We would still ensure that any given in the hook in the array is called only once during traversing the list.

I was also thinking of supporting a new API for a blessed hook, so that the method INCDIR (as opposed to INC) would return a directory name to search, or undef which would be ignored, so that you could have a hook that can decide if it wants to return something (like "."). (I want to make the code ignore undef in the @INC array anyway.) It is possible such a hook could replace the logic in base.pm, but im not certain and it wouldnt be backwards compatible.

ap commented 1 year ago

For base.pm both the $INC variable and the INCDIR method would allow making the second hook 100% reliable by making it a stand-in for the dot, rather than the current situation where it exists in @INC alongside the dot. But base.pm is somewhat of a unique case as far as @INC hooks because of the requirement to only hide the dot from the top-level require run by base.pm itself, not from nested requires run by the loaded code. So the hooks it places in @INC are not meant to stay there and provide some functionality on an ongoing basis, they want to disappear as soon as possible – and between those two options, $INC would be more useful for that than INCDIR: the second hook would poke an actual dot into @INC at $INC to replace itself and then increment $INC before returning.

demerphq commented 1 year ago

@ap - I was planning on doing both (as in a have a patch working already, although not really release worthy just yet). I see a lot of utility in a smart INCDIR callback. And I see a lot of utility in having some control over the order things get searched. If you can use either to make base more robust then I think that bodes well for the patch.

My I congratulate you again, the logic in base.pm is devilishly clever. Bravo!

demerphq commented 1 year ago

I have pushed https://github.com/Perl/perl5/pull/20547 to address the issues in this ticket.

ap commented 1 year ago

Thank you, thank you. Mind you it took months of staring at each iteration of the code and thinking through all the edge cases and reasoning out the exact requirements of base.pm to arrive at this devilry. Just writing it up in the comment took days of iterating. Happy with the result though.

Case in point re trickiness of working out all the consequences:

making the second hook 100% reliable by making it a stand-in for the dot, rather than the current situation where it exists in @INC alongside the dot

Nope, because other code looking for the dot to remove it will not know that this hook is a concealed dot. The literal dot must be left in @INC to avoid breaking other @INC-manipulating code.

The next best thing I can think of (but this is off the top of my head again) is using $INC in the hook to check whether the next element is a dot before causing it to be skipped. Disturbing the arrangement would then deactivate the hook instead of leaving it to quietly skip the wrong thing.

demerphq commented 1 year ago

Nope, because other code looking for the dot to remove it will not know that this hook is a concealed dot.

@ap: Couldn't it use the object form and test on that with isa()? I was thinking something like:

sub HiddenDot::INCDIR { ${$_[0]}++ ? "." : () } # uses blessed scalar ref as object
sub HiddenDot::new { my ($class,$value)= @_; return bless \$value, $class }

# and in base.pm:
if (@INC and $INC[-1] eq ".") {
  $INC[-1] = HiddenDot->new();
} else if (@INC and UNIVERSAL::ISA($INC[-1],"HiddenDot")) {
  ${$INC[-1]} = 0;
}  
ap commented 1 year ago

Sure, but the point is to cooperate with code that isn’t aware of base.pm’s trickery. Your suggestion works, yes, but it requires other code to add support for this specific base.pm @INC hooking magic, and that is what I’m trying to avoid.

It would help somewhat if HiddenDot could overload its stringification to look like a dot without confusing require about its INCDIR nature, but this won’t help other code which also checks ref-ness. Frankly I’m not sure if it solves enough of the issue to even bother.

If compatibility is the goal (which it is for me) then having the dot be an actual plain string is the ticket. Having $INC available would still help make that more reliable, and that may be the exact level of reliability necessary.

demerphq commented 1 year ago

@ap - i implemented https://github.com/Perl/perl5/commit/0ac2ee3ae306d844b7c5bf9389d70817be94c426 just to prove to myself it could be done. It is not part of my main @INC hardening logic as the existing logic still works untouched. The main advantage as far as I can tell is my way would work if something sorted @INC, or something like that, as part of a hook, whereas the existing code would not. The disadvantage as you said is that other code might not know that this ref is special. I can argue it both ways, at least the current implementation is the well known.

I also added a follow up commit in that branch to support overload. https://github.com/Perl/perl5/compare/blead...yves/base_pm_with_INCDIR

All proof of concept, I dont expect we would want to use it, I think the existing code is good enough, and its better the devil you know than a new one you don't.

ap commented 1 year ago

I have to say… just the sheer amount of additional code needed puts me off.

Moreover, I didn’t think of using $INC to improve the stack depth probe, but I very much like how it allows it to clean itself up without involving the scope guard at all. Nice!

And now that I’ve seen that, it also occurs to me that the dot-hiding hook would benefit in similar fashion (untested):

sub {
  splice @INC, $INC, 1;
  if ( $INC[$INC] eq '.' and not defined caller $lvl ) {
    ++$dot_hidden; ();
  } else {
    --$INC; ();
  }
};

The scope guard is still needed, but is reduced to handling the case of the dot-hiding hook never being reached. This implements the same algorithm – except for fixing an unlikely but important failure mode – but makes it far easier to understand what is going on.

I like what $INC does for this code a great deal.

Certainly INCDIR will be useful for some hooks that intend to stay in @INC, but transient hooks almost by definition will benefit more from $INC.

demerphq commented 1 year ago

I have to say… just the sheer amount of additional code needed puts me off.

Yeah, me too. Although a bit of it is existing code, and I didn't try that hard to minimize it more (a couple of bits could be reduced). I think it could be boiled down a bit further.

I like what $INC does for this code a great deal.

Yeah, it certainly makes some things easier doesn't it. :-)

Certainly INCDIR will be useful for some hooks that intend to stay in @INC, but transient hooks almost by definition will benefit more from $INC.

Yeah, I agree. I see INCDIR being useful in a totally different way than $INC. I think INCDIR could be quite useful in a case where there are a large number of paths and there is a manifest where the modules in those paths can be read from. I just wanted to prove this /could/ be done like this, not that we should use it like this. I think your original solution (maybe with a few tweaks using $INC) is plenty good enough for what it does. The POC does satisfy me that the INCDIR framework is useful and interesting things can be built with it tho, so that is good.