Explore assumptions made when we can't find a constant

thegedge commented 4 years ago

With the constant ::Foo::Bar::Spam we look for these files on all load paths:

foo/bar/spam.rb
foo/bar.rb
foo.rb

The first of those to exist is the file that defines the constant. If it's an exact match, we can guarantee a match (in any sane, Ruby app). This may not be correct for (2) or (3) though, if both exist.

This potentially gets even more confusing when we consider things that are loaded via require in paths that reflect a similar structure as autoload paths. Unit tests, for example, can often fit the same hierarchical structure as their counterparts.

Let's look into how we can either a) not make any assumptions and avoid false positives, or b) another solution that completely avoids false positives.

thegedge commented 4 years ago

I'm going to explore emulating autoload behaviour. It will add more complexity to constant_resolver, but it will better ensure correctness.

The gist of the idea is that when we see ::Foo::Bar, we first try to autoload ::Foo. If we find a foo.rb file, parse it. If it has a local definition of Bar, we're done. Otherwise, continue looking for foo/bar.rb. If we find it, parse it for local definitions and finish. Otherwise, we assume we can't find ::Foo::Bar.

exterm commented 4 years ago

There's a performance impact to be measured here.

I could actually see how for a full scan this could be slower than https://github.com/Shopify/packwerk/issues/100, because we may re-parse commonly referenced files a few times. Now, of course, we could memoize.

But in the end I assume for a full scan we would still parse all files twice. Which means that the "load a constant to resolve it" approach could be similarly fast, but more general (also works on constants defined in non-autoloaded code) and simpler.

I guess what I'm saying is that we need a performance comparison between the two approaches.

thegedge commented 4 years ago

I guess what I'm saying is that we need a performance comparison between the two approaches.

Yep, I'm definitely interested in seeing a performance comparison here. If https://github.com/Shopify/packwerk/issues/100 is only slightly slower, we should go with it. We should perhaps establish how much slower is acceptable (on the level of single file, package, and full codebase).

thegedge commented 4 years ago

Some interesting observations from before and after:

Tests were being flagged before, but not after
Autovivified modules are no longer violations in packwerk. My preference would be that it does belong to someone, but I think this is okay?
Some new violations found, but not too many.
Some old violations (with compact form constant references) are no longer violations. This is because the parent namespace gets autoloaded, but the autoloaded namespace doesn't define the expected constant.
- The problem here is that we still assume everything is autoloaded, which is not really the case because some things have already been defined externally (e.g., through require-ing). For example, if I have Foo::Bar in the Spam namespace, Spam::Foo may have been defined in a non-autoloaded location, but we still find a foo.rb and assume that that's what's being referenced.

thegedge commented 4 years ago

For the last point above, we've technically replaced one assumption with another: if we're trying to resolve a constant <Namespace>::MyThing, and can't find MyThing anywhere in <Namespace>, we assume it's defined elsewhere.

Personally, I think this is a better assumption than assuming it's defined in <Namespace>.

thegedge commented 4 years ago

We also want to explore Module#const_source_location. If its performance is good (since we'd have to actually load and execute code), it would be far superior than what we try to emulate here in this codebase.

thegedge commented 4 years ago

I explored (roughly) some performance characteristics of a move to const_source_location here:

Warming up --------------------------------------
const_source_location
                        24.000  i/100ms
          parser_gem    32.000  i/100ms
          rubyvm_ast    39.000  i/100ms
Calculating -------------------------------------
const_source_location
                        255.728  (± 4.7%) i/s -      1.296k in   5.079376s
          parser_gem    330.604  (± 1.5%) i/s -      1.664k in   5.034414s
          rubyvm_ast    400.538  (± 2.0%) i/s -      2.028k in   5.065007s

Comparison:
          rubyvm_ast:      400.5 i/s
          parser_gem:      330.6 i/s - 1.21x  slower
const_source_location:      255.7 i/s - 1.57x  slower

So using const_source_location isn't as slow as I'd expect, but this was also looking at a very uninteresting test file. It's also hard to benchmark this on a single file, since const_source_location would generally need to autoload many other files.

I think it will be better to explore this across larger packages of code, after https://bugs.ruby-lang.org/issues/16764 lands in Ruby 2.7.2.

Benchmark source

```ruby require 'benchmark/ips' require 'parser/current' FILE = ARGV[0] def const_source_location load(FILE, true) end def walk(node) return methods unless node return methods unless node.is_a?(::RubyVM::AbstractSyntaxTree::Node) yield node node.children.each do |child| walk(child) { yield node } end end def parser_gem ast = Parser::CurrentRuby.parse_file(FILE) walk(ast) { } end def rubyvm_ast ast = RubyVM::AbstractSyntaxTree.parse_file(FILE) walk(ast) { } end Benchmark.ips do |x| x.report("const_source_location") { const_source_location } x.report("parser_gem") { parser_gem } x.report("rubyvm_ast") { rubyvm_ast } x.compare! end ```

Shopify / constant_resolver

Explore assumptions made when we can't find a constant #19