Instagram / LibCST

A concrete syntax tree parser and serializer library for Python that preserves many aspects of Python's abstract syntax tree
https://libcst.readthedocs.io/
Other
1.57k stars 192 forks source link

FullyQualifiedNameProvider: Optionally consider pyproject.toml files when determining a file's module name and package #1148

Closed camillol closed 5 months ago

camillol commented 6 months ago

When determining the module name and package for a source file, LibCST currently takes the file name relative to the repository root. With this change, it will instead look for the closest directory containing a pyproject.toml file in the file's ancestors, and consider the file name relative to that directory.

This gives correct module and package names when processing a repository containing multiple packages. For files that are not covered by a pyproject.toml, the search ends with the given repository root, thus retaining the old behavior.

camillol commented 6 months ago

@zsol thanks for taking a look! For context, my use case is a single repository containing many interdependent packages. Because of the interdependencies, it needs to be analyzed as a whole (e.g. I need information on the entire class hierarchy, which spans multiple packages).

AFAICT, the main user of calculate_module_and_package (and the one I'm interested in) is FullyQualifiedNameProvider. It works with FullRepoManager, and the gen_cache method receives only the repo root path and a list of file paths to process.

So, it's not feasible for me to have a separate repo root per package, and I don't see a way to feed the package information into FQNP.gen_cache given the current signature.

AFAICT, gen_cache is meant to do I/O, and does I/O in other providers (e.g. TypeInferenceProvider). I assume the requirement for calculate_module_and_package not to do I/O is for testing purposes. I could make a variant of this function and have FQNP call it, although it could cause problems down the road if something else decides to call calculate_module_and_package and gets inconsistent results with FQNP.

So I think the best approach would be to have a global switch. Or perhaps an option on the FullRepoManager, but in that case I'd still have to change the provider interface a bit to be able to wire it through.

Regarding the default behavior, I'd argue that the existing behavior of inferring the package from the file path relative to the repo root is also a heuristic, and for a repo that contains pyproject.toml files it is less likely to give the intended results than what I am proposing. But I'm fine with pyproject.toml parsing not being the default.

zsol commented 6 months ago

AFAICT, gen_cache is meant to do I/O, and does I/O in other providers (e.g. TypeInferenceProvider).

That's right. I propose to do the pyproject.toml finding in FQNP's gen_cache, and then pass in a root_path to calculate_module_and_package based on that.

My main concern is that calculate_module_and_package is not just an internal implementation detail, it's exposed in libcst.helpers and other code (outside of libcst) might be using it.

Regarding the default behavior, I'd argue that the existing behavior of inferring the package from the file path relative to the repo root is also a heuristic, and for a repo that contains pyproject.toml files it is less likely to give the intended results than what I am proposing. But I'm fine with pyproject.toml parsing not being the default.

Agreed that the existing behavior is also just a heuristic, but I'm not convinced that the pyproject.toml-based heuristic is strictly better (or worse, it's just different), and so I don't think it warrants a backwards incompatible change. Let's make it configurable.

camillol commented 6 months ago

Sounds good. I don't see an existing API for this kind of global configuration. Is it ok if I keep it simple and just make it a global variable?

zsol commented 6 months ago

I don't see an existing API for this kind of global configuration.

Ugh, that's annoying :( Ideally this would be stored in the libcst config file (that's read here) and then passed into FullRepoManager here to be passed into gen_cache.

Maybe we should change the API for gen_cache to accept a configuration object instead of just root_path and timeout. Would you be up for that?

camillol commented 6 months ago

Oh, I'm not using the CLI at all. I'm using LibCST as a library from another program, so that config file doesn't apply.

gen_cache already has an optional argument that most implementations ignore (timeout). I could generalize that as kwargs, and add a keyword argument use_pyproject, provided by FullRepoManager.

zsol commented 6 months ago

Sounds good, then!

codecov[bot] commented 5 months ago

Codecov Report

Attention: Patch coverage is 93.33333% with 3 lines in your changes missing coverage. Please review.

Project coverage is 91.15%. Comparing base (9f54920) to head (452868c). Report is 17 commits behind head on main.

Files Patch % Lines
libcst/helpers/module.py 92.30% 1 Missing :warning:
libcst/metadata/base_provider.py 75.00% 1 Missing :warning:
libcst/metadata/tests/test_metadata_wrapper.py 80.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1148 +/- ## ========================================== + Coverage 91.12% 91.15% +0.03% ========================================== Files 256 261 +5 Lines 26567 26864 +297 ========================================== + Hits 24210 24489 +279 - Misses 2357 2375 +18 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

camillol commented 5 months ago

@zsol I think it's done, PTAL!