bazelbuild / rules_python

Bazel Python Rules
https://rules-python.readthedocs.io
Apache License 2.0
534 stars 541 forks source link

[FR] simplify gazelle_python.yaml #1890

Closed hunshcn closed 3 months ago

hunshcn commented 6 months ago

🚀 feature request

Relevant Rules

gazelle_python.yaml

Description

A clear and concise description of the problem or missing capability... `gazelle_python.yaml` will list all libs on packages ``` matplotlib.backend_tools: matplotlib matplotlib.backends: matplotlib matplotlib.backends.backend_agg: matplotlib matplotlib.backends.backend_cairo: matplotlib matplotlib.backends.backend_gtk3: matplotlib matplotlib.backends.backend_gtk3agg: matplotlib matplotlib.backends.backend_gtk3cairo: matplotlib matplotlib.backends.backend_gtk4: matplotlib matplotlib.backends.backend_gtk4agg: matplotlib matplotlib.backends.backend_gtk4cairo: matplotlib matplotlib.backends.backend_macosx: matplotlib matplotlib.backends.backend_mixed: matplotlib matplotlib.backends.backend_nbagg: matplotlib matplotlib.backends.backend_pdf: matplotlib matplotlib.backends.backend_pgf: matplotlib ... ``` I have a three lines only requirements.in ``` numpy requests seaborn ``` 20 packages requirements.txt but received a 110KB gazelle_python.yaml which has 2517 keys at `.manifest.modules_mapping` It is too big. ### Describe the solution you'd like If you have a solution in mind, please describe it. Can we simplify it and make `gazelle_python.yaml` have only top level package name? In my shallow cognition, such a long imports seems meaningless. I'd like to support it if I can. ### Describe alternatives you've considered Have you considered any alternative solutions or workarounds? no
aignas commented 6 months ago

I am not sure if the complexity is worth it as the package imports may overlap at the top level. It could be possible to improve it, but I am not sure how realistic it is.

Let's imagine a case where you write code

import foo.bar

and the foo.bar import is only provided by foo-bar-plugin python package, but what you have in the gazelle_python.yaml is data that tells gazelle that foo maps to foo python package, so gazelle will be happy and will add the wrong dependency instead of erroring here.

I think this is related to #1877, but might not be.

If somebody was to implement it, I think it should:

hunshcn commented 6 months ago

One possibility is that we only simplify the generated yaml. During the generation, we find the top bifurcation and then cut the branches. And this is well compatible.

for example

    foo: foo
    foo.bar : for-bar
    foo.bar.abc: for-bar
    foo.bar.xyz: for-bar
    foo.bar.xyz.sub: for-bar

to

    foo: foo
    foo.bar : for-bar
aignas commented 6 months ago

One extra idea how we could simplify the gazelle_python.yaml file by a lot is to only include the dependencies that the project is referencing directly. For example, if my project has a pyproject.toml file and I only specify airflow as the dependency in the project, gazelle_python.yaml will be enormous because of all of the transitive airflow dependencies that rules_python knows about. However, the only things that we should actually include in the gazelle_python.yaml are the things that the requirements.in or pyproject.toml mentions.

Such thing could be achieved in two ways:

  1. Modify the pip extension to return a list of whls that are the direct references of the project. This would mean that the pip.parse should accept the requirements.in or the pyproject.toml and use that to generate a list of whl targets that are the direct references. That way the modules_mapping file will only create gazelle_python.yaml entries that are the direct references.
  2. Modify the gazelle tooling to accept a pyproject.toml and/or requirements.in file in order to filter out the unnecessary wheels.

I personally think that 1. might be the better solution, but a little bit more involved, because we could also set different visibility values in the hub repo for transitive and non-transitive dependencies of the project to enforce that all of the direct dependencies of the project are in the requirements.in and/or pyproject.toml. However, this is something that may be a matter of taste and I am not sure if this is a good thing to enforce on everyone.

The second approach would be more doable as golang could be definitely taught how to parse pyproject.toml and that would be something that the user can opt into.

That way we may be able to cut the file size by a lot without doing any more advanced techniques like what was suggested in the comment above.

hunshcn commented 5 months ago

One extra idea how we could simplify the gazelle_python.yaml file by a lot is to only include the dependencies that the project is referencing directly. For example, if my project has a pyproject.toml file and I only specify airflow as the dependency in the project, gazelle_python.yaml will be enormous because of all of the transitive airflow dependencies that rules_python knows about. However, the only things that we should actually include in the gazelle_python.yaml are the things that the requirements.in or pyproject.toml mentions.我们如何大幅简化 gazelle_python.yaml 文件的一个额外想法是仅包含项目直接引用的依赖项。例如,如果我的项目有一个 pyproject.toml 文件,并且我只指定 airflow 作为项目中的依赖项,则 gazelle_python.yaml 将非常庞大,因为所有传递性 airflow rules_python 知道的依赖项。但是,我们实际上应该在 gazelle_python.yaml 中包含的唯一内容是 requirements.inpyproject.toml 提到的内容。

Such thing could be achieved in two ways:这样的事情可以通过两种方式实现:

  1. Modify the pip extension to return a list of whls that are the direct references of the project. This would mean that the pip.parse should accept the requirements.in or the pyproject.toml and use that to generate a list of whl targets that are the direct references. That way the modules_mapping file will only create gazelle_python.yaml entries that are the direct references.修改 pip 扩展以返回作为项目直接引用的whls列表。这意味着 pip.parse 应该接受 requirements.inpyproject.toml 并使用它来生成 whl 目标列表,这些目标是直接参考。这样, modules_mapping 文件将仅创建作为直接引用的 gazelle_python.yaml 条目。
  2. Modify the gazelle tooling to accept a pyproject.toml and/or requirements.in file in order to filter out the unnecessary wheels.修改瞪羚工具以接受 pyproject.toml 和/或 requirements.in 文件,以过滤掉不必要的轮子。

I personally think that 1. might be the better solution, but a little bit more involved, because we could also set different visibility values in the hub repo for transitive and non-transitive dependencies of the project to enforce that all of the direct dependencies of the project are in the requirements.in and/or pyproject.toml. However, this is something that may be a matter of taste and I am not sure if this is a good thing to enforce on everyone.我个人认为 1. 可能是更好的解决方案,但涉及更多一点,因为我们还可以在 hub 存储库中为传递和非传递依赖项设置不同的 visibility 值项目的所有直接依赖项都位于 requirements.in 和/或 pyproject.toml 中。然而,这可能是一个品味问题,我不确定这对每个人来说是否是一件好事。

The second approach would be more doable as golang could be definitely taught how to parse pyproject.toml and that would be something that the user can opt into.第二种方法更可行,因为 golang 绝对可以被教导如何解析 pyproject.toml 并且用户可以选择这样做。

That way we may be able to cut the file size by a lot without doing any more advanced techniques like what was suggested in the comment above.这样我们就可以大幅削减文件大小,而无需使用任何更高级的技术,例如上面评论中建议的技术。

Maybe I missed something.

Whether it is 1 or 2, users need to ensure that all direct dependencies need to live in requirements.in, otherwise we will not be able to resolve this dependency. This requirement is reasonable, but it is difficult to achieve (most projects should not be able to do it because python does not have the same visibility restrictions as bzlmod).

hunshcn commented 5 months ago

WDYT @aignas

aignas commented 5 months ago

I think my previous suggestion about requirements.in/pyproject.toml may have been indeed not worth the effort.

I personally do not mind large files since they are machine generated and consumed and you can add the gazelle_python.yaml to .gitattributes for GH (or other git UI providers) to consider that file machine-generated and hide it from PRs.

If you really want to try out the option of reducing the size of it, please keep in mind #1971 as having more assumptions/business logic in this file could hinder the solution of that issue.

aignas commented 4 months ago

I have thought a little more about this and it may be the proposed would work well in cases to reduce merge conflicts. The smaller and simpler the file, the better.

@hunshcn, let me know if you have still time and willingness to add a new manifest format that people could choose. And if you decide to create a PR, feel free to add me as a reviewer.

hunshcn commented 4 months ago

This is good news. I will do this work later.