bazelbuild / rules_python

Bazel Python Rules
https://rules-python.readthedocs.io
Apache License 2.0

pypi dependency generated py_library ignores .pyc files in installed package. #2178

Open qingyouzhao opened 2 months ago

qingyouzhao commented 2 months ago

šŸž bug report

Affected Rule

https://github.com/bazelbuild/rules_python/blob/main/docs/pypi-dependencies.md Specifically, the py_library target with name = "pkg" generated by:

pip = use_extension("@rules_python//python/extensions:pip.bzl", "pip")
pip.parse(
    hub_name = "my_deps",
    python_version = "3.11",
    requirements_lock = "//:requirements_lock_3_11.txt",
)
use_repo(pip, "my_deps")

Is this a regression?

I don't know.

Description

I couldn't import a module from a pip package that ships Python modules as .pyi and .pyc files. For example, I have an example_package that ships with the following file structure:

/example_package
  __init__.py
  foo.pyc
  foo.pyi

According to https://peps.python.org/pep-3147/#flow-chart,

from example_package import foo

should succeed. But the generated py_library specifically excludes "**/*.pyc" from data. If I manually remove the "**/*.pyc" exclusion, the import statement works.

py_library(
    name = "pkg",
    srcs = glob(
        ["site-packages/**/*.py"],
        exclude=[],
        # Empty sources are allowed to support wheels that don't have any
        # pure-Python code, e.g. pymssql, which is written in Cython.
        allow_empty = True,
    ),
    data = [] + glob(
        ["site-packages/**/*"],
        exclude=["**/* *", "**/*.py", "**/*.pyc", "**/*.pyc.*", "**/*.dist-info/RECORD"],
    ),
    # This makes this directory a top-level in the python import
    # search path for anything that depends on this.
    imports = ["site-packages"],
    deps = [],
    tags = ["pypi_name=example_package", "pypi_version=3.21.0"],
    visibility = ["//visibility:public"],
)
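The PEP 3147 behavior the report relies on can be checked outside Bazel with a small sketch (the `example_package` layout mirrors the report; the `VALUE` attribute is made up for illustration):

```python
# Sketch: a package that ships only a top-level foo.pyc (no foo.py)
# is importable via CPython's sourceless loader, per PEP 3147.
import pathlib
import py_compile
import sys
import tempfile

d = tempfile.mkdtemp()
pkg = pathlib.Path(d) / "example_package"
pkg.mkdir()
(pkg / "__init__.py").write_text("")
src = pkg / "foo.py"
src.write_text("VALUE = 42\n")
# Compile directly to example_package/foo.pyc (the sourceless layout),
# then remove the .py so only the bytecode remains.
py_compile.compile(str(src), cfile=str(pkg / "foo.pyc"))
src.unlink()

sys.path.insert(0, d)
from example_package import foo
print(foo.VALUE)  # → 42
```

So as long as `foo.pyc` survives into the runfiles, the import works; the `"**/*.pyc"` exclusion in the generated `data` glob is what removes it.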


🔬 Minimal Reproduction

  1. Create a pip package following https://packaging.python.org/en/latest/, but replace foo.py with a compiled foo.pyc file.
  2. Add the pip dependency to a Bazel workspace following the guide in https://github.com/bazelbuild/rules_python/blob/main/docs/pypi-dependencies.md
  3. In a py_library or py_binary, `from example_package import foo` should work but fails with a module-not-found error.

🔥 Exception or Error

ImportError: cannot import name 'foo' from 'example_package'

šŸŒ Your Environment

Operating System:

Linux

Output of bazel version:

Bazelisk version: v1.20.0
Build label: 7.0.2
Build target: @@//src/main/java/com/google/devtools/build/lib/bazel:BazelServer
Build time: Thu Jan 25 16:13:35 2024 (1706199215)
Build timestamp: 1706199215
Build timestamp as int: 1706199215

Rules_python version:

bazel_dep(name = "rules_python", version = "0.31.0")

Anything else relevant?

rickeylev commented 1 month ago

This bug was mentioned in a recent PR, so I'll repost what I said there.

The basic gist is that pyc files have been problematic for build determinism and correctness.

Usually pyc files are created by the runtime when they are first imported. This is problematic for Bazel in two ways:

  1. They use timestamp-based pyc files. These are problematic because timestamps are non-deterministic, i.e. the pyc content changes even though the source file hasn't changed. Looking through the Python docs, I can't find any way to change the default pyc invalidation mode.
  2. Because they are created upon first import of a module, race conditions can occur, which results in errors. The errors I've seen thus far are
    • Multiple Python processes will attempt to read/write/delete the same pyc file at the same time (under the hood, the way pyc creation works is that a foo.pyc.$TIMESTAMP file is created, then atomically moved to its final location). While that is happening, another Python or Bazel process can be reading the file. On Windows, this results in an error because an open file is deleted.
    • Because the set of files is changing, glob() can see a file, the file gets deleted, and then Bazel gives an error about the file not being present.
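As an aside on point 1: for explicitly compiled pycs (as opposed to the import-time ones described above), CPython does expose the invalidation mode through the `py_compile` API. A minimal sketch, inspecting the flags word defined by PEP 552:

```python
# Sketch: explicit compilation can request hash-based pyc invalidation,
# which embeds a source hash instead of a timestamp, making the pyc
# bytes deterministic for a given source file.
import pathlib
import py_compile
import tempfile

d = pathlib.Path(tempfile.mkdtemp())
src = d / "foo.py"
src.write_text("X = 1\n")
pyc = py_compile.compile(
    str(src),
    cfile=str(d / "foo.pyc"),
    invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
)
# Bytes 4..8 of the pyc header are a little-endian flags word:
# 0 means timestamp-based, 0b11 means "hash-based, checked at import".
flags = int.from_bytes(pathlib.Path(pyc).read_bytes()[4:8], "little")
print(flags)  # → 3
```

Setting the `SOURCE_DATE_EPOCH` environment variable also flips `py_compile`/`compileall` to hash-based pycs by default; none of this changes what the interpreter writes on ordinary imports, which is the problematic case here.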

All that said, this behavior is somewhat surprising and perplexing. By default, individual files are symlinked, and they're in a sandbox, so it's not clear how pycs were getting created in the underlying repo directory. But, all the CI issues went away after ignoring pyc files.

qingyouzhao commented 1 month ago

@rickeylev Thanks for the explanation. I might not fully grasp the details but I get your point on

  1. Runtime created pyc files are problematic
  2. Ignoring pyc files solves all the CI issues

What I am curious is:

  1. Can bazel differentiate runtime created pyc files vs package installed pyc files?
  2. Can we only ignore *.pyc.* to handle foo.pyc.$TIMESTAMP cases while keeping foo.pyc included?
rickeylev commented 1 month ago

Can bazel differentiate runtime created pyc files vs package installed pyc files?

Not really. Both cases are "source" files. So to Bazel, whether it was installed from the package, created by a user, or created by some other process, they look the same.

The closest we're able to do is prevent the interpreter from being able to write the files, e.g. by making everything read-only. Windows strikes again here, though, because its security model allows an admin user to ignore such read-only attributes.

Alternatively, we could force the interpreter to use DONTWRITEBYTECODE. That might be doable (setting interpreter flags/env has been trickier than expected; this is the sort of thing we'd have to try in order to flush out edge cases).
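For reference, the flag mentioned above corresponds to CPython's `PYTHONDONTWRITEBYTECODE` environment variable (or the `-B` interpreter option); a quick sketch of its effect:

```python
# Sketch: with PYTHONDONTWRITEBYTECODE set, importing a module writes
# no __pycache__ directory, so there is nothing to race on.
import os
import pathlib
import subprocess
import sys
import tempfile

d = pathlib.Path(tempfile.mkdtemp())
(d / "mod.py").write_text("X = 1\n")
subprocess.run(
    [sys.executable, "-c", "import mod"],
    cwd=d,  # with -c, the working directory is on sys.path
    env={**os.environ, "PYTHONDONTWRITEBYTECODE": "1"},
    check=True,
)
print((d / "__pycache__").exists())  # → False
```

The open question in the comment above is how rules_python would reliably inject that setting for every interpreter invocation, not whether the flag itself works.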

Can we only ignore *.pyc.* to handle foo.pyc.$TIMESTAMP cases while keeping foo.pyc included?

Yes, but it doesn't prevent the deleting-open-file race condition I mentioned. This only affects Windows (only Windows gives an error about deleting an open file). It'd be acceptable to have Windows ignore pyc, I suppose; our Windows support is borderline and best-effort. The main thing is preventing the build errors that occur, where the only thing you can do is re-run Bazel and hope things interleave successfully.

rickeylev commented 1 month ago

Also, to clarify:

All that said, this behavior is somewhat surprising and perplexing. By default, individual files are symlinked, and they're in a sandbox, so it's not clear how pycs were getting created in the underlying repo directory.

Part of me suspects this undesirable behavior is limited to (a) just the interpreter and its stdlib, and/or (b) local, non-sandboxed execution of some sort. That's just a theory -- if it could be shown that theory is correct, maybe we could think of some more solutions.

Two more thoughts.

Precompiling is a builtin feature of the rules now. Setting precompile="enabled" on the targets makes Bazel perform its own precompilation of py sources. We should change the generated pypi code to enable precompiling; we just haven't gotten around to trying that yet.

If the pypi package is a pyc-only package (rare, but supported), then I think it's reasonable for py_library to accept .pyc files directly in srcs. A pyc-only library wouldn't have any of these issues, as there isn't any py source file to try to generate a new pyc file from.

qingyouzhao commented 1 month ago

The main thing is preventing the build errors that occur

In my use case, I am okay with the build errors because the package that I am concerned with is a pyc-only library with pyi files for tooling. I am happy with any special case solution that just works with pure pyc packages.

aignas commented 2 weeks ago

I think this issue is something that outside contributors could help out with - the whl_library_targets macro generates the py_library targets that are consumed downstream. There are a few options here:

Both solutions should first be gated by an environment variable or a feature flag so that we can roll them out slowly.