biopython / biopython

Official git repository for Biopython (originally converted from CVS)
http://biopython.org/
Other
4.38k stars 1.76k forks source link

Build path leaking in Bio.Entrez.Parser.DataHandler documentation #4839

Open emollier opened 1 month ago

emollier commented 1 month ago

Hi,

While trying to bump Biopython to version 1.84 in Debian sid, I tried to accomodate to the new way of preparing the documentation, dropping TeX in favor of Sphinx documentation in HTML. In the process, I hit a binary package reproducibility regression, which upon further analysis, points to injection of the package build path into the documentation. The issue is visible on the public documentation page, where circleci home directory appears explicitly in the class Bio.Entrez.Parser.DataHandler description:

Data handler for parsing NCBI XML from Entrez.

global_dtd_dir = '/home/circleci/.pyenv/versions/3.9.19/lib/python3.9/site-packages/Bio/Entrez/DTDs'

global_xsd_dir = '/home/circleci/.pyenv/versions/3.9.19/lib/python3.9/site-packages/Bio/Entrez/XSDs'

local_dtd_dir = '/home/circleci/.config/biopython/Bio/Entrez/DTDs'

local_xsd_dir = '/home/circleci/.config/biopython/Bio/Entrez/XSDs'

I'm afraid that my understanding of tweaking Sphinx is too weak to devise a fix that does not interfere with the proper operation of the code. It may be possible to edit the documentation after the fact, using e.g. sed, but I'm concerned about failing to catch tokens making their way in the integrated search engine (I hit a similar issue in the past while working on the parsl module). Per chance, do you see a way how to mitigate the injection of the build path in the documentation?

Package building reproducibility not being a hard blocker in Debian yet, you can treat this as a wishlist item.

Have a nice day, :) Étienne.

Setup

I am reporting a problem with Biopython version, Python version, and operating system as follows:

>>> import sys; print(sys.version)
3.12.6 (main, Sep  7 2024, 14:20:15) [GCC 14.2.0]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-6.10.9-amd64-x86_64-with-glibc2.40
>>> import Bio; print(Bio.__version__)
1.84

Expected behaviour

I would like to avoid having unreproducible build path in the documentation.

Actual behaviour

Unreproducible build path appears in Bio.Entrez.Parser.DataHandler API documentation.

Steps to reproduce

Build the documentation:

$ make -C Doc html
peterjc commented 1 month ago

Ah. This isn't part of the human-written API docstrings, but rather Sphinx automatically trying to document the public attributes of the class. That is currently defined as:


class DataHandler(metaclass=DataHandlerMeta):
    """Data handler for parsing NCBI XML from Entrez."""

    from Bio import Entrez

    global_dtd_dir = os.path.join(Entrez.__path__[0], "DTDs")
    global_xsd_dir = os.path.join(Entrez.__path__[0], "XSDs")
    local_dtd_dir = None
    local_xsd_dir = None

    del Entrez

    ...

The code is explicitly setting the global values relative to the location of the Python code (because we install some DTD and XSD files there). In this content, that picks up some temp code location while the docs are being built.

I don't have an immediate answer for this.

There may be something we can tweak in Sphinx's API documentation code. Or, as you suggest, post-processing the documentation somewhere (e.g. with Python code within the Sphinx conf.py https://github.com/biopython/biopython/blob/master/Doc/conf.py perhaps).

Or, there may be a code change worth making here which might also avoid this, @mdehoon?

mr-c commented 1 month ago

@emollier I think build path variations are no longer tested at https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/python-biopython.html

The Debian Salsa CI config is not yet updated to match that, see https://salsa.debian.org/salsa-ci-team/pipeline/-/issues/368 for a discussion

Until that is fixed, we can to add this snippet on our side to debian/salsa-ci.yml:

variables:
  SALSA_CI_REPROTEST_ARGS: --vary=-build_path
emollier commented 1 month ago

Hi @mr-c, thanks for the hints! Although reproducible-builds.org did capture the regression in the experimental upload, I agree with the general idea that keeping the build path constant is an easy wait to reproduce consecutive builds, and that such particular test is somewhat artificial.

Hi @peterjc, @mdehoon, I'm concerned that my issue will be barely actionable. As I feared, post-processing wouldn't work as there would be leftovers in the search index, see searchindex.js in the diffoscope output. Unless you would like to clean the build path from the public documentation, you may close the issue as unplanned. I will disable build path variation from Salsa CI.

Have a nice day everyone, :) Étienne.

peterjc commented 1 month ago

If there was a simple fix, I think we should apply it - but yes being pragmatic let's just close this as unplanned. Thank you!

emollier commented 1 month ago

Oh, this is interesting, while stabilizing the build path, Salsa CI still captured a reproducibility issue. Having had a closer look, the global_dtd_dir and global_xsd_dir have been stabilized, but not the local_dtd_dir and local_xsd_dir. Relevant part of the diffoscope output shows:

─ html2text {}
@@ -187,17 +187,18 @@
   class Bio.Entrez.Parser.DataHandler(validate, escape, ignore_errors)
       Bases: object
       Data handler for parsing NCBI XML from Entrez.
         global_dtd_dir = '/tmp/reprotest.oQGqTT/const_build_path/
         const_build_path/.pybuild/cpython3_3.12/build/Bio/Entrez/DTDs'
         global_xsd_dir = '/tmp/reprotest.oQGqTT/const_build_path/
         const_build_path/.pybuild/cpython3_3.12/build/Bio/Entrez/XSDs'
-        local_dtd_dir = '/nonexistent/second-build/.config/biopython/Bio/
-        Entrez/DTDs'
-        local_xsd_dir = None
+        local_dtd_dir = '/tmp/reprotest.oQGqTT/const_build_path/.config/
+        biopython/Bio/Entrez/DTDs'
+        local_xsd_dir = '/tmp/reprotest.oQGqTT/const_build_path/.config/
+        biopython/Bio/Entrez/XSDs'
         __init__(validate, escape, ignore_errors)
             Create a DataHandler object.
         read(source)
             Set up the parser and let it read the XML results.
         parse(source)
             Set up the parser and let it read the XML results.
         xmlDeclHandler(version, encoding, standalone)

/nonexistent in the path of one of the tests suggests this is the HOME environment variable which is leaking, and not the build path. I'm bugged about the variable set to None, unless something in the code caught properly that the HOME was indeed non-existent.

This particular reproducibility issue is much easier to normalize from the build environment without upstream changes, but you may be interested to know about it.

Have a nice day, :) Étienne.

peterjc commented 1 month ago

It probably is leaking $HOME here (again we don't really want these paths in the user-facing documentation).

Cross reference issue #1517 which was about exposing this property - the goal being to allow the code to default to caching under $HOME, but allow the user to override this (e.g. running on a read-only home filesystem and using /tmp instead for caching).