Open emollier opened 1 month ago
Ah. This isn't part of the human-written API docstrings, but rather Sphinx automatically trying to document the public attributes of the class. That is currently defined as:
class DataHandler(metaclass=DataHandlerMeta):
"""Data handler for parsing NCBI XML from Entrez."""
from Bio import Entrez
global_dtd_dir = os.path.join(Entrez.__path__[0], "DTDs")
global_xsd_dir = os.path.join(Entrez.__path__[0], "XSDs")
local_dtd_dir = None
local_xsd_dir = None
del Entrez
...
The code is explicitly setting the global values relative to the location of the Python code (because we install some DTD and XSD files there). In this content, that picks up some temp code location while the docs are being built.
I don't have an immediate answer for this.
There may be something we can tweak in Sphinx's API documentation code. Or, as you suggest, post-processing the documentation somewhere (e.g. with Python code within the Sphinx conf.py
https://github.com/biopython/biopython/blob/master/Doc/conf.py perhaps).
Or, there may be a code change worth making here which might also avoid this, @mdehoon?
@emollier I think build path variations are no longer tested at https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/python-biopython.html
The Debian Salsa CI config is not yet updated to match that, see https://salsa.debian.org/salsa-ci-team/pipeline/-/issues/368 for a discussion
Until that is fixed, we can to add this snippet on our side to debian/salsa-ci.yml
:
variables:
SALSA_CI_REPROTEST_ARGS: --vary=-build_path
Hi @mr-c, thanks for the hints! Although reproducible-builds.org did capture the regression in the experimental upload, I agree with the general idea that keeping the build path constant is an easy wait to reproduce consecutive builds, and that such particular test is somewhat artificial.
Hi @peterjc, @mdehoon, I'm concerned that my issue will be barely actionable. As I feared, post-processing wouldn't work as there would be leftovers in the search index, see searchindex.js in the diffoscope output. Unless you would like to clean the build path from the public documentation, you may close the issue as unplanned. I will disable build path variation from Salsa CI.
Have a nice day everyone, :) Étienne.
If there was a simple fix, I think we should apply it - but yes being pragmatic let's just close this as unplanned. Thank you!
Oh, this is interesting, while stabilizing the build path, Salsa CI still captured a reproducibility issue. Having had a closer look, the global_dtd_dir and global_xsd_dir have been stabilized, but not the local_dtd_dir and local_xsd_dir. Relevant part of the diffoscope output shows:
─ html2text {}
@@ -187,17 +187,18 @@
class Bio.Entrez.Parser.DataHandler(validate, escape, ignore_errors)
Bases: object
Data handler for parsing NCBI XML from Entrez.
global_dtd_dir = '/tmp/reprotest.oQGqTT/const_build_path/
const_build_path/.pybuild/cpython3_3.12/build/Bio/Entrez/DTDs'
global_xsd_dir = '/tmp/reprotest.oQGqTT/const_build_path/
const_build_path/.pybuild/cpython3_3.12/build/Bio/Entrez/XSDs'
- local_dtd_dir = '/nonexistent/second-build/.config/biopython/Bio/
- Entrez/DTDs'
- local_xsd_dir = None
+ local_dtd_dir = '/tmp/reprotest.oQGqTT/const_build_path/.config/
+ biopython/Bio/Entrez/DTDs'
+ local_xsd_dir = '/tmp/reprotest.oQGqTT/const_build_path/.config/
+ biopython/Bio/Entrez/XSDs'
__init__(validate, escape, ignore_errors)
Create a DataHandler object.
read(source)
Set up the parser and let it read the XML results.
parse(source)
Set up the parser and let it read the XML results.
xmlDeclHandler(version, encoding, standalone)
/nonexistent in the path of one of the tests suggests this is the HOME environment variable which is leaking, and not the build path. I'm bugged about the variable set to None, unless something in the code caught properly that the HOME was indeed non-existent.
This particular reproducibility issue is much easier to normalize from the build environment without upstream changes, but you may be interested to know about it.
Have a nice day, :) Étienne.
It probably is leaking $HOME here (again we don't really want these paths in the user-facing documentation).
Cross reference issue #1517 which was about exposing this property - the goal being to allow the code to default to caching under $HOME, but allow the user to override this (e.g. running on a read-only home filesystem and using /tmp instead for caching).
Hi,
While trying to bump Biopython to version 1.84 in Debian sid, I tried to accomodate to the new way of preparing the documentation, dropping TeX in favor of Sphinx documentation in HTML. In the process, I hit a binary package reproducibility regression, which upon further analysis, points to injection of the package build path into the documentation. The issue is visible on the public documentation page, where circleci home directory appears explicitly in the class Bio.Entrez.Parser.DataHandler description:
I'm afraid that my understanding of tweaking Sphinx is too weak to devise a fix that does not interfere with the proper operation of the code. It may be possible to edit the documentation after the fact, using e.g. sed, but I'm concerned about failing to catch tokens making their way in the integrated search engine (I hit a similar issue in the past while working on the parsl module). Per chance, do you see a way how to mitigate the injection of the build path in the documentation?
Package building reproducibility not being a hard blocker in Debian yet, you can treat this as a wishlist item.
Have a nice day, :) Étienne.
Setup
I am reporting a problem with Biopython version, Python version, and operating system as follows:
Expected behaviour
I would like to avoid having unreproducible build path in the documentation.
Actual behaviour
Unreproducible build path appears in Bio.Entrez.Parser.DataHandler API documentation.
Steps to reproduce
Build the documentation: