Closed eloquence closed 3 years ago
Latest failing build log is here: https://readthedocs.org/api/v2/build/14089116.txt
It seems to me that this part of the log may be relevant:
2021-06-24 17:24:42 (19.3 MB/s) - ‘git-lfs-linux-amd64-v2.12.0.tar.gz’ saved [4675677/4675677]
README.md
CHANGELOG.md
git-lfs
install.sh
Updated git hooks.
Git LFS initialized.
fetch: Fetching reference HEAD
Error updating the git index:
git-lfs filter-process: 1: git-lfs filter-process: git-lfs: not found
error: packet write with format failed
error: Could not write client identification
error: initialization for subprocess 'git-lfs filter-process' failed
fatal: docs/diagrams/SecureDrop-0.3-DFD.png: clean filter 'lfs' failed
Checking out LFS objects: 100% (228/228), 20 MB | 0 B/s, done.
Errors logged to /home/docs/checkouts/readthedocs.org/user_builds/securedrop-docs/checkouts/latest/.git/lfs/logs/20210624T172446.320040923.log
Use `git lfs logs last` to view the log.
Note: This may also be a red herring because the checkout of the objects themselves seems to have succeeded: Checking out LFS objects: 100% (228/228), 20 MB | 0 B/s, done.
On that, see also further down:
[...]
copying images... [ 97%] images/tails_update_notification.png
copying images... [ 98%] upgrade/../images/securedrop-updater.png
copying images... [ 98%] images/yubikey_overview.png
copying images... [ 99%] images/yubikey_oath_hotp_configuration.png
copying images... [100%] images/yubikey_configuration_successful.png
copying downloadable files... [ 10%] images/labels/admin_workstation.png
copying downloadable files... [ 20%] images/labels/journalist_workstation.png
copying downloadable files... [ 30%] images/labels/secure_viewing_station_offline_warning.png
copying downloadable files... [ 40%] images/labels/firewall.png
copying downloadable files... [ 50%] images/labels/app_server.png
copying downloadable files... [ 60%] images/labels/mon_server.png
copying downloadable files... [ 70%] images/labels/usb_admin.png
copying downloadable files... [ 80%] images/labels/usb_journalist.png
copying downloadable files... [ 90%] images/labels/usb_svs.png
copying downloadable files... [100%] images/labels/usb_file_transfer.png
copying static files... done
copying extra files... done
dumping search index in English (code: en) ... done
dumping object inventory... done
build succeeded, 1 warning.
The HTML pages are in _build/html.
As far as I can tell, the warning count refers to: /home/docs/checkouts/readthedocs.org/user_builds/securedrop-docs/checkouts/latest/docs/kernel_troubleshooting.rst:110: WARNING: Cannot analyze code. No Pygments lexer found for "none".
, which seems unrelated.
@gonzalo-bulnes Pretty sure it's the LFS setup failing; note how images like https://docs.securedrop.org/en/stable/_images/journalist-delete_source_account.png resolve to LFS pointers:
$ curl https://docs.securedrop.org/en/stable/_images/journalist-delete_source_account.png
version https://git-lfs.github.com/spec/v1
oid sha256:b8dd91d867a9886050dd986516d3eeff8131697623cb92551132fbbc2b7ab119
size 127752
This from the build log
git-lfs filter-process: 1: git-lfs filter-process: git-lfs: not found
suggests to me that the installation of git-lfs
is failing. Note that RTD does not support LFS out of the box; this is bolted on in our conf.py
here:
Restarting the build makes no difference; this is a persistent failure that started about 2 days ago. I don't see any changes in this repo that are clearly correlated. It's possible that RTD's build environment has changed. I'll see if I can reproduce the issue in a fork as a first step.
Tested a bit more. Findings so far:
git-lfs
version to latest - no difference.os.system
calls for the LFS setup to the recommended subprocess.run
- no difference.SecureDrop-0.3-DFD.png
(which sometimes is included in the error output) - no difference.I did get exactly one successful build with images in my fork, just after making the change to subprocess.run
above (but not repeatably thereafter). The build log from that build is here:
https://readthedocs.org/api/v2/build/14092907.txt
Note how it does not include the fatal: The remote end hung up unexpectedly
error.
One additional mystery: I am seeing both the git-lfs filter-process: 1: git-lfs filter-process: git-lfs: not found
and fatal: The remote end hung up unexpectedly
errors in build logs from months ago, when builds were passing and images were definitely working. Not sure if we're dealing with breakage masked by caching, or if there's something entirely different going on.
One thing I didn't get to try yet is to replace the direct git-lfs
invocation with use of the Python git_lfs
module. Here's an example of that: https://scm.cms.hu-berlin.de/jakimowb/eo-time-series-viewer/-/blob/f4d90e73765e77808f5d5df831a4d24b8c1dd4e1/doc/source/conf.py
Wow, that's a super clear recap! I'll try to reproduce it tomorrow as well, and take a look at the Python git_lfs
module use.
[...] the checkout of the objects themselves seems to have succeeded:
Checking out LFS objects: 100% (228/228), 20 MB | 0 B/s, done.
One additional mystery: I am seeing both the git-lfs filter-process: 1:
git-lfs filter-process: git-lfs: not found and fatal: The remote end hung up unexpectedly
errors in build logs from months ago, when builds were passing and images were definitely working.
Disclaimer: I haven't looked. Are the builds actually failing (being marked as failed)? Or might there be an issue with serving the images vs retrieving them / putting them in the right place? I guess what I'm asking is if the git-lfs errors could be non-fatal and something else started failing downstream?
Somewhere in the build process LFS content is being reverted to pointers, even if we do get images reified in the working tree at the beginning of the build, via git lfs pull
(or fetch
and checkout
) in conf.py
.
I tried using the RTD configuration file to install git-lfs
in the container with apt
, to ensure it was available throughout the process, and could verify in conf.py
that LFS was being installed and content pulled correctly, but could never get any build products to contain the actual images.
I also used the configuration file to ensure RTD was using our requirements file. That made no difference.
None of the Sphinx or RTD code I've looked at so far seems to be doing anything more than simple copies of the images, so I don't know what's happening to them.
Filed upstream issue (though LFS is not officially supported, perhaps they can give some pointers): https://github.com/readthedocs/readthedocs.org/issues/8288
:wave: @eloquence @rmol I'm catching up with the above and it is not immediately clear to me how I could help troubleshooting further before the US weekend. Meanwhile, I was thinking about the impact of this image outage and thinking that #70 (or a subset of it) could be relevant to mitigating some of that impact.
Concretely, I could go through the docs today and open a PR to improve the sections that suffer the most when images are missing if that seems useful.
To illustrate what I'm thinking: a couple of days ago, I noticed that the section Set Up the Network Firewall relies on two screenshots to describe the firewall rules. There is a description of abstract rules further down, but arguably those could be made more visible, and there could also be more precise content complementing or replacing the missing image. (One of the alt
texts is "3 NIC Firewall LAN rules" and constitutes the entirety of the LAN interface subsection). That's an example that's fresh in my mind, but I assume I'd probably find other if performing a systematic review. Any thoughts?
Hello, @gonzalo-bulnes! Thanks for offering to help out. You're right, troubleshooting this problem without direct commit access is likely to be a frustrating problem. =/ I took a brief look just now, and I'm sorry to say I don't have anything to add to @rmol's analysis earlier today. Here's the latest info:
As @rmol observed, the images are indeed valid PNGs, but then they're rather bizarrely reverted to pointers again at some later step in the build prep. Perhaps inspecting the sphinx build logic in more detail would let us inject steps in various parts of the run. It doesn't appear that RTD will let us run any cleanup tasks after the sphinx build command has run.
So, unfortunately, it appears the images in the docs will remain broken over the weekend, but will pick this back up on Monday and try again.
:ok_hand: I'll take a closer look at what the docs look like without images, and see what I can propose to make that better. If it ends up there is some immediate value to it, that will be an available option, and if not it will anyway be a step towards making the docs more resilient in the future. (I like to think of accessibility in broad terms as a problem of content availability in varied circumstances.)
The plan of record is to investigate this a bit further today/tomorrow (upstream has also weighed in), and to potentially disable LFS on this repo if we can't resolve.
Longer term, we are considering self-hosting as an option, which would give us more control over the build architecture (this is no criticism of RTD, which is a great service, but SecureDrop may have outgrown it a while ago).
Two cents comment: I notice that some images in the docs could easily be optimized significantly. (Proof of concept, running /docs/_build/html/_images/yubikey_configuration_successful.png
through TinyPNG (online service) yields 70% gains, from 103.5kB to 31.3kB.)
If disabling LFS, it may be worth adding a step (with an automated check or not) to optimize the images. (Since images would be optimized by the time they are integrated into main
, I believe an automated step at CI time could be enough to keep the Git repo small if everyone works from forks and makes sure their build is green before merging?)
Appears LFS-related, see journalist guide for example:
https://docs.securedrop.org/en/stable/journalist.html
Example image URL: https://docs.securedrop.org/en/stable/_images/journalist-delete_submissions.png
The actual file contents are an LFS pointer, so something appears to be going wrong with the LFS setup in
conf.py