s373nZ commented 6 days ago

Fix BitcoinD FD leak in pyln-testing

Description

This PR adds a post-yield resolver sanity-check to the bitcoind test fixture to be sure file descriptors are not leaking.

Related Issues

Closes #7130 (if it works)
I made this PR with the faint hope that it might be related to the build failures in #7310 as well. Need to trigger CI to see...

Changes Made

[ ] Feature: Brief description of the new feature or functionality added.
[x] Bug Fix: Brief description of the bug fixed and how it was resolved.
[ ] Refactor: Any code improvements or refactoring done without changing the functionality.

Checklist

Ensure the following tasks are completed before submitting the PR:

[x] Changelog has been added in relevant commit/s.
[ ] Tests have been added or updated to cover the changes.
[ ] Documentation has been updated as needed.
[ ] Any relevant comments or TODOs have been addressed or removed.

Additional Notes

Please let me know if there is a more elegant way to do this, or more appropriate place to put it.

s373nZ commented 6 days ago

Refactored the file closing operations into a function on TailableProc and call it from the bitcoind fixture. Plus, another successful CI run would be nice...

s373nZ commented 3 days ago

Rebased and removed modifications to CHANGELOG for Changelog-Fixed: hint in commit message.

ShahanaFarooqui commented 3 days ago

@s373nZ Sorry if I overlooked it while jumping between 7669, 7108, 7130, and 7310 :). But could you please remind why the cleanup process is not added to the finalizer or the stop function?

I am testing it with:

1st terminal: make pytest
2nd terminal: Get the process id with ps aux | grep pytest
3rd terminal: lsof -p ${PID} | wc -l

And it seems to be working (tests are still running at 32% :-|).

s373nZ commented 2 days ago

@ShahanaFarooqui That's a lot of jumping (and a lot of 7s :). I should have done a better job at consolidating this information into the PR description.

Context

For some context, I ran into this file descriptor issue a few months ago when working on a PR and attempting to run the make pytest locally. The leak produced seemingly random test failures that weren't alleviated until I lifted the ulimit. New to CLN development, I was troubleshooting the issue then under the assumption that the LightningD subprocess was responsible, and tried a wide variety of fixes. Returning to the issue recently in hopes that a fix might stabilize the CI tests somewhat, fresh eyes revealed it is actually the BitcoinD process.

To address your questions directly:

could you please remind why the cleanup process is not added to the https://github.com/ElementsProject/lightning/issues/7130#issuecomment-1983465331

My reading of the Pytest documentation suggested by @cdecker in that comment, specifically the preceding section titled Teardown/Cleanup (AKA Fixture finalization), suggests that yield fixtures are recommended and Any teardown code for that fixture is placed after the yield. The bitcoind fixture is already implemented as a yield fixture, so closing the files as a cleanup operation performed after the yield statement in the current implementation aligns with best practice recommendations as I understand it.

The following section in the docs, Adding finalizers directly reads to apply typically to non-yield fixtures. Seeing as this is actually an issue with every test case which leverages the bitcoind fixture (each leaves four fds open in a (deleted) state throughout the entire pytest suite run), I understand the file closing operations to be necessarily bundled with the yield fixture. I think at the time, @cdecker and I thought that the leak might have affected a subset of test cases and were exploring addfinalizer() as a targeted solution.

In short, the implementation in this PR is indeed using finalizers, but in accordance with yield fixtures rather than addfinalizer.

...or the stop function?

This is a good question, and one I tried to implement initially. TailableProc is inherited by both BitcoinD and LightningD. There are a number of existing tests which stop() and start() the Lightning nodes, and continue to make use of the files throughout the tests to validate log messages. If we close these files when the daemon is stopped, these tests break. Additionally, there are a number of global "teardown checks" in the node_factory fixture itself which depend on the log files to remain open. After trying a wide variety of implementations using stop() and even moving the file open() calls out of the initializer into start(), it became apparent that this approach was too invasive with the current test suite.

Now that it's discovered BitcoinD is the leak culprit, there is still a small mystery as to why LightningD isn't leaking file descriptors in the same manner. My best understanding at the moment is that its daemon is started in a subprocess using Popen() here, and according to the docs:

If close_fds is true, all file descriptors except 0, 1 and 2 will be closed before the child process is executed. Otherwise when close_fds is false, file descriptors obey their inheritable flag as described in Inheritance of File Descriptors.

Changed in version 3.2: The default for close_fds was changed from False to what is described above.

So, I believe the subprocess is taking care of closing LightningD's file descriptors by default, but BitcoinD doesn't get the same treatment.

Alternative implementations

While closing the files in stop() doesn't seem to be the right solution, there may be other options. Another idea was to implement it in the TailableProc destructor __del__(), but I'm pretty sure I tried that last March and didn't get it working.
Also, I considered the name cleanup() for the callback function in TailableProc, which is a more general name. However, that function would be implicitly overridden by LightningD here when called by LightningNode here. This could probably work, too, but it was preferable to avoid unexpected behavior by injecting assumptions into the existing inheritance hierarchy.

Hope this helps! I haven't mastered Pytest, Pyln, or the CLN test suite by any means, so please let me know if you have recommendations for a better fit solution.

s373nZ commented 2 days ago

Also, I have one tangential question, if you would. When referencing issues in the git commit messages, I've been using the Github issue number that describes the problem both in the first line of the commit message and in the Changelog-*: value. Should the number reference the PR instead? If so, let me know and I'll amend my commit messages.

ShahanaFarooqui commented 2 days ago

Also, I have one tangential question, if you would. When referencing issues in the git commit messages, I've been using the Github issue number that describes the problem both in the first line of the commit message and in the Changelog-*: value. Should the number reference the PR instead? If so, let me know and I'll amend my commit messages.

Typically, including the issue number in commit messages is useful for tracking purposes. It is recommended to reference issues using related, fixes, or closes in the PR description as well.

CHANGELOG list update:

Before each release, the changelog.py script generates entries based on commits merged after the last tag.

For instance, the script will produce below entry for your commit message, "Changelog-Fixed: pyln-testing: Fix file descriptor leak in bitcoind fixture.":

...
### Fixed
 - pyln-testing: Fix file descriptor leak in bitcoind fixture. ([#7669])

... 
[#7669]: https://github.com/ElementsProject/lightning/pull/7669

This way, your changes are automatically included in the appropriate section of the changelog.