ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance
Other
2.81k stars 889 forks source link

pyln-testing: Fix BitcoinD FD leak #7669

Open s373nZ opened 6 days ago

s373nZ commented 6 days ago

Fix BitcoinD FD leak in pyln-testing

Description

This PR adds a post-yield resolver sanity-check to the bitcoind test fixture to be sure file descriptors are not leaking.

Related Issues

Changes Made

Checklist

Ensure the following tasks are completed before submitting the PR:

Additional Notes

Please let me know if there is a more elegant way to do this, or more appropriate place to put it.

s373nZ commented 6 days ago

Refactored the file closing operations into a function on TailableProc and call it from the bitcoind fixture. Plus, another successful CI run would be nice...

s373nZ commented 3 days ago

Rebased and removed modifications to CHANGELOG for Changelog-Fixed: hint in commit message.

ShahanaFarooqui commented 3 days ago

@s373nZ Sorry if I overlooked it while jumping between 7669, 7108, 7130, and 7310 :). But could you please remind why the cleanup process is not added to the finalizer or the stop function?

I am testing it with:

And it seems to be working (tests are still running at 32% :-|).

s373nZ commented 2 days ago

@ShahanaFarooqui That's a lot of jumping (and a lot of 7s :). I should have done a better job at consolidating this information into the PR description.

Context

For some context, I ran into this file descriptor issue a few months ago when working on a PR and attempting to run the make pytest locally. The leak produced seemingly random test failures that weren't alleviated until I lifted the ulimit. New to CLN development, I was troubleshooting the issue then under the assumption that the LightningD subprocess was responsible, and tried a wide variety of fixes. Returning to the issue recently in hopes that a fix might stabilize the CI tests somewhat, fresh eyes revealed it is actually the BitcoinD process.

To address your questions directly:

could you please remind why the cleanup process is not added to the https://github.com/ElementsProject/lightning/issues/7130#issuecomment-1983465331

My reading of the Pytest documentation suggested by @cdecker in that comment, specifically the preceding section titled Teardown/Cleanup (AKA Fixture finalization), suggests that yield fixtures are recommended and Any teardown code for that fixture is placed after the yield. The bitcoind fixture is already implemented as a yield fixture, so closing the files as a cleanup operation performed after the yield statement in the current implementation aligns with best practice recommendations as I understand it.

The following section in the docs, Adding finalizers directly reads to apply typically to non-yield fixtures. Seeing as this is actually an issue with every test case which leverages the bitcoind fixture (each leaves four fds open in a (deleted) state throughout the entire pytest suite run), I understand the file closing operations to be necessarily bundled with the yield fixture. I think at the time, @cdecker and I thought that the leak might have affected a subset of test cases and were exploring addfinalizer() as a targeted solution.

In short, the implementation in this PR is indeed using finalizers, but in accordance with yield fixtures rather than addfinalizer.

...or the stop function?

This is a good question, and one I tried to implement initially. TailableProc is inherited by both BitcoinD and LightningD. There are a number of existing tests which stop() and start() the Lightning nodes, and continue to make use of the files throughout the tests to validate log messages. If we close these files when the daemon is stopped, these tests break. Additionally, there are a number of global "teardown checks" in the node_factory fixture itself which depend on the log files to remain open. After trying a wide variety of implementations using stop() and even moving the file open() calls out of the initializer into start(), it became apparent that this approach was too invasive with the current test suite.

Now that it's discovered BitcoinD is the leak culprit, there is still a small mystery as to why LightningD isn't leaking file descriptors in the same manner. My best understanding at the moment is that its daemon is started in a subprocess using Popen() here, and according to the docs:

If close_fds is true, all file descriptors except 0, 1 and 2 will be closed before the child process is executed. Otherwise when close_fds is false, file descriptors obey their inheritable flag as described in Inheritance of File Descriptors.

Changed in version 3.2: The default for close_fds was changed from False to what is described above.

So, I believe the subprocess is taking care of closing LightningD's file descriptors by default, but BitcoinD doesn't get the same treatment.

Alternative implementations

Hope this helps! I haven't mastered Pytest, Pyln, or the CLN test suite by any means, so please let me know if you have recommendations for a better fit solution.

s373nZ commented 2 days ago

Also, I have one tangential question, if you would. When referencing issues in the git commit messages, I've been using the Github issue number that describes the problem both in the first line of the commit message and in the Changelog-*: value. Should the number reference the PR instead? If so, let me know and I'll amend my commit messages.

ShahanaFarooqui commented 2 days ago

Also, I have one tangential question, if you would. When referencing issues in the git commit messages, I've been using the Github issue number that describes the problem both in the first line of the commit message and in the Changelog-*: value. Should the number reference the PR instead? If so, let me know and I'll amend my commit messages.

Typically, including the issue number in commit messages is useful for tracking purposes. It is recommended to reference issues using related, fixes, or closes in the PR description as well.

CHANGELOG list update:

Before each release, the changelog.py script generates entries based on commits merged after the last tag.

For instance, the script will produce below entry for your commit message, "Changelog-Fixed: pyln-testing: Fix file descriptor leak in bitcoind fixture.":

...
### Fixed
 - pyln-testing: Fix file descriptor leak in bitcoind fixture. ([#7669])

... 
[#7669]: https://github.com/ElementsProject/lightning/pull/7669

This way, your changes are automatically included in the appropriate section of the changelog.

ShahanaFarooqui commented 2 days ago

ACK e35e3c540b60fab1a626ca9df1b6d07225c154c0.