dotnet / TorchSharp

A .NET library that provides access to the library that powers PyTorch.
MIT License
1.41k stars 183 forks source link

Better metrics #1391

Closed mvphelps closed 1 month ago

mvphelps commented 1 month ago

Improve visibility of disposables (Tensor and PackedSequence) to make memory leaks easier to troubleshoot. No breaking interface changes. This resolves #1390.

This PR is what I'd actually like to see for object lifetime statitics. It gives complete statistics for object creation and disposables whether in a DisposeScope or not, and accurate Attach/Detach counts. Lastly the ThreadTotalLiveCount is no longer estimated, it is exact. Also added documentation on how to use the statistics for troubleshooting memory leaks.

New Documentation: memory leak troubleshooting.md - provides practical guidance on how to use the changes of this PR for useful troubleshooting of memory issues.memory.md was also updated to link to this document.

New Interface Members

Semantic changes: None of the statistics are decremented anymore. All statistics are additive only. If an event happens, it is incremented. The original DisposeScopeManager.Statistics object now represents the total combined counts for both PackedSequence and Tensor. It also has individual attributes for .TensorStatistics and .PackedSequenceStatistics where these counts are tracked individually. A common interface ILifetimeStatistics was introduced for these. Additionally the Dispose for both Tensor and PackedSequence were both modified to prevent reentry and double incrementing disposal metrics.

Defects discovered during implementing this:

  1. TestDisposeScopes.MinimalDisposeWorks had commented assertions, and actually reveals the prior approximate ThreadTotalLiveCount was not capturing live tensors. It now captures them accurately. This also led to discovery of the next item (the reason for the commented assertions).
  2. The .ToTensor method has always leaked a tensor that is never disposed. There are two new tests to capture this behavior as it is currently, both in and out of a DisposeScope, with comments on how they need to change when this defect is fixed. These tests are in different fixtures but both named "ToTensorCreatesOrphanedTensor". I can create a new issue for this if you decide this PR is worth moving forward, or I can try to fix it here in this PR.

I understand this is a larger PR and some siginificant API changes. Given that it provided accurate counts and helped identify defects withing TorchSharp itself, the useful utility provided by these changes is demonstrated. Let me know what you like, hate, or want changed.

Note dependent on PR #1389 (although maybe not, the commit appears in this PR too so maybe we can close that other one).

NiklasGustafsson commented 1 month ago

Ready to merge?

mvphelps commented 1 month ago

I did not expect this one to be this easy. Sure thing, let's merge it! :1st_place_medal:

I'll make an issue for the orphan tensor issue I found and fix that. Then I should be done bugging you. Thank you!