amnh / PCG

๐™‹๐™๐™ฎ๐™ก๐™ค๐™œ๐™š๐™ฃ๐™š๐™ฉ๐™ž๐™˜ ๐˜พ๐™ค๐™ข๐™ฅ๐™ค๐™ฃ๐™š๐™ฃ๐™ฉ ๐™‚๐™ง๐™–๐™ฅ๐™ โธบ Haskell program and libraries for general phylogenetic graph search
28 stars 1 forks source link

Investigate possible space leaks with the memoized TCM #81

Closed recursion-ninja closed 5 years ago

recursion-ninja commented 5 years ago

There is the possibility of a space leak somewhere in the FFI binding to the memoized TCM. The space leak could be on the C side or on the Haskell side's handling of the memoized TCM pointers/garbage collection.

The memory exploded when running "small" linguistic data sets. Should be able to replicate results by running the linguistic data set Clara and Peter were curating.

Possibly related to #79.

recursion-ninja commented 5 years ago

I seem to remember a fair amount of testing the 2D component (x -> x -> (Word, x) of the memoized TCM's FFI, but I'm not sure how much testing the 3D component (x -> x -> x -> (Word, x)) received. The 2D component is used on the postorder traversal to compute the preliminary medians. The 3D component is used on the preorder traversal to compute the final medians.

We ran the codebase on Clara & Peter's linguistic data set and the memory exploded, (over 64 GB to build a tree, not a network). I suspected that there was a space leak. Since we were using the discrete metric (non-additive, 1-1, etc) for the linguistic data set, we added a special case function that if the discrete metric was supplied, the cost and medians would be computed natively in Haskell using the very simple logic for that special case and then if a more complex metric was supplied we would use the FFI. After this change was implemented the memory during the Wagner build stayed nearly constant. We later removed calls to the FFI for larger alphabets over any metric and instead computed cost and medians natively in Haskell without any memoization. This was a temporary measure.

There is likely a space leak somewhere regarding the FFI. The space leak might involve how Haskell is handling the pointer(s) to the memoized TCM structure(s), how Haskell is serializing/deserializing across the FFI, or there could be a defect in the memoize TCM's memory management.

There aren't easy ways to test this. I'm thinking of setting up a minimal working example Haskell utility program that takes a random seed, an input alphabet length, and a number of dynamic character pairs to try, then uses that seed to randomly generate the specified number of pairs of dynamic characters from the alphabet of the specified size and ask the memoized TCM for the median and cost. The program will also log which characters were generated and what result was received. We can watch the memory usage of this program and when a problem sequence of characters is detected, we can replay the TCM queries in "pure C" so we can actually use valgrind to detect the memory error (assuming it's on the C/C++ side). If valgrind doesn't detect memory errors, then we can rule out issues on the C/C++ side and the space leak must involve Haskell's handling of the data.

I'd appreciate feedback on this approach from @ima-hima , @Boarders , @wardwheeler .

ima-hima commented 5 years ago

Iโ€™ll try to look at this in the next two weeks.

On Oct 8, 2018, at 15:26, recursion-ninja notifications@github.com wrote:

@ima-hima https://github.com/ima-hima I seem to remember a fair amount of testing the 2D component (x -> x -> (Word, x) of the FFI, but I'm not sure how much testing the 3D component (x -> x -> x -> (Word, x)) received. The 2D component is used on the post order traversal to compute the preliminary medians. The 3D component is used on the preorder traversal to compute the final medians.

We ran the codebase on Clara & Peter's linguistic data set and the memory exploded, (over 64 GB to build a tree, not a network). I suspected that there was a space leak. Since we were using the discrete metric (non-additive, 1-1, etc) for the linguistic data set, we added a special case function that if the discrete metric was supplied, the cost and medians would be computed natively in Haskell using the very simple logic for that special case and then if a more complex metric was supplied we would use the FFI. After this change was implemented the memory during the Wagner build stayed nearly constant. We later removed calls to the FFI for larger alphabets over any metric and instead computed cost and medians natively in Haskell without any memoization. This was a temporary measure.

There is likely a space leak somewhere regarding the FFI. The space leak might involve how Haskell is handling the pointer(s) to the memoized TCM structure(s), how Haskell is serializing/deserializing across the FFI, or there could be a defect in the memoize TCM's memory management.

There aren't easy ways to test this. I'm thinking of setting up a minimal working example Haskell utility program that takes a random seed, an input alphabet length, and a number of dynamic character pairs to try, then uses that seed to randomly generate the specified number of pairs of dynamic characters from the alphabet of the specified size and ask the memoized TCM for the median and cost. The program will also log which characters were generated and what result was received. We can watch the memory usage of this program and when a problem sequence of characters is detected, we can replay the TCM queries in "pure C" so we can actually use valgrind to detect the memory error (assuming it's on the C/C++ side). If valgrind doesn't detect memory errors, then we can rule out issues on the C/C++ side and the space leak must involve Haskell's handling of the data.

I'd appreciate feedback on this approach from @ima-hima https://github.com/ima-hima , @Boarders https://github.com/Boarders , @wardwheeler https://github.com/wardwheeler .

โ€” You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/amnh/PCG/issues/81#issuecomment-427951583, or mute the thread https://github.com/notifications/unsubscribe-auth/AM3OsrklXKXXP9JHrVgknVSNmDYRAo-rks5ui6cDgaJpZM4XLIp4.

recursion-ninja commented 5 years ago

Maybe we can check some simple stuff first:

ima-hima commented 5 years ago

1: Couldn't we just output something to a log file whenever it happens? Thatโ€™s a side-effect, but it is on the ugly side of the FFI anyway.

2: Hmm. On the C side I could run a really long valgrind session where I just throw a ridiculous amount of data at it to see if thereโ€™s an actual leak. Checking across the FFI Iโ€™m not sure about. In that case weโ€™re relying on garbage collection, so memory might just get eaten until the end of the run, since the garbage collector might not realize when the memoryโ€™s no longer needed.

e

On Oct 9, 2018, at 12:57, recursion-ninja notifications@github.com wrote:

Maybe we can check some simple stuff first:

Does the deallocator for the memoized TCM ever get called? See if we can look at memory usage over time in a more refined way than using top. โ€” You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/amnh/PCG/issues/81#issuecomment-428269447, or mute the thread https://github.com/notifications/unsubscribe-auth/AM3OsmnGQnt7ssrI-Bwlh_Zfb9jbY3zdks5ujNVegaJpZM4XLIp4.

recursion-ninja commented 5 years ago

A test is to modify the code to not use the compact TCM for DNA. and run the DNA data-set against the memorized TCM. The TCM space should be exhausted almost immediately, and if the memory usage doesn't increase, then the TCM isn't leaking space.

We can achieve this with a build command after a read command that contains a tree or a network build.

We may want to run two processes, one with the compact TCM for DNA and one that uses the memoized TCM to distinguish the memory usage profiles of each, and how quickly the memoized TCM stabilized.

recursion-ninja commented 5 years ago

There was a space leak on the Haskell side of the FFI. It appears to be the only space leak associated with the TCM based on the results of the above tests.