Ironholds / poster

Address parsing and normalisation through libpostal
MIT License
59 stars 9 forks source link

Memory leak on normalisation #2

Closed Ironholds closed 8 years ago

Ironholds commented 8 years ago

normalise_addr("Quatre-vignt-douze Ave des Champs-Élysées") [1] "92 avenue des champs-elysees" q(save="no")

leads to

==3443== ==3443== HEAP SUMMARY: ==3443== in use at exit: 783,157,093 bytes in 522,029 blocks ==3443== total heap usage: 546,247 allocs, 24,218 frees, 1,549,186,382 bytes allocated ==3443== ==3443== 67 (24 direct, 43 indirect) bytes in 1 blocks are definitely lost in loss record 79 of 2,045 ==3443== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) ==3443== by 0xD839B06: normalize_string_languages (in /usr/local/lib/libpostal.so.0.0.0) ==3443== by 0xD81CF45: expand_address (in /usr/local/lib/libpostal.so.0.0.0) ==3443== by 0xD5FEED7: normalise_addr(Rcpp::Vector<16, Rcpp::PreserveStorage>) (poster.cpp:60) ==3443== by 0xD5FD5F8: poster_normalise_addr (RcppExports.cpp:33) ==3443== by 0x4F0BA37: ??? (in /usr/lib/R/lib/libR.so) ==3443== by 0x4F4ACCA: Rf_eval (in /usr/lib/R/lib/libR.so) ==3443== by 0x4F4CDBF: ??? (in /usr/lib/R/lib/libR.so) ==3443== by 0x4F4AAD2: Rf_eval (in /usr/lib/R/lib/libR.so) ==3443== by 0x4F4BE56: Rf_applyClosure (in /usr/lib/R/lib/libR.so) ==3443== by 0x4F4A8AE: Rf_eval (in /usr/lib/R/lib/libR.so) ==3443== by 0x4F71D61: Rf_ReplIteration (in /usr/lib/R/lib/libR.so) ==3443== ==3443== LEAK SUMMARY: ==3443== definitely lost: 24 bytes in 1 blocks ==3443== indirectly lost: 43 bytes in 1 blocks ==3443== possibly lost: 0 bytes in 0 blocks ==3443== still reachable: 783,157,026 bytes in 522,027 blocks ==3443== suppressed: 0 bytes in 0 blocks ==3443== Reachable blocks (those to which a pointer was found) are not shown. ==3443== To see them, rerun with: --leak-check=full --show-leak-kinds=all

albarrentine commented 8 years ago

Testing on the pure C lib I'm not seeing this error. I do remember fixing a memory leak in that function at some point. Is this on latest?

albarrentine commented 8 years ago

Think I found it. Calls to expansion_array_destroy should be 1-to-1 with expand_address, otherwise it would leak memory (expansion_array_destroy is currently just a convenience function for freeing an array of char *'s, but may do more if we change the response type at some point).

I like the idea of vectorizing calls to libpostal, amortizes the cost of crossing the C boundary, similar to what we'd do with numeric arrays. Might be useful in some of the other bindings as well.

Ironholds commented 8 years ago

Hmn; interesting. I thought I'd tried that approach and it hadn't worked, but I'll try again!

Ironholds commented 8 years ago

Still happening, even with the creation of "expansion" and its destruction moved inside the loop. Interesting element: while it happens with normalise_addr("Quatre-vignt-douze Ave des Champs-Élysées"), it doesn't happen if the address is instead "fffffffffffffffffffffffffffffffffffffffff"- deliberately the same length in char terms. UTF8/codepoint problem with the E-acute and other non-ASCII chars?

albarrentine commented 8 years ago

Ah, that is on my side. There was a temporary char_array I wasn't freeing (which only got allocated for non-ASCII text, which would explain the "fffffffffffffffffffffffffffffffffffffffff" case). Pushed a fix

albarrentine commented 8 years ago

The in-loop memory leak I mentioned previously kicks in when passing in multiple strings. If it's an n=1 loop with cleanup at the end, it so happens that the expansions pointer is pointing to the memory that needs to be freed and everything is copacetic, but for n > 1 loop iterations, the pointer gets overwritten and the memory for the first n-1 expansions is leaked.

Ironholds commented 8 years ago

Gotcha, so it should be "for each entry, run, cleanup" and not "create an instance, run one by one, cleanup"?

albarrentine commented 8 years ago

Yep, exactly. expand_address/parse_address are "caller frees," Unix-style.

Everything looks good to me in terms of the libpostal calls, memory, etc. The one thing to consider, as mentioned in the other issue, is how to handle the multiple values problem, mostly for expansions since they're not in any particular order (they're more like a set). There's a disambiguation model in the nearish-term works that should with high accuracy ensure that the first expansion is the correct one, but will be another model file to load, probably a small one.

Ironholds commented 8 years ago

Yeah, that makes sense. At the moment I'm just loading the first one anyway; looking forward to that model!

Okay, with the changes I just pushed I think we should be good. I'll test against the latest version to make sure the leak is cleaned. Thanks for your help! Hope the incidental bug identifications were useful :)

albarrentine commented 8 years ago

They were indeed! There may be a few additions to make for the new parser classes, and the disambiguation call will probably entail some minor API changes, but that's at least a couple weeks out - will keep you informed.

Ironholds commented 8 years ago

Yay! Thanks so much :)