UpstandingHackers / hammer

Parser combinators for binary formats, in C. Yes, in C. What? Don't look at me like that.
GNU General Public License v2.0
431 stars 40 forks source link

h_cfgrammar_() leaks memory #182

Open puellavulnerata opened 7 years ago

puellavulnerata commented 7 years ago

While valgrinding unit tests and plugging leaks, I found that /core/grammar/end leaks even if I add h_cfgrammar_free(g) to test_end(). The cause is the calls to h_new() (rather than in the grammar's arena) in if (h_hashset_empty(g->nts)) branch of h_cfgrammar_(). Unfortunately, these cannot simply be changed over to h_arena_malloc() calls because the LL(k) backend frees the grammar once it finishes compiling the parser, but apparently keeps some of these pointers around and uses them during parsing. Further investigation proceeding.

puellavulnerata commented 7 years ago

I believe, tentatively, that the underlying problem here is that collect_nts() relies on the desugared field of the parser nodes, so each HCFChoice there is freed along with the corresponding parser, but in the case where desugared is a terminal we wrap it, and the resulting wrapper HCFChoice does not appear anywhere in the parser, only in the grammar, and so never gets freed.

pesco commented 7 years ago

Andrea Shepard on Tue, Dec 06 2016:

While valgrinding unit tests and plugging leaks, I found that /core/grammar/end leaks even if I add h_cfgrammar_free(g) to test_end(). The cause is the calls to h_new() (rather than in the grammar's arena) in if (h_hashset_empty(g->nts)) branch of h_cfgrammar_(). Unfortunately, these cannot simply be changed over to h_arena_malloc() calls because the LL(k) backend frees the grammar once it finishes compiling the parser, but apparently keeps some of these pointers around and uses them during parsing. Further investigation proceeding.

I've had a quick look and can shed some light:

The real issue is h_desugar, which uses h_new. The case in h_cfgrammar_ that you mention is just a special case where h_desugar doesn't return the form we want. These objects (HCFChoice) are kept because they represent the nonterminals[*]. To free this memory properly, we should consider it a transfer of ownership from the grammar to the table (backend data).

Longer version:

To generate the grammar, the parser is first converted to a sum-of-products representation by h_desugar. The type HCFChoice with tag HCF_CHOICE is the sum and HCFSequence is the product. The base terms are represented by HCFChoice with the other tags (HCF_CHAR, HCF_CHARSET, etc.).

The case you're looking at in h_cfgrammar_ is the special case where h_desugar returns a base term and we need to wrap that in a singleton sequence and a singleton choice to give it the form expected of the start symbol (nonterminal).

The function h_desugar is using h_new in the same style the combinators use it: allocate once, free never. The implicit assumption being that parsers are constructed and compiled once at the beginning of the program and deallocated by termination of the program. Don't look at me, I just got here.

I intended the h_cfgrammar routine to be potentially useful stand-alone, and I don't think we want to burden a user of that function with the detail of (the awkwardly-named) "desugaring", so I think it's right to create these objects there. But when the HCFGrammar is created only transiently as part of compiling a table, it can't free the HCFChoices with it, as discussed.

One simple solution would be to explicitly transfer ownership of the symbols to the table and only free them in h_cfgrammar_free if we still own them. Or instead of explicitly tracking the ownership, distinguish between a full and a "half" free of the grammar; let h_cfgrammar_free free everything and give the backend another function to free only the grammar during compilation and the symbols later, in its own destructor.

-pesco

[*]: They contain, for instance, the function pointers for semantic actions, predicates, and the "reshape" functions.