codewars / docs

The Codewars Docs :construction: WIP
https://docs.codewars.com
MIT License
57 stars 203 forks source link

Memory management patterns for C kata #230

Closed hobovsky closed 3 years ago

hobovsky commented 3 years ago

Create a descriptions of patterns for memory management which would familiarize C authors with any approach other than "malloc everything".

hobovsky commented 3 years ago

I will move the discussion here, because I think it's difficult to discuss things in PR:

@ggorlen commented (https://github.com/codewars/docs/pull/231#pullrequestreview-566731480):

Interesting article with a lot of potential! While it offers some good food for thought and a few novel ideas I hadn't considered, my take is that it's a bit too opinionated and complicated for its own good.

In practice, in the majority of the C kata I've seen that require dynamic memory allocation, the solution function allocates its own memory with malloc and returns a pointer for the caller to free (the "naive" strategy discussed here). For LeetCode, where nearly all of the problems are translated into C, this strategy is applied across the board. The format consistency makes it easy to focus on the solution right away, and even with 2- or 3-d arrays, I've never found it unduly difficult/burdensome to malloc and return inner arrays and an array of sizes in the case of jagged arrays (as in this example).

That said, I agree with the guide that this "naive" approach isn't always better, both for kata and software engineering in general (important to consider if only to the extent that we don't want to recommend unusual/esoteric/questionable stuff), but it seems to be fine for most kata in practice. The preallocated buffer strategy is fine too and is a great fit for many kata types, but I'm not sure it's helpful to pick sides. Making authors aware of the characteristics of each without offering an opinion seems ideal.

As for the flattened nd arrays, reallocation and other approaches, these seem overly-complex. I can't imagine too many kata will need these. It's worth mentioning the flattened array approach because it exists, may be suitable for the occasional challenge and seems to align well with assembly language translations (I'm not sure, I've never done asm on CW), but is unlikely to mesh very well with typical translations coming from high-level languages that are overwhelmingly going to use multi-level nd arrays.

hobovsky commented 3 years ago

TL;DR

  1. I have a specific problem of clueless authors creating bad content which I am trying to solve. I would appreciate any suggestions how to approach it.
  2. I agree that things are complex, but I believe the complexity is an inherent part and an essential feature of the problem and I do not know if/how it could be avoided. I would be glad to be proven wrong, or explained how to present things in an easy way.

One thing I should probably do is to put a guideline to "explain required allocation method" in this article: https://docs.codewars.com/languages/c/authoring/#compilation-warnings . I haven't put it there, because I thought a dedicated article would be better, but maybe it's not. Besides, it solves only a part of the problem.

Problem I am trying to solve

Then maybe we could proceed the other way round: I will tell you what problem I want to solve, and you will give me ideas how to solve it.

One of my concerns is that after attempting to solve several C kata (mostly low level ones, but not only), I've noticed that C translators come up with ideas which are, um... less than ideal, and after struggling with them several times I would sometimes even bluntly call them retarded. There are kata which require you to:

These are the most prominent examples which put me in real awe, but there's also another class of problems, which require a more or less valid approach when you squint your eyes hard enough, but with insufficient specs:

I could come up with a couple of more examples and rant on them in a very opinionated way, because they trigger me really hard ;)

As I've written somewhere, all I want to do is to stop this. I have a dream to read a description of a C kata one day and do not have to worry about a crash or a leak. I think it could be achieved by educating the authors and I wanted to create an article which would educate them.

Complexity of the article

I agree that the article ended up complex, and I do not like this either. I am asking for a review specifically to hunt for ideas to simplify it. I am afraid that the way how ideas are explained is too complex, and it can get readers discouraged rather than interested.

But I am not sure it can be avoided. I think that complexity of the memory management is an essential feature of the C language, and not just an incidental property of a task. Article is complex, because the topic is complex. Doing things correctly is hard, and understanding what is correct and what is not is hard. I am all for simplifying it, I just do not know how.

On the "naive approach" of malloc in solution, free in tests

Note: I might get some technical details wrong and if I am wrong, I would appreciate being educated here :)

In practice, in the majority of the C kata I've seen that require dynamic memory allocation, the solution function allocates its own memory with malloc and returns a pointer for the caller to free (the "naive" strategy discussed here). For LeetCode, where nearly all of the problems are translated into C, this strategy is applied across the board. The format consistency makes it easy to focus on the solution right away,

That said, I agree with the guide that this "naive" approach isn't always better, both for kata and software engineering in general (important to consider if only to the extent that we don't want to recommend unusual/esoteric/questionable stuff), but it seems to be fine for most kata in practice.

I seriously dislike this argument, for a couple of reasons. Saying that "It works" is just like saying that "I do not have to include this header, because it works." or "I can cast these pointers, it works" or "I do not have to use fuzzy equals, strict comparison works." . A SUT and its test suite are two logically separate things by definition, and performing de/allocations across logical boundaries is straight incorrect and a crash waiting to happen. There's no guarantee that the solution and the tests share the runtime, operate on the same heap, and share the memory metadata. It works due to specific character of what a kata is and how the solution is built. But still I consider it incorrect unless it's documented and made guaranteed to work .
Argument that "It works for others" is, uhm, even less of an argument :)

But even when it were made to work, I would still not be very fond of the idea. It would still be a cross-boundary de/allocation, which is invalid in C. It would make a good fit for Codewars kata, and a bad practice for C coding. It's just like relying on a system to have some specific locale set and getting an unexpected crash when regional settings change. Relying on a specific locale is wrong, assuming a size of a pointer is wrong, assuming that MAX_LONG > MAX_INT is wrong, closing a stream which you haven't opened is wrong, so wy writing a code which assumes that someone can free what they have not malloced is right? You do not create a kata where a solution opens a DB connection which tests would close. It seems to me as a gaping hole in a reasoning of "let's follow good coding practices when solving kata, except in C".

On bringing unnecessary noise to the problem

The format consistency makes it easy to focus on the solution right away

While I see where this argument is coming from, I think it's a bit of a stretch which trades correctness for convenience. Trading correctness for pretty much anything is IMO not impossible, but questionable at best.

C is Brainfuck of general purpose languages. When you have a problem and decide to solve it in C, you have two problems (unless you use regex too, then you have three). When you choose to solve a task in C, you've already committed to suffer. When compared to other languages, C is complex by its nature, just like Haskell is difficult or JS is WTFy. Removing this essential aspects from a kata seems for me to artificially and forcibly skew the problem, and not to simplify it.

But, well, it's just my opinion.

Thanks for listening, and thanks for bearing with me :)

CC @ggorlen @kazk @error256

kazk commented 3 years ago
  1. I have a specific problem of clueless authors creating bad content which I am trying to solve. I would appreciate any suggestions how to approach it.
  2. I agree that things are complex, but I believe the complexity is an inherent part and an essential feature of the problem and I do not know if/how it could be avoided. I would be glad to be proven wrong, or explained how to present things in an easy way.

I don't think there's an easy way to educate "clueless" authors on this topic. Delegating most of that to existing articles by linking to them is probably better than trying to write your own.

In general, inexperienced ones are not capable of considering trade-offs of different approaches because they only know one. Probably the best we can do is to present different approaches and briefly explain pros and cons for each. Then provide some guidance and links for them to learn further from.

When compared to other languages, C is complex by its nature, just like Haskell is difficult or JS is WTFy. Removing this essential aspects from a kata seems for me to artificially and forcibly skew the problem, and not to simplify it.

I recently learned this term:

Waterbed theory is the observation, ascribed to Larry Wall, that some systems, such as human and computer languages, contain a minimum amount of complexity, and that attempting to "push down" the complexity of such a system in one place will invariably cause complexity to "pop up" elsewhere.

The language C is pretty simple, but it's definitely not easy to use correctly. You're responsible for things that other languages let the computer do. It just requires you to be way more careful.

The C authors should be aware of the effects of their decisions and clearly document that. Making this clear is a good start.


because I think it's difficult to discuss things in PR

@hobovsky Do you want to use the GitHub Discussion (a new feature)? See https://github.com/codewars/docs/discussions/234

ggorlen commented 3 years ago

Thanks for listening, and thanks for bearing with me :)

No problem, always happy to discuss and thanks again for putting so much energy into this! I hope I'm not coming across too critically, just want to toss in a somewhat fundamentally different take on it.


I agree that CW has some highly questionable code in its C repertoire, but on the other hand, I don't think that's unique to C--I've seen just as much garbage code in JS or Python on CW as I have with C.

C is Brainfuck of general purpose languages. When you have a problem and decide to solve it in C, you have two problems (unless you use regex too, then you have three). When you choose to solve a task in C, you've already committed to suffer. When compared to other languages, C is complex by its nature, just like Haskell is difficult or JS is WTFy. Removing this essential aspects from a kata seems for me to artificially and forcibly skew the problem, and not to simplify it.

I'm not sure it's all that bad. Sure, there's no STL which is significant, but after getting used to C, I don't typically find it fantastically harder than any other language (and occasionally, I find it's easier for certain types of problems, especially in the algorithm space). The point of saying this is just to suggest that I suspect a lot of our kata authors or translators should consider self-evaluating themselves out of the task if they feel C is overwhelming or uncomfortable.


Saying that "It works" is just like saying that "I do not have to include this header, because it works." or "I can cast these pointers, it works" or "I do not have to use fuzzy equals, strict comparison works."

I think you may be misunderstanding the proposal I'm making--it's essentially the same contract as fopen/fclose and other functions like asprintf and strdup. There is no correctness issue here--it's just a matter of whether the memory allocation is callee- or caller-managed. strcpy and sprintf would be examples of caller-allocated functions. Both are valid and useful in different circumstances.

The fact that LeetCode has applied the callee-allocated approach to hundreds of problems successfully, alongside ASAN seems like pretty strong evidence for it being an effective approach to coding challenges (most other online judge sites use stdout so I have to rely on LC).

Speaking of ASAN, part of the reason there's so much garbage C on CW is (I suspect) because it was written prior to Kaz enabling compiler warnings and better segfault messages a year-ish ago. When this was done, it pretty much borked a bunch of migrated CW content that took time to clean up and not be dumping walls of stderr to the runner.

I went to CW just now and clicked on the first C kata that came up, ran the test suite and got blasted with a wall of warnings. I don't have time to gather much more evidence to support my case but this seems indicative of the typical CW C kata.


So, what to do about all of this? My preferred approach would be to explain the characteristics of C memory management relevant to a typical kata (memory can be callee- or caller- managed, or both, and there are advantages and disadvantages to each approach), set forth the common pitfalls and present kata authors with a few simple examples they can work from, require that they resolve all compiler warnings, caution them about even approaching C authorship or translation if they're just starting out with C, then pretty much trust them to Do The Right Thing (tm).

hobovsky commented 3 years ago

Thanks guys for your inputs, I will try to reiterate on the article and see what I can come up with considering your remarks. Maybe I will just ditch it and put a guideline of clearly documenting the required format of allocated memory in the C authoring guide.

Just let me comment on one thing, but it's not something I am trying to push in any way, it's just somewhat related to a question I still have:

I think you may be misunderstanding the proposal I'm making--it's essentially the same contract as fopen/fclose and other functions like asprintf and strdup. There is no correctness issue here--it's just a matter of whether the memory allocation is callee- or caller-managed. strcpy and sprintf would be examples of caller-allocated functions. Both are valid and useful in different circumstances.

I agree that both approaches of callee- and caller-managed resources are valid and have their place, but strdup and asprintf are neither. Callee-managed memory would be when both allocation and deallocation are performed by means provided by a callee. Also strdup and others are somewhat different than "malloc in solution and free in tests". The former are deliberately designed this way and they transfer the ownership of the resource to a caller who might, or might not, be able to support the responsibility of managing it correctly. While for a kata, such "mixed approach" IMO can be seen as a "design shortcut" at best, and an effect of "not knowing better" in majority of cases, which incidentally happens to work. However, under some circumstances, such mixed approach can also be considered valid, what brings me to the actual question:

A SUT [subject under test, i.e. submitted solution in context of CW] and its test suite are two logically separate things by definition, and performing de/allocations across logical boundaries is straight incorrect and a crash waiting to happen. There's no guarantee that the solution and the tests share the runtime, operate on the same heap, and share the memory metadata. It works due to specific character of what a kata is and how the solution is built. But still I consider [de/allocations crossing the boundary between solution and tests] incorrect unless it's documented and made guaranteed to work.

Am I getting this wrong? I am asking honestly, because maybe I am just remembering something incorrectly, or never knew right, or maybe I just mixed things up. Please correct me if I am wrong and explain where. Is the method of allocating in one place and deallocating in another guaranteed to work always? Is it guaranteed to work on linux? Is it guaranteed to work on Codewars? Maybe I just see an issue while it's not actually there, and all my rants are simply pointless? If so, I am really sorry for that and I would really appreciate being educated here!

hobovsky commented 3 years ago

Please see #235 and tell me if it helps anything.

ggorlen commented 3 years ago

Is the method of allocating in one place and deallocating in another guaranteed to work always? Is it guaranteed to work on linux? Is it guaranteed to work on Codewars? Maybe I just see an issue while it's not actually there, and all my rants are simply pointless? If so, I am really sorry for that and I would really appreciate being educated here!

Yes, you can allocate anywhere and deallocate anywhere else, there is no correctness issue. After compilation, functions are just instruction memory locations that are jumped to and the OS memory manager doesn't distinguish allocation and deallocation instruction locations, only sees the process as a whole making a system call. The issue is purely a matter of design with both options being useful and legitimate in various cases.

Although you're technically correct that asprintf and strdup are hybrid-managed, I'd argue that they still pretty much follow the callee-managed pattern with the nuance that the "destroy" routine is inlined because it's trivially free(s) (maybe I should have used the term "callee-allocated"). This makes sense because the allocated pointer is char *, not a complex opaque pointer/object context that may be refactored, so it's excessive abstraction to wrap it in a strdup_free-type routine that just calls free(s) always and forever. This "hybrid approach" happens to overlap with a lot of CW kata contracts and seems perfectly fine to me. The canonical callee-managed patterns as I understand them are:

Foo *f;
foo_init(f);
// do stuff with f
foo_destroy(f);

and

Foo *f = new_foo();
// do stuff with f
foo_destroy(f);

...with variants for different data types (these examples above follow the "opaque pointer" pattern rather than what a typical kata would use with a char *, int * or int **). In many cases, of course, the function will also return the length of the array or an error status and allocate/populate memory, as in the fopen/fgets functions or calloc. The top pattern is basically fopen/fclose and the bottom pattern is strdup/asprintf/malloc itself if you grant that we can inline free for strings. All of this is "fair game" for designing a reasonable kata solution header.

Writing the foo_destroy routine is appropriate for "OOP" style challenges with opaque pointers, but not so much with char *, char **, int * and int ** that are the typical kata header structures. Having the solver write a separate arr_destroy function would probably be silly. For one, it's basically trivial to write and it just adds more noise that takes focus off the main task. For two, it may not even be necessary for the solver to write the function since memory leaks don't have a meaningful impact unless they're big enough to bring down the process, so there'd have to be an explicit test for that which seems like a waste of time in most cases (even writing such a test would be a PITA because crashing the process would kill the test suite, so you'd have to make a system call to get memory usage--yuck) (actually, Criterion can pick this up because each test is run in a separate process, but it still seems superfluous).

In short, in callee-managed kata, the kata author effectively writes the cleanup function as a simplification to keep the solver from tedious busywork that can't be easily tested anyway, then optionally inlines it into the test suite if it's as simple as free(arr) which is the common case int * that was malloc'd by the callee.


While for a kata, such "mixed approach" IMO can be seen as a "design shortcut" at best

I disagree--it seems plenty legitimate to me, both in terms of "real application" code and kata code (I've used the technique on probably dozens of challenges for Qualified). I'd encourage you to play through a few LeetCode C array challenges and see what you think, if you haven't yet. I stop short of recommending it as a one-size-fits-all solution as LC does, but if I had to pick one approach, this would be it.

One reason I think it works nicely for challenges is that the solver has more control and it creates signal that they follow best practices regarding malloc (important for Qualified). If they're overstepping array bounds and crashing, they only have to look at their own code instead of digging into the test suite to figure out how large the buffer they're given is (something that isn't possible in LC), so it gives a sense of agency to the solver that I find appealing. In many cases, how they use malloc vs calloc and realloc (or heck, asprintf/strdup even though they're non standard) is really fun/interesting and changes the solution dynamic considerably, whereas the preallocated-buffer solutions might all tend to look pretty much the same, depending on the nature of the kata.

Also, for many challenges, the size of the result array isn't necessarily known in advance (e.g. an array filtering operation like "give me all the even numbers in an array"), so the test boilerplate would essentially be giving away critical details and adding contrivance that likely wouldn't be known in a real scenario. UB is possible/likely when the solver hasn't correctly implemented the solution and writes past the preallocated buffer size, or the caller (test suite) assumes they've populated it fully but haven't. And, of course, allocating a massive chunk of memory to cover the possibility of excess only kicks the UB potential can down the road and is not a good look from a "best practices" standpoint, so I can't recommend it.

Again, there are so many knobs and nuances here so it's going to fall on the kata creator to use their C knowledge for good and make reasonable decisions for whatever kata they're making. But in general, I'd say:


Zooming back out (sorry for the longwinded explanation!), the main issues I have with CW C kata in descending severity are UB/compiler warnings/risky casts, sloppy/unreadable/golfed testing boilerplate, and to a lesser extent, weird solution headers and unusual design decisions. If the test suite leaks a bit of memory, requires a rather superfluous malloc or returns a char * when an enum would do, that's not great but not horrible either in my book. On the other hand, UB, compiler warnings and messy/minified boilerplate are horrible and I vote as the top priorities for docs to address. Memory management happens to play a role in all of this, especially UB and warnings.

I develop my C challenges offline and make sure they pass valgrind and emit no warnings. If CW kata always did this One Weird Trick, I think we'd be 90% of the way to awesome territory, with the remaining 10% being "don't minify/obfuscate your test suite" (duh!) and "don't pick weird function headers if you can help it" (rather nuanced and hard to explain).

hobovsky commented 3 years ago

Yes, you can allocate anywhere and deallocate anywhere else, there is no correctness issue. After compilation, functions are just instruction memory locations that are jumped to and the OS memory manager doesn't distinguish allocation and deallocation instruction locations, only sees the process as a whole making a system call.

Is the matter of multiple modules linked with different runtimes, and effectively operating on separate heaps, not an issue? Is it not a problem in general, or is it not a problem in Codewars setup?

ggorlen commented 3 years ago

Ah, I see what you're worried about. Interesting point but I don't think this applies here. I've never seen any evidence that the solution and test suite have different heaps. I don't know much about DLLs and most of linking and loading is magic to me (I have to preface everything by "I'm just another C challenge author, not a professional C developer"), so I can't explain exactly what's going on without looking into it further. Empirically, if this were unsafe, there'd have to be crashes following this pattern sooner or later but I've never seen them.

Curious if anyone can confirm/deny--this would be quite a revelation if I'm mistaken here!

I think in the pre-Kaz code runner era, the C and C++ solution and test suite files were simply concatenated together before building (another reason for rampant CW warnings of implicit function definitions), so it seems pretty likely that the header inclusion in the improved approach is doing the same when they're different files.

hobovsky commented 3 years ago

Back in the days when I used to do professional development in C++ for Windows (it was not much, not directly in C but there were elements of it, and very long time ago, using Visual C++ 6, Win2000-era), problems with incompatible runtimes were somewhat easy to get into. You had multi-thredaded vs. single-threaded variants, debug and release variants, statically and dynamically linked variants, and in effect you ended up with a matrix of 8 possibilities of different compatibility levels, adding multiple installed versions of DLLs and ending in DLL hell on the top of it :) Compile one module with one set of settings, link with a module compiled with another set of settings, and BOOM, "your program has crashed". I was also much, much younger and I could understand something incorrectly or misremember things. But I remember debugging an issue of mismatching calls between strdup and free myself. Fortunately, during development, and not on production :)

Granted, my experience is totally outdated, and related to different platform, and different setup. Things probably changed over time (there's no single-threaded runtime anymore in VS), things probably look differently on linux (is it common to have a runtime linked statically?) and probably does not apply to CW (it's unlikely that test suite and user solution would be compiled separately, with different compilation options). Or, doesn't it?

All what I was trying to discuss is whether the setup of CW runner is safe enough to support the way of constructing kata with asymmetric calls to malloc and free. I am almost certain that it's not generally safe, and I am just too unexperienced with current state of the art, of the technology, and of the CW setup, to tell whether it is safe in this particular setup. If it is universally safe, or even just safe for CW and we agree that it suffices to have kata set up in a way correct in context of CW and not "correct in general", then there's nothing more to discuss and I am just sorry I've wasted your time. However, maybe it is something worth considering, thinking about, or documenting.

error256 commented 3 years ago

Tests are linked together with solutions. Whether they are linked with the standard library statically or dynamically, the use the same library, so that shouldn't be relevant here.

hobovsky commented 3 years ago

I've reworked #231, please let me know if it's better and what would still have to be improved.

Thanks!

error256 commented 3 years ago

I still don't like examples of using strings instead of enums...

If you really want to use strings for some reason

Isn't the tutorial for those who are not sure what to do? Those who have a good reason to really want to use strings, if one even exists, probably won't need the tutorial. Thinking about other options... Preloaded can be used as a header with declarations and #defines and the solution can #include "setup.c".

hobovsky commented 3 years ago

The only reason why I left the suggestion of using strings is that I've seen authors and translators going defensive about my remarks of replacing them with enums, for various reasons. Most common reason was compatibility with other translations of the kata, usually high level languages.

I would be all for discouraging stringenums completely, I just did not want to force users and wanted to leave them some choice. I just thought that a suggestion of replacing strings with enums will be more friendly for translators than a hard guideline. But if you think it's not a good idea, then even better.

hobovsky commented 3 years ago

Okay, since the questions about enums appeared quite often during review, I turned the recommendation into guideline. I removed mentions of string constants and suggested turning them into enums. Let me know what you think.

hobovsky commented 3 years ago

231 collected a couple of green check marks, so going to merge. If anyone notices any issues there, then let me know.

Thank you for reviews and for all your remarks, and sorry for wasting your time on this pointless discussion :)