NCAR / ParallelIO

A high-level Parallel I/O Library for structured grid applications
Apache License 2.0
136 stars 53 forks source link

PIO_USE_MALLOC is always applied in bget.c #877

Closed dqwu closed 7 years ago

dqwu commented 7 years ago

The option PIO_USE_MALLOC is turned off by default: option (PIO_USE_MALLOC "Use native malloc (instead of bget package)" OFF)

Based on this option, in config.h we will have a macro defined: #define PIO_USE_MALLOC 0 or #define PIO_USE_MALLOC 1

However, in bget.c, this macro is incorrectly tested (it is always defined, no matter the value is 0 or 1)

#include <config.h>
...
#ifdef PIO_USE_MALLOC
#include <stdlib.h>
#endif
...
#ifdef PIO_USE_MALLOC
    buf = malloc(requested_size);
    return(buf);
#endif
...

In pio_darray_int.c, the macro is correctly tested (check whether the value of the macro is 0 or 1, instead of whether it is defined or not)

#if !PIO_USE_MALLOC
...
#endif

[Some comments] 1) Before PR #122, config.h is not included in bget.c, so the macro is not defined at that time in that source file. 2) For now, suppose the option PIO_USE_MALLOC is OFF by default when we run CMake, in bget.c native malloc is always used. However, in pio_darray_int.c, native malloc is not used. This inconsistency might cause some problems. 3) This issue affects test coverage. As we can see in bget.c, bget memory management will never be used 4) This means we have not been testing bget memory management since PR #122.

[How to fix] The fix is straightforward, just update bget.c to test the value of PIO_USE_MALLOC as in pio_darray_int.c

[Caution] After the fix is applied, a lot of unit tests will fail. This is expected, as we have not tested them since PR #122. Also, some cime_developer tests will fail if we apply this updated PIO2 to ESMCI CIME. We should continue to fix the failing tests.

dqwu commented 7 years ago

@jayeshkrishna Please take a look at this issue. You can assign it back to me to update bget.c, so that we will not apply PIO_USE_MALLOC by default (unless we set that option when we run CMake). The issue is that a lot of unit tests will be affected.

edhartnett commented 7 years ago

I am looking at this now.

But why do the unit tests fail? Is bget failing in some places that malloc succeeded in? If so, why would it do that?

dqwu commented 7 years ago

@edhartnett You can check how many unit tests will fail with the fix. Before PR #122 (config.h not included in bget.c yet), we can assume that we always use bget package in bget.c. After that, we can assume that we always use native malloc in bget.c The issue is that the unit tests have not been tested with bget package being used for a long time.

jayeshkrishna commented 7 years ago

Good catch @dqwu , looks like the bget developers expected (or the code was written when ...) a simple makefile. We need to move from "#ifdef"s to "#if" like you suggested. I will also take a look.

edhartnett commented 7 years ago

I get that bget has not been tested for a long time.

But I still don't get why it fails. We have not made any changes to it at all, so it should work exactly the same...

dqwu commented 7 years ago

@jayeshkrishna I noticed that ESMCI CIME migrated PIO2 recently. For previously used PIO2 (based on hash 17da322, April 15, 2016), the cime_developer test SEQ_Ln9.f19_g16_rx1.A passes. At that time, bget package is used in bget.c (config.h not included until PR #122, Oct 18, 2016) For currently used PIO2 (based on hash 9d8d2e1), SEQ_Ln9.f19_g16_rx1.A also passes. However, it is using native malloc in bget.c now. If we use bget package in bget.c, it will fail. After we update bget.c to use bget package by default, we can fix the failed unit tests in PIO2 first, then check if SEQ_Ln9.f19_g16_rx1.A is also fixed.

dqwu commented 7 years ago

@edhartnett You can test with this patch to bget.c:

--- a/src/clib/bget.c
+++ b/src/clib/bget.c
@@ -400,7 +400,7 @@
 #include <config.h>
 #include <pio.h>
 #include <pio_internal.h>
-#ifdef PIO_USE_MALLOC
+#if PIO_USE_MALLOC
 #include <stdlib.h>
 #endif

@@ -614,7 +614,7 @@ bufsize requested_size;
     int compactseq = 0;
 #endif

-#ifdef PIO_USE_MALLOC
+#if PIO_USE_MALLOC
     //    if(requested_size>maxsize){
     //   maxsize=requested_size;
     //   printf("%s %d %d\n",__FILE__,__LINE__,maxsize);
@@ -847,7 +847,7 @@ bufsize size;
     bufsize osize;                    /* Old size of buffer */
     struct bhead *b;

-#ifdef PIO_USE_MALLOC
+#if PIO_USE_MALLOC
     return(realloc(buf, size));
 #endif
     if ((nbuf = bget(size)) == NULL) { /* Acquire new buffer */
@@ -882,7 +882,7 @@ void *buf;
 {
     struct bfhead *b, *bn;

-#ifdef PIO_USE_MALLOC
+#if PIO_USE_MALLOC
     //    printf("bget free %d %x\n",__LINE__,buf);
     free(buf);
     return;

On my local machine, 66 tests failed out of 90: 2 - test_rearr (Failed) 13 - test_pioc (Failed) 14 - test_pioc_unlim (Failed) 15 - test_pioc_putget (Failed) 16 - test_pioc_fill (Failed) 17 - test_darray (Failed) 18 - test_darray_multi (Failed) 19 - test_darray_multivar (Failed) 20 - test_darray_1d (Failed) 21 - test_darray_3d (Failed) 22 - test_darray_uneven (Failed) 23 - test_decomps (Failed) 24 - pio_unit_test (Failed) 35 - pio_rearr (Failed) 36 - pio_rearr_opts (Failed) 37 - pio_rearr_opts2_1p (Failed) 38 - pio_rearr_opts2_3p (Failed) 39 - pio_rearr_opts2_4p (Failed) 40 - pio_decomp_tests_1p (Failed) 41 - pio_decomp_tests_2p (Failed) 42 - pio_decomp_tests_3p (Failed) 43 - pio_decomp_tests_4p_1agg (Failed) 44 - pio_decomp_tests_4p_2agg (Failed) 45 - pio_decomp_tests_4p_3agg (Failed) 46 - pio_decomp_tests_4p_1iop (Failed) 47 - pio_decomp_tests_4p_2iop (Failed) 48 - pio_decomp_tests_4p_3iop (Failed) 49 - pio_decomp_tests_4p_2iop_2str (Failed) 50 - pio_decomp_tests_4p_2iop_1agg (Failed) 51 - pio_decomp_tests_1d_1p (Failed) 52 - pio_decomp_tests_1d_2p (Failed) 53 - pio_decomp_tests_1d_3p (Failed) 54 - pio_decomp_tests_1d_4p_1agg (Failed) 55 - pio_decomp_tests_1d_4p_2agg (Failed) 56 - pio_decomp_tests_1d_4p_3agg (Failed) 57 - pio_decomp_tests_1d_4p_1iop (Failed) 58 - pio_decomp_tests_1d_4p_2iop (Failed) 59 - pio_decomp_tests_1d_4p_3iop (Failed) 60 - pio_decomp_tests_1d_4p_2iop_2str (Failed) 61 - pio_decomp_tests_1d_4p_2iop_1agg (Failed) 62 - pio_decomp_tests_2d_1p (Failed) 63 - pio_decomp_tests_2d_2p (Failed) 64 - pio_decomp_tests_2d_3p (Failed) 65 - pio_decomp_tests_2d_4p_1agg (Failed) 66 - pio_decomp_tests_2d_4p_2agg (Failed) 67 - pio_decomp_tests_2d_4p_3agg (Failed) 68 - pio_decomp_tests_2d_4p_1iop (Failed) 69 - pio_decomp_tests_2d_4p_2iop (Failed) 70 - pio_decomp_tests_2d_4p_3iop (Failed) 71 - pio_decomp_tests_2d_4p_2iop_2str (Failed) 72 - pio_decomp_tests_2d_4p_2iop_1agg (Failed) 73 - pio_decomp_tests_3d_1p (Failed) 74 - pio_decomp_tests_3d_2p (Failed) 75 - pio_decomp_tests_3d_3p (Failed) 76 - pio_decomp_tests_3d_4p_1agg (Failed) 77 - pio_decomp_tests_3d_4p_2agg (Failed) 78 - pio_decomp_tests_3d_4p_3agg (Failed) 79 - pio_decomp_tests_3d_4p_1iop (Failed) 80 - pio_decomp_tests_3d_4p_2iop (Failed) 81 - pio_decomp_tests_3d_4p_3iop (Failed) 82 - pio_decomp_tests_3d_4p_2iop_2str (Failed) 83 - pio_decomp_tests_3d_4p_2iop_1agg (Failed) 84 - pio_decomp_frame_tests (Failed) 85 - pio_decomp_fillval (Failed) 89 - examplePio (Failed) 90 - example1 (Failed)

edhartnett commented 7 years ago

Yes, as I said, I get THAT they fail. I don't get WHY they fail.

They seem to fail when the final free is made on CN_bpool in PIOc_finalize().

dqwu commented 7 years ago

@edhartnett Maybe we can pick one failed test (the simplest one if possible) to take a close look. It is possible that fixing that single test will fix most or all of the failed tests, if the cause are the same.

jedwards4b commented 7 years ago

I agree - I've traced the problem to the PIOc_finalize, so fixing one is likely to fix them all. Still working on it but have a meeting now...

edhartnett commented 7 years ago

Yes, I have isolated one case.

Valgrind tells me this:

==8331== Invalid free() / delete / delete[] / realloc() ==8331== at 0x4C2CD5A: free (vg_replace_malloc.c:530) ==8331== by 0x44B3AC: free_cn_buffer_pool (pio_darray_int.c:1310) ==8331== by 0x4257C8: PIOc_finalize (pioc.c:881) ==8331== by 0x41A443: main (test_rearr.c:587) ==8331== Address 0x90d6040 is 0 bytes inside a block of size 33,554,432 free'd ==8331== at 0x4C2CD5A: free (vg_replace_malloc.c:530) ==8331== by 0x4307D3: brel (bget.c:1023) ==8331== by 0x41A119: test_compute_maxIObuffersize (test_rearr.c:361) ==8331== by 0x41A42B: main (test_rearr.c:543) ==8331== Block was alloc'd at ==8331== at 0x4C2BBAD: malloc (vg_replace_malloc.c:299) ==8331== by 0x446EAB: compute_buffer_init (pio_darray_int.c:46) ==8331== by 0x425292: PIOc_Init_Intracomm (pioc.c:727) ==8331== by 0x41A3F7: main (test_rearr.c:527)

jedwards4b commented 7 years ago

The free call in pio_darray_int is redundant. PR on the way.

jayeshkrishna commented 7 years ago

Where does CN_bpool get released (if we comment out free(CN_bpool))? We could be mixing brels and frees.

edhartnett commented 7 years ago

Yes, I took out that free, and got errors elsewhere.

Where is the other free happening?

dqwu commented 7 years ago

If that redundant free call is removed, only test_pioc will fail.

edhartnett commented 7 years ago

OK, the test_pioc.c failure can be resolved by adding a line of code that will turn off async testing for writing the netCDF decomposition file. Apparently there is a memory issue that I need to hunt down there, but that's no reason to hold back this fix.

I have a PR with the necessary changes. I will put it up...

jayeshkrishna commented 7 years ago

@dqwu : Did you try a valgrind test (to make sure that after removing the free() call we are not leaking mem)?

edhartnett commented 7 years ago

I did a valgrind test on test_pioc, not the rest.

dqwu commented 7 years ago

@jayeshkrishna I tested PR #879 on ANL workstation. Both LeakSanitizer and Valgrind builds show new leaks in some tests, including the first test (test_spmd). This is expected. After all, we have not been testing bget package since PR #122.

Should we merge PR #879 to develop/master first and continue to create a new issue for the leaks? Or should we fix these leaks in PR #879 (update the PR)?

jedwards4b commented 7 years ago

Please document these leaks with one or more new issues.

jayeshkrishna commented 7 years ago

My worry is that we are masking the issue by commenting the free (when we comment the free(CN_bpool) we don't encounter some issues in the code).

gold2718 commented 7 years ago

I think this issue illustrates the dangers of using preprocessor commands to create logic. I suggest using CMake to set a value (i.e.., true or false) and set a boolean parameter to that value. Then, replace all of the preprocessor 'logic' with tests of this parameter. The result is less uncompiled (or conditionally compiled) code and better test coverage.

dqwu commented 7 years ago

@jayeshkrishna We can look at one test (e.g. test_spmd) that has new leaks (apply PR #879 first), and see if we can find out a way to fix the leaks, without having to call "free(CN_bpool)". Maybe the leaks come from other sources. Valgrind or LeakSanitizer should provide more detailed information.

jedwards4b commented 7 years ago

@jayeshkrishna I don't think we are masking anything here - the valgrind output indicates that the memory was released in the brel call and the following free call is unneeded.

dqwu commented 7 years ago

@jayeshkrishna If we keep "free(CN_bpool)", we have 66 tests failed with segmentation faults. It seems that brel already freed it, and double free will cause undefined issues.

If we remove it, we only have 20 tests with new memory leaks. This is a much better result, and we can create a new issue to investigate and fix the leaks. 1 - test_spmd 7 - test_iosystem2_simple 8 - test_iosystem2_simple2 9 - test_iosystem2 10 - test_iosystem3_simple 11 - test_iosystem3_simple2 12 - test_iosystem3 25 - init_finialize_1_proc 26 - init_finialize_2_proc 27 - init_finalize_2_proc_with_args 28 - pio_file_simple_tests 29 - pio_file_fail 30 - ncdf_simple_tests 31 - ncdf_get_put_1proc 32 - ncdf_get_put_2proc 33 - ncdf_fail 34 - ncdf_inq 86 - pio_iosystem_tests 87 - pio_iosystem_tests2 88 - pio_iosystem_tests3

dqwu commented 7 years ago

@jayeshkrishna It is noticed that the 66 failed tests (call "free(CN_bpool)") and the 20 tests that have new leaks do not overlap.

It seems that due to some logic, for the 66 tests, free(CN_bpool) is already called by brel before, but for the 20 tests, it is not.

So we cannot simply keep or remove the call to "free(CN_bpool)". We need find out if that free was called before or not.

edhartnett commented 7 years ago

Yes I agree that we have not solved this issue yet...

jedwards4b commented 7 years ago

Okay - I agree and I don't see anything in bpoolrelease which would free CN_bpool, in fact CN_bpool is not even passed in.

dqwu commented 7 years ago

@jedwards4b It is passed with a function pointer (relfcn)

static void (*relfcn) _((void *buf)) = NULL;
...
void bectl(compact, acquire, release, pool_incr)
...
bectl(NULL, malloc, free, pio_cnbuffer_limit);
...
relfcn = release;
...
if (relfcn != NULL &&
    ((bufsize) b->bh.bsize) == (pool_len - sizeof(struct bhead))) {
    ...
    (*relfcn)(b);
}

The issue is that for the 66 tests, brel is called which implicitly calls free(CN_bpool), so we cannot call free(CN_bpool) explicitly later. For the 20 tests, brel is never called, so we need to call free(CN_bpool) explicitly to avoid memory leaks.

dqwu commented 7 years ago

@jayeshkrishna The issue is a little clearer now.

For some tests (66 are known), brel() is called, which implicitly calls free(CN_bpool), so we cannot call free twice.

However, for some other tests (20 are known), brel() is never called (still do not know why, we need to find the reason), so there is no issue of implicitly calling of free(CN_bpool).

jedwards4b commented 7 years ago

I'm now finding a different result looking at test_pioc.c - if I comment out the test_malloc_iodesc2 everything else seems to work with the free(CN_bpool) in place. I'm still digging but I think that the issue is elsewhere.

jayeshkrishna commented 7 years ago

ok, I might need to dig a little deeper but my take on this issue is this,

jayeshkrishna commented 7 years ago
bectl(NULL, malloc, bpool_free, pio_cnbuffer_limit);
jedwards4b commented 7 years ago

@jayeshkrishna I think that you've got it. I just tried your suggestion and all is good now.
All 90 tests pass for me.
Thanks

edhartnett commented 7 years ago

Oh, this is awesome. Thanks Jayesh. I will put up a PR with this fix.