Closed dqwu closed 7 years ago
@jayeshkrishna Please take a look at this issue. You can assign it back to me to update bget.c, so that we will not apply PIO_USE_MALLOC by default (unless we set that option when we run CMake). The issue is that a lot of unit tests will be affected.
I am looking at this now.
But why do the unit tests fail? Is bget failing in some places that malloc succeeded in? If so, why would it do that?
@edhartnett You can check how many unit tests will fail with the fix. Before PR #122 (config.h not included in bget.c yet), we can assume that we always use bget package in bget.c. After that, we can assume that we always use native malloc in bget.c The issue is that the unit tests have not been tested with bget package being used for a long time.
Good catch @dqwu , looks like the bget developers expected (or the code was written when ...) a simple makefile. We need to move from "#ifdef"s to "#if" like you suggested. I will also take a look.
I get that bget has not been tested for a long time.
But I still don't get why it fails. We have not made any changes to it at all, so it should work exactly the same...
@jayeshkrishna I noticed that ESMCI CIME migrated PIO2 recently. For previously used PIO2 (based on hash 17da322, April 15, 2016), the cime_developer test SEQ_Ln9.f19_g16_rx1.A passes. At that time, bget package is used in bget.c (config.h not included until PR #122, Oct 18, 2016) For currently used PIO2 (based on hash 9d8d2e1), SEQ_Ln9.f19_g16_rx1.A also passes. However, it is using native malloc in bget.c now. If we use bget package in bget.c, it will fail. After we update bget.c to use bget package by default, we can fix the failed unit tests in PIO2 first, then check if SEQ_Ln9.f19_g16_rx1.A is also fixed.
@edhartnett You can test with this patch to bget.c:
--- a/src/clib/bget.c
+++ b/src/clib/bget.c
@@ -400,7 +400,7 @@
#include <config.h>
#include <pio.h>
#include <pio_internal.h>
-#ifdef PIO_USE_MALLOC
+#if PIO_USE_MALLOC
#include <stdlib.h>
#endif
@@ -614,7 +614,7 @@ bufsize requested_size;
int compactseq = 0;
#endif
-#ifdef PIO_USE_MALLOC
+#if PIO_USE_MALLOC
// if(requested_size>maxsize){
// maxsize=requested_size;
// printf("%s %d %d\n",__FILE__,__LINE__,maxsize);
@@ -847,7 +847,7 @@ bufsize size;
bufsize osize; /* Old size of buffer */
struct bhead *b;
-#ifdef PIO_USE_MALLOC
+#if PIO_USE_MALLOC
return(realloc(buf, size));
#endif
if ((nbuf = bget(size)) == NULL) { /* Acquire new buffer */
@@ -882,7 +882,7 @@ void *buf;
{
struct bfhead *b, *bn;
-#ifdef PIO_USE_MALLOC
+#if PIO_USE_MALLOC
// printf("bget free %d %x\n",__LINE__,buf);
free(buf);
return;
On my local machine, 66 tests failed out of 90: 2 - test_rearr (Failed) 13 - test_pioc (Failed) 14 - test_pioc_unlim (Failed) 15 - test_pioc_putget (Failed) 16 - test_pioc_fill (Failed) 17 - test_darray (Failed) 18 - test_darray_multi (Failed) 19 - test_darray_multivar (Failed) 20 - test_darray_1d (Failed) 21 - test_darray_3d (Failed) 22 - test_darray_uneven (Failed) 23 - test_decomps (Failed) 24 - pio_unit_test (Failed) 35 - pio_rearr (Failed) 36 - pio_rearr_opts (Failed) 37 - pio_rearr_opts2_1p (Failed) 38 - pio_rearr_opts2_3p (Failed) 39 - pio_rearr_opts2_4p (Failed) 40 - pio_decomp_tests_1p (Failed) 41 - pio_decomp_tests_2p (Failed) 42 - pio_decomp_tests_3p (Failed) 43 - pio_decomp_tests_4p_1agg (Failed) 44 - pio_decomp_tests_4p_2agg (Failed) 45 - pio_decomp_tests_4p_3agg (Failed) 46 - pio_decomp_tests_4p_1iop (Failed) 47 - pio_decomp_tests_4p_2iop (Failed) 48 - pio_decomp_tests_4p_3iop (Failed) 49 - pio_decomp_tests_4p_2iop_2str (Failed) 50 - pio_decomp_tests_4p_2iop_1agg (Failed) 51 - pio_decomp_tests_1d_1p (Failed) 52 - pio_decomp_tests_1d_2p (Failed) 53 - pio_decomp_tests_1d_3p (Failed) 54 - pio_decomp_tests_1d_4p_1agg (Failed) 55 - pio_decomp_tests_1d_4p_2agg (Failed) 56 - pio_decomp_tests_1d_4p_3agg (Failed) 57 - pio_decomp_tests_1d_4p_1iop (Failed) 58 - pio_decomp_tests_1d_4p_2iop (Failed) 59 - pio_decomp_tests_1d_4p_3iop (Failed) 60 - pio_decomp_tests_1d_4p_2iop_2str (Failed) 61 - pio_decomp_tests_1d_4p_2iop_1agg (Failed) 62 - pio_decomp_tests_2d_1p (Failed) 63 - pio_decomp_tests_2d_2p (Failed) 64 - pio_decomp_tests_2d_3p (Failed) 65 - pio_decomp_tests_2d_4p_1agg (Failed) 66 - pio_decomp_tests_2d_4p_2agg (Failed) 67 - pio_decomp_tests_2d_4p_3agg (Failed) 68 - pio_decomp_tests_2d_4p_1iop (Failed) 69 - pio_decomp_tests_2d_4p_2iop (Failed) 70 - pio_decomp_tests_2d_4p_3iop (Failed) 71 - pio_decomp_tests_2d_4p_2iop_2str (Failed) 72 - pio_decomp_tests_2d_4p_2iop_1agg (Failed) 73 - pio_decomp_tests_3d_1p (Failed) 74 - pio_decomp_tests_3d_2p (Failed) 75 - pio_decomp_tests_3d_3p (Failed) 76 - pio_decomp_tests_3d_4p_1agg (Failed) 77 - pio_decomp_tests_3d_4p_2agg (Failed) 78 - pio_decomp_tests_3d_4p_3agg (Failed) 79 - pio_decomp_tests_3d_4p_1iop (Failed) 80 - pio_decomp_tests_3d_4p_2iop (Failed) 81 - pio_decomp_tests_3d_4p_3iop (Failed) 82 - pio_decomp_tests_3d_4p_2iop_2str (Failed) 83 - pio_decomp_tests_3d_4p_2iop_1agg (Failed) 84 - pio_decomp_frame_tests (Failed) 85 - pio_decomp_fillval (Failed) 89 - examplePio (Failed) 90 - example1 (Failed)
Yes, as I said, I get THAT they fail. I don't get WHY they fail.
They seem to fail when the final free is made on CN_bpool in PIOc_finalize().
@edhartnett Maybe we can pick one failed test (the simplest one if possible) to take a close look. It is possible that fixing that single test will fix most or all of the failed tests, if the cause are the same.
I agree - I've traced the problem to the PIOc_finalize, so fixing one is likely to fix them all. Still working on it but have a meeting now...
Yes, I have isolated one case.
Valgrind tells me this:
==8331== Invalid free() / delete / delete[] / realloc() ==8331== at 0x4C2CD5A: free (vg_replace_malloc.c:530) ==8331== by 0x44B3AC: free_cn_buffer_pool (pio_darray_int.c:1310) ==8331== by 0x4257C8: PIOc_finalize (pioc.c:881) ==8331== by 0x41A443: main (test_rearr.c:587) ==8331== Address 0x90d6040 is 0 bytes inside a block of size 33,554,432 free'd ==8331== at 0x4C2CD5A: free (vg_replace_malloc.c:530) ==8331== by 0x4307D3: brel (bget.c:1023) ==8331== by 0x41A119: test_compute_maxIObuffersize (test_rearr.c:361) ==8331== by 0x41A42B: main (test_rearr.c:543) ==8331== Block was alloc'd at ==8331== at 0x4C2BBAD: malloc (vg_replace_malloc.c:299) ==8331== by 0x446EAB: compute_buffer_init (pio_darray_int.c:46) ==8331== by 0x425292: PIOc_Init_Intracomm (pioc.c:727) ==8331== by 0x41A3F7: main (test_rearr.c:527)
The free call in pio_darray_int is redundant. PR on the way.
Where does CN_bpool get released (if we comment out free(CN_bpool))? We could be mixing brels and frees.
Yes, I took out that free, and got errors elsewhere.
Where is the other free happening?
If that redundant free call is removed, only test_pioc will fail.
OK, the test_pioc.c failure can be resolved by adding a line of code that will turn off async testing for writing the netCDF decomposition file. Apparently there is a memory issue that I need to hunt down there, but that's no reason to hold back this fix.
I have a PR with the necessary changes. I will put it up...
@dqwu : Did you try a valgrind test (to make sure that after removing the free() call we are not leaking mem)?
I did a valgrind test on test_pioc, not the rest.
@jayeshkrishna I tested PR #879 on ANL workstation. Both LeakSanitizer and Valgrind builds show new leaks in some tests, including the first test (test_spmd). This is expected. After all, we have not been testing bget package since PR #122.
Should we merge PR #879 to develop/master first and continue to create a new issue for the leaks? Or should we fix these leaks in PR #879 (update the PR)?
Please document these leaks with one or more new issues.
My worry is that we are masking the issue by commenting the free (when we comment the free(CN_bpool) we don't encounter some issues in the code).
I think this issue illustrates the dangers of using preprocessor commands to create logic. I suggest using CMake to set a value (i.e.., true or false) and set a boolean parameter to that value. Then, replace all of the preprocessor 'logic' with tests of this parameter. The result is less uncompiled (or conditionally compiled) code and better test coverage.
@jayeshkrishna We can look at one test (e.g. test_spmd) that has new leaks (apply PR #879 first), and see if we can find out a way to fix the leaks, without having to call "free(CN_bpool)". Maybe the leaks come from other sources. Valgrind or LeakSanitizer should provide more detailed information.
@jayeshkrishna I don't think we are masking anything here - the valgrind output indicates that the memory was released in the brel call and the following free call is unneeded.
@jayeshkrishna If we keep "free(CN_bpool)", we have 66 tests failed with segmentation faults. It seems that brel already freed it, and double free will cause undefined issues.
If we remove it, we only have 20 tests with new memory leaks. This is a much better result, and we can create a new issue to investigate and fix the leaks. 1 - test_spmd 7 - test_iosystem2_simple 8 - test_iosystem2_simple2 9 - test_iosystem2 10 - test_iosystem3_simple 11 - test_iosystem3_simple2 12 - test_iosystem3 25 - init_finialize_1_proc 26 - init_finialize_2_proc 27 - init_finalize_2_proc_with_args 28 - pio_file_simple_tests 29 - pio_file_fail 30 - ncdf_simple_tests 31 - ncdf_get_put_1proc 32 - ncdf_get_put_2proc 33 - ncdf_fail 34 - ncdf_inq 86 - pio_iosystem_tests 87 - pio_iosystem_tests2 88 - pio_iosystem_tests3
@jayeshkrishna It is noticed that the 66 failed tests (call "free(CN_bpool)") and the 20 tests that have new leaks do not overlap.
It seems that due to some logic, for the 66 tests, free(CN_bpool) is already called by brel before, but for the 20 tests, it is not.
So we cannot simply keep or remove the call to "free(CN_bpool)". We need find out if that free was called before or not.
Yes I agree that we have not solved this issue yet...
Okay - I agree and I don't see anything in bpoolrelease which would free CN_bpool, in fact CN_bpool is not even passed in.
@jedwards4b It is passed with a function pointer (relfcn)
static void (*relfcn) _((void *buf)) = NULL;
...
void bectl(compact, acquire, release, pool_incr)
...
bectl(NULL, malloc, free, pio_cnbuffer_limit);
...
relfcn = release;
...
if (relfcn != NULL &&
((bufsize) b->bh.bsize) == (pool_len - sizeof(struct bhead))) {
...
(*relfcn)(b);
}
The issue is that for the 66 tests, brel is called which implicitly calls free(CN_bpool), so we cannot call free(CN_bpool) explicitly later. For the 20 tests, brel is never called, so we need to call free(CN_bpool) explicitly to avoid memory leaks.
@jayeshkrishna The issue is a little clearer now.
For some tests (66 are known), brel() is called, which implicitly calls free(CN_bpool), so we cannot call free twice.
However, for some other tests (20 are known), brel() is never called (still do not know why, we need to find the reason), so there is no issue of implicitly calling of free(CN_bpool).
I'm now finding a different result looking at test_pioc.c - if I comment out the test_malloc_iodesc2 everything else seems to work with the free(CN_bpool) in place. I'm still digging but I think that the issue is elsewhere.
ok, I might need to dig a little deeper but my take on this issue is this,
void bpool_free(void *p)
{
free(p);
if(p == CN_bpool){
CN_bpool = NULL;
}
}
bectl(NULL, malloc, bpool_free, pio_cnbuffer_limit);
@jayeshkrishna I think that you've got it. I just tried your suggestion and all is good now.
All 90 tests pass for me.
Thanks
Oh, this is awesome. Thanks Jayesh. I will put up a PR with this fix.
The option PIO_USE_MALLOC is turned off by default:
option (PIO_USE_MALLOC "Use native malloc (instead of bget package)" OFF)
Based on this option, in config.h we will have a macro defined:
#define PIO_USE_MALLOC 0
or#define PIO_USE_MALLOC 1
However, in bget.c, this macro is incorrectly tested (it is always defined, no matter the value is 0 or 1)
In pio_darray_int.c, the macro is correctly tested (check whether the value of the macro is 0 or 1, instead of whether it is defined or not)
[Some comments] 1) Before PR #122, config.h is not included in bget.c, so the macro is not defined at that time in that source file. 2) For now, suppose the option PIO_USE_MALLOC is OFF by default when we run CMake, in bget.c native malloc is always used. However, in pio_darray_int.c, native malloc is not used. This inconsistency might cause some problems. 3) This issue affects test coverage. As we can see in bget.c, bget memory management will never be used 4) This means we have not been testing bget memory management since PR #122.
[How to fix] The fix is straightforward, just update bget.c to test the value of PIO_USE_MALLOC as in pio_darray_int.c
[Caution] After the fix is applied, a lot of unit tests will fail. This is expected, as we have not tested them since PR #122. Also, some cime_developer tests will fail if we apply this updated PIO2 to ESMCI CIME. We should continue to fix the failing tests.