`test_load_balancer_keep_last_elm` test fails occasionally #1395

cz4rs commented 3 years ago

Describe the bug

The test `vt:*/TestLoadBalancer.test_load_balancer_keep_last_elm/*_proc_2` fails randomly in CI.

This can be reproduced locally with: ctest -R TestLoadBalancer.test_load_balancer_keep_last_elm --output-on-failure --repeat-until-fail 100


nlslatt commented 3 years ago

I have concerns about this chunk of code:

    max_bytes*load.size(),max_bytes,nullptr,[&](NodeType node, void* ptr){
      auto ptr_out = reinterpret_cast<GreedyLBTypes::ObjIDType*>(ptr);
      auto const& proc = node_transfer[node];
      auto const& rec_size = proc.size();
      ptr_out->id = rec_size;
      for (size_t i = 0; i < rec_size; i++) {
        *(ptr_out + i + 1) = proc[i];

I'm concerned that I may have borked GreedyLB when I removed temporary IDs from the load balancers. Can @lifflander or @PhilMiller take a look and let me know how they think this was supposed to work?
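For anyone following along, here is a minimal sketch of the layout the lambda above appears to write: the first slot carries the record count in its `id` field, and the records follow in slots 1 through n. `ObjID`, `packRecords`, and `unpackRecords` are hypothetical stand-ins, not the vt types or API, and the sketch assumes the caller knows the buffer capacity in slots.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for GreedyLBTypes::ObjIDType; the real vt type differs.
struct ObjID {
  std::uint64_t id;
  int home;
  int curr;
};

// Writes a count-prefixed run of records into buf: slot 0 stores the record
// count in its id field, slots 1..n store the records themselves.
// Returns false if the buffer cannot hold the header plus all records.
bool packRecords(
  std::vector<ObjID> const& records, ObjID* buf, std::size_t capacity
) {
  if (records.size() >= capacity) {
    return false;
  }
  // Zero the unused header fields so no uninitialized bytes are sent.
  buf[0] = ObjID{static_cast<std::uint64_t>(records.size()), 0, 0};
  for (std::size_t i = 0; i < records.size(); i++) {
    buf[i + 1] = records[i];
  }
  return true;
}

// Reads the count from slot 0 and copies out that many records.
// Returns an empty vector if the count does not fit in the buffer.
std::vector<ObjID> unpackRecords(ObjID const* buf, std::size_t capacity) {
  std::vector<ObjID> out;
  if (capacity == 0) {
    return out;
  }
  auto const n = static_cast<std::size_t>(buf[0].id);
  if (n >= capacity) {
    return out;
  }
  out.assign(buf + 1, buf + 1 + n);
  return out;
}
```

With this layout the receiver must read the count from slot 0 before touching any record; reading slot 0 as if it were a record would give a nonsensical object ID and node.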

nlslatt commented 3 years ago

@lifflander and I discussed this and decided this chunk of code is correct.

nlslatt commented 3 years ago

This is a lot harder to reproduce on my Mac than I expected, given how often it seems to show up in CI. On my Mac it shows up most readily with 4 ranks. Maybe it would show up more frequently in Docker. You can run only this GreedyLB test with something like the command below, but I recommend redirecting the output to a file.

mpiexec -n 4 ./tests/collection_extended --gtest_filter=*/TestLoadBalancer.test_load_balancer_keep_last_elm/4 --gtest_repeat=1000

lifflander commented 3 years ago

@nlslatt I actually just commented out the other LBs to get it to reproduce easily, but I should have thought of that.

lifflander commented 3 years ago

So I've enabled AddressSanitizer and it comes out clean even when the test fails.

However, I was able to print the data and it indeed looks corrupted:

vt: [0] (t) lb: LBManager::finishedLB, phase=2
vt: [0] (t) lb: BaseLB: Statistic=P_l:  max=3.82, min=3.35, sum=7.17, avg=3.59, var=0.05, stdev=0.23, nproc=2, cardinality=2 skewness=0.00, kurtosis=-2.75, npr=2, imb=0.06, num_stats=1
vt: [0] (t) lb: BaseLB: Statistic=O_l:  max=0.00, min=0.00, sum=0.01, avg=0.00, var=0.00, stdev=0.00, nproc=64, cardinality=64 skewness=2.71, kurtosis=7.53, npr=64, imb=6.42, num_stats=2
vt: [0] (t) lb: loadStats: load=3.35, total=7.17, avg=3.59, I=0.06,should_lb=true, auto=true, threshold=0.9350528485009639
vt: [1] (t) lb: migrateObjectTo, obj_id=25770721283, home=-24827, from=1, to=25687, found=false
vt: [1] (t) lb: transferMigrations: obj_id=(25770721283,-24827,1), to_node=25687
vt: [0] (t) lb: BaseLB: Statistic=P_l:  max=3.35, min=2.99, sum=6.34, avg=3.17, var=0.03, stdev=0.18, nproc=2, cardinality=2 skewness=0.00, kurtosis=-2.75, npr=2, imb=0.06, num_stats=2