compdyn / partmc

Particle-resolved stochastic atmospheric aerosol model
http://lagrange.mechse.illinois.edu/partmc/
GNU General Public License v2.0

Automatically retry failing tests up to 10 times each #154

Closed mwest1066 closed 3 years ago

mwest1066 commented 3 years ago

PartMC simulations are inherently stochastic, which makes it hard to write tests that never fail. This has caused us problems with CI because one or more tests will often fail at random, causing the entire CI run to fail. We could make the tests less sensitive, but this would also make it harder for them to detect actual bugs.

This PR adds a wrapper script, test/run_test_directory.sh, which runs all the tests in a subdirectory and automatically retries them if any fail, up to a maximum of 10 times. This should hopefully reduce the spurious failures we see in CI.
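
To make the idea concrete, the wrapper works along these lines (this is a minimal sketch of the approach only, not the actual contents of test/run_test_directory.sh; the argument handling and the test_*.sh naming pattern are assumptions):

#!/bin/bash
# Sketch of the directory-level retry idea (not the real test/run_test_directory.sh).
# Run every test_*.sh script in the given subdirectory; if any of them fail,
# re-run the whole directory, up to a maximum of 10 attempts.
dir="$1"
for attempt in {1..10}; do
    all_passed=true
    for t in "$dir"/test_*.sh; do
        # run each test from inside its directory so relative paths like ../../partmc work
        ( cd "$dir" && ./"$(basename "$t")" ) || all_passed=false
    done
    if $all_passed; then
        echo "all tests in $dir passed (attempt $attempt)"
        exit 0
    fi
    echo "attempt $attempt had failures, re-running all tests in $dir ..."
done
echo "tests in $dir still failing after 10 attempts"
exit 1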

mwest1066 commented 3 years ago

After I wrote this PR I noticed that @mattldawson had PR #64 to do the same thing. Now I feel really bad for having neglected that PR for so long. Sorry Matt!

One key difference between #64 and this PR is that #64 retried each test one-by-one. For example, if test_loss_06 failed then it would be immediately retried. While this is more efficient, it assumes the tests are independent. Unfortunately many of our tests are dependent on earlier tests. For example, test_loss_06 is dependent on test_loss_04: https://github.com/compdyn/partmc/blob/c89d2b43f81ccbc6381488a34ce94afdd4a29198/CMakeLists.txt#L194

This means that the test_loss_06 failure might actually be due to bad data being generated by test_loss_04 (even if test_loss_04 itself passed), so just re-running test_loss_06 won't be enough to fix the problem. Ideally we would use the dependency information in CMakeLists.txt to re-run dependent tests on failures.

However, I couldn't think of a simple way to do this without writing a much more complicated test-runner. Instead, this PR runs all tests in a given directory (e.g., all the loss/test_loss_XX.sh tests) and checks to see whether any of them failed. If there were any failures, then it re-runs the entire directory again. This ensures that we will re-run test_loss_04 after test_loss_06 fails, for example.

mattldawson commented 3 years ago

Hi @mwest1066, no problem - I forgot about that PR. But, I did add logic to the tests to re-run the needed parts of previous tests if a test fails. For example, in test_loss_6.sh:

    if ! ../../extract_aero_size --mass --dmin 1e-9 --dmax 1e-3 --nbin 160 out/loss_part_volume_0001 || \
       ! ../../extract_sectional_aero_size --mass out/loss_exact_volume || \
       ! ../../numeric_diff --by col --rel-tol 0.3 out/loss_exact_volume_aero_size_mass.txt out/loss_part_volume_0001_aero_size_mass.txt; then
        echo Failure "$counter"
        if [ "$counter" -gt 10 ]
        then
            echo FAIL
            exit 1
        fi
        echo retrying...
        if ! ../../partmc run_constant_part.spec; then continue; fi
        if ! ../../partmc run_constant_exact.spec; then continue; fi
        if ! ../../partmc run_volume_part.spec; then continue; fi
        if ! ../../partmc run_volume_exact.spec; then continue; fi
    else
        echo PASS
        exit 0
    fi
    ((counter++))
done

At the time, I wasn't sure if I caught all the dependencies, but this has been working in the chem_mod for quite a while with no failures, and it doesn't take too long to run.
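
For readers of the excerpt above: it is the body of a retry loop, and the enclosing structure looks roughly like the sketch below (the loop header and counter initialization are assumed here; ./check_results.sh and ./regenerate_data.sh are hypothetical stand-ins for the extract/numeric_diff comparisons and the partmc re-runs):

#!/bin/bash
# Sketch of the per-test retry pattern from #64 (structure assumed from the excerpt above).
counter=1
while true
do
    if ! ./check_results.sh; then
        echo Failure "$counter"
        if [ "$counter" -gt 10 ]; then
            echo FAIL
            exit 1
        fi
        echo retrying...
        # regenerate the input data for this test and its dependencies
        if ! ./regenerate_data.sh; then continue; fi
    else
        echo PASS
        exit 0
    fi
    ((counter++))
done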

mwest1066 commented 3 years ago

Hi @mwest1066, no problem - I forgot about that PR. But, I did add logic to the tests to re-run the needed parts of previous tests if a test fails. For example, in test_loss_6.sh:

Oh nice! I completely missed this when I read your PR. Double apologies now! 😄 I think you did catch all of the dependencies.

What do you think about the approaches of #64 (explicitly put the dependencies into each test) versus #154 (re-run the whole directory of tests on failure)? I think that #64 is more efficient on a failure, because it will re-run exactly the tests that are needed, while #154 keeps the test code shorter and simpler.

mattldawson commented 3 years ago

One other very small advantage of #64 is that it makes it a little clearer what dependencies exist among the tests. This took me a little bit of time to track down at first. But either option seems good to me.

mwest1066 commented 3 years ago

What do you think about the approaches of #64 (explicitly put the dependencies into each test) versus #154 (re-run the whole directory of tests on failure)?

(answering my own question)

I think I have a slight preference for keeping the retry logic outside of the individual test scripts. When running tests by hand (during development or debugging), it's nice to be able to explicitly run the test just once, without having the retry logic kick in on every failure. Of course we could add a command line argument to the test files so that we could skip the retry logic, but that's even more complexity 😄.
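
(Purely as an illustration of what that would mean, and not something either PR implements, such a flag could be handled at the top of each test script like this:)

#!/bin/bash
# Hypothetical sketch only: neither #64 nor #154 adds this flag.
max_attempts=10
if [ "$1" = "--no-retry" ]; then
    # run the test exactly once, useful when debugging by hand
    max_attempts=1
fi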

On the other hand, with #154 we don't have the dependency relationships between the tests explicitly recorded anywhere, so if you are running tests by hand then you need to know a little bit more about what you are doing if there are failures.

mwest1066 commented 3 years ago

One other very small advantage of #64 is that it makes it a little clearer what dependencies exist among the tests. This took me a little bit of time to track down at first. But either option seems good to me.

Ha, yes, I just made this same comment above 😄

I agree that either option is much better than the current state of affairs!

mattldawson commented 3 years ago

We could also just add the dependencies as comments in the scripts and leave out the retry logic.
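
(As a sketch of that, a header comment along these lines in test_loss_06.sh would record the dependency; the exact wording is just illustrative:)

#!/bin/bash
# test_loss_06.sh
# NOTE: this test reads output produced by test_loss_04.sh, so if it fails
# at random, re-run test_loss_04.sh first to regenerate out/ and then retry.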

I have noticed that when running the full test suite you can tell when the retries happen, because it takes longer. I wonder what the difference in run time is between these two approaches. I don't have a good feel for how long the parts of earlier tests that aren't needed by later tests take to run.

mattldawson commented 3 years ago

I think I have a slight preference for keeping the retry logic outside of the individual test scripts. When running tests by hand (during development or debugging), it's nice to be able to explicitly run the test just once, without having the retry logic kick in on every failure.

But, how would you know if the test failed because of a bug or because of these random failures without the retry logic?

mwest1066 commented 3 years ago

We could also just add the dependencies as comments in the scripts and leave out the retry logic.

Yep, that's quite viable.

I have noticed that when running the full test suite you can tell when the retries happen, because it takes longer. I wonder what the difference in run time is between these two approaches. I don't have a good feel for how long the parts of earlier tests that aren't needed by later tests take to run.

I think that the run time difference is normally small. Most of the tests are set up so that test_1 runs the (slow) partmc simulations to generate output data, and test_2, test_3, and so on each run a fast analysis (extract and numeric_diff comparisons) on that output.

In this setup, test_1 is the slow one and all the others are fast, but if any of the later tests fail then we need to re-run test_1 to generate the data again.
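
Concretely, for the loss tests the pattern is roughly the following (the command lines are taken from the test_loss_6.sh excerpt above; exactly which numbered test script runs which command is simplified here):

# data-generating step (slow): run the stochastic and exact simulations
../../partmc run_volume_part.spec
../../partmc run_volume_exact.spec

# analysis step (fast): post-process that output and compare against the reference
../../extract_aero_size --mass --dmin 1e-9 --dmax 1e-3 --nbin 160 out/loss_part_volume_0001
../../extract_sectional_aero_size --mass out/loss_exact_volume
../../numeric_diff --by col --rel-tol 0.3 out/loss_exact_volume_aero_size_mass.txt out/loss_part_volume_0001_aero_size_mass.txt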

mattldawson commented 3 years ago

In this setup, test_1 is the slow one and all the others are fast, but if any of the later tests fail then we need to re-run test_1 to generate the data again.

That makes sense - I could tell what was being run, but I didn't have this higher-level understanding of what was being done at each step. So then I like #154 (with maybe the dependencies listed somewhere for us newbies).

mwest1066 commented 3 years ago

I think I have a slight preference for keeping the retry logic outside of the individual test scripts. When running tests by hand (during development or debugging), it's nice to be able to explicitly run the test just once, without having the retry logic kick in on every failure.

But, how would you know if the test failed because of a bug or because of these random failures without the retry logic?

That's a fair question. I think that in practice what I do is re-run all of the tests. I frequently run command lines like:

  • for f in test_loss*.sh ; do ./$f ; done
  • for f in plot_*.gnuplot ; do gnuplot $f ; done

mattldawson commented 3 years ago

I think I have a slight preference for keeping the retry logic outside of the individual test scripts. When running tests by hand (during development or debugging), it's nice to be able to explicitly run the test just once, without having the retry logic kick in on every failure.

But, how would you know if the test failed because of a bug or because of these random failures without the retry logic?

That's a fair question. I think that in practice what I do is re-run all of the tests. I frequently run command lines like:

  • for f in test_loss*.sh ; do ./$f ; done
  • for f in plot_*.gnuplot ; do gnuplot $f ; done

I'm very much ok with #154, but if I see a hair I have to split it. So, in this scenario, if the retry logic is in each script, this should fail only if there is a bug. If the retry logic is not in each script and this fails, do you manually rerun until it doesn't fail or you are convinced it's failing because of a bug? Is this preferred because you want the output of each test to be based on the same random input data?

mwest1066 commented 3 years ago

I'm very much ok with #154, but if I see a hair I have to split it.

😄

So, in this scenario, if the retry logic is in each script, this should fail only if there is a bug. If the retry logic is not in each script and this fails, do you manually rerun until it doesn't fail or you are convinced it's failing because of a bug? Is this preferred because you want the output of each test to be based on the same random input data?

That's right. When there is a failure I want to run the data-generating test scripts, then all the analysis test scripts, then all the plotting scripts, and I want them to all be using the exact same random data so that the plots will give me an insight into what happened with this simulation run. For example, I don't want the number distributions output by test_2 to be showing data from a different run than the mass distributions output by test_3.

mwest1066 commented 3 years ago

@mattldawson thanks for the great discussion. I feel like we are on the same page that either #64 or #154 will be good. Sorry again for duplicating your work!

I'm going to merge #154 now because I think it fits just slightly better with the way I develop/debug tests.