braddr / d-tester

Automated testing for github projects.
http://d.puremagic.com/test-results/

Mitigate Low Memory Intermittent Failures #69

Closed. marler8997 closed this issue 6 years ago.

marler8997 commented 6 years ago

On Linux we have started seeing processes being killed intermittently in the "test phobos" stage (see https://github.com/braddr/d-tester/issues/68). The consensus is that this is being caused by the OOM killer. I suggest tackling the problem from both ends: determining whether we can decrease memory usage in the tests themselves, and also whether we can configure the autotester to give phobos some more "breathing room". There may be a way to configure the OOM killer, or possibly add more swap space, so we aren't always on the edge of intermittent failure. It might also be mitigated through other mechanisms, such as limiting the number of parallel threads allowed to run at one time, or restricting the number of parallel jobs run on a single machine. I'm not familiar with how jobs and threads are distributed, so I can't weigh in on which of these is feasible; I'd ask those more familiar with the setup to consider what options we have.
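
For reference, here is a minimal sketch of the kernel-side knob mentioned above, using the standard Linux `/proc/<pid>/oom_score_adj` interface. Whether the autotester could actually use this is an open question (lowering the value needs root), and the function name and example command are illustrative only, not part of d-tester:

```python
import subprocess

def run_protected(cmd, oom_score_adj=-500):
    """Start a test process and make the OOM killer less likely to pick it.

    /proc/<pid>/oom_score_adj ranges from -1000 (never kill) to 1000; lowering
    it below the default of 0 requires root (CAP_SYS_RESOURCE). This is a
    sketch of the kernel interface, not something the autotester does today.
    """
    proc = subprocess.Popen(cmd)
    try:
        with open(f"/proc/{proc.pid}/oom_score_adj", "w") as f:
            f.write(str(oom_score_adj))
    except OSError:
        pass  # unprivileged or process already gone: leave the default in place
    return proc.wait()

# Illustrative usage only; the real test command is whatever the tester runs.
# run_protected(["make", "-C", "phobos", "unittest"])
```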

braddr commented 6 years ago

I disagree. The amount of memory on each tester system hasn't changed. They have a minimum of 2G each, some much more. That the testers occasionally run out of memory is the fault of the system under test, not the system doing the testing. This is a prototypical regression situation. The effect extends beyond the tester to the developers using the compiler. Take the reports for what they are: a problem to be solved in the system under test.

marler8997 commented 6 years ago

> I disagree. The amount of memory on each tester system hasn't changed.

Well, the problem seems to be about more than just available memory. It could be that too many threads are being allowed to run in parallel. However, I can't really determine all the factors that play into it, since I'm not familiar with the implementation. As the person most familiar with it, I was hoping you could shed some light and give some ideas as to how we can fix this. I'm not saying the problem is in phobos or in the autotester; I'm just trying to understand where we can find a solution.

braddr commented 6 years ago

Every tester has at least 2 cores and 2 gigs of memory. Every tester runs with -J#cores (one of them might run with N-2 since it's a shared box, but it has 48G of RAM). Every box runs a single pull at a time. Every box has some amount of swap (I'm not sure how much on each without doing a survey).
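
For readers unfamiliar with that setup, a minimal sketch of the job-count rule described above; the function name and the shared-box flag are illustrative, not taken from the d-tester sources:

```python
import os

def pick_job_count(shared_box=False):
    """Mirror the rule described above: one job per core, minus two on the
    shared 48G box. Illustrative only, not the tester's actual code."""
    cores = os.cpu_count() or 1
    jobs = cores - 2 if shared_box else cores
    return max(1, jobs)
```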

While I agree that some tuning might let things squeak by better, what I'm trying to say is that the situation has gotten worse and the effects of that extend beyond just the tester. Changing the test environment, presuming it helps at all, doesn't change the fact that the code base has gotten worse and demands more peak resources from everyone using it. If you want to mitigate it, and there's just one test that fails, the better path is perhaps to disable that test.

I don't consider reducing parallelism a viable path; the system is already fairly slow, and that would cost 50-100% additional time.

marler8997 commented 6 years ago

This is very helpful information, thanks. The solution you've mentioned is to improve the code base, and I agree that it should be explored. I also take back what I said about configuring the OOM killer; you should be able to build and test phobos on a 2GB machine without having to tweak the kernel. So I agree, the autotester is not in the wrong here.

That being said, maybe you have ideas to help phobos fix this problem? It appears the problem we are having is that as phobos grows, the memory required to run the tests grows with it. That is a problem with the test mechanisms, orthogonal to problems in the code base itself: even if phobos doubled in size, it shouldn't require more memory to test it (I say "require" because not having that memory causes the test run to crash).
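
One way to keep peak memory roughly constant as the library grows is to compile and run the unittests one module at a time rather than in a single large invocation. The sketch below only illustrates that idea; the module list, the dmd command line, and the driver function are placeholders, not how phobos's makefile actually works:

```python
import subprocess

# Hypothetical per-module test driver: compiling and running each module's
# unittests separately keeps peak compiler/test memory roughly constant as the
# library grows, at the cost of some repeated work.
MODULES = ["std/algorithm", "std/range", "std/format"]  # placeholder list

def test_module(module):
    # Placeholder command; the real phobos build uses its own makefile rules.
    cmd = ["dmd", "-unittest", "-main", "-run", f"{module}.d"]
    return subprocess.run(cmd).returncode

def run_all():
    # Return the list of modules whose unittests failed.
    return [m for m in MODULES if test_module(m) != 0]
```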

I understand we don't want to reduce parallelism... maybe we could monitor the test processes, detect OOM kills, and restart the affected test later? I don't know, I'm just brainstorming here.
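
A minimal sketch of that brainstorm, under the assumption that an OOM kill shows up as the test process dying with SIGKILL; nothing here reflects existing autotester behavior:

```python
import signal
import subprocess
import time

def run_with_oom_retry(cmd, retries=2, delay=60):
    """Retry a test command that appears to have been OOM-killed.

    The OOM killer delivers SIGKILL, so a return code of -SIGKILL is used as a
    rough indicator; confirming via the kernel log would be more exact. This is
    a sketch, not how d-tester currently schedules work.
    """
    for attempt in range(retries + 1):
        result = subprocess.run(cmd)
        if result.returncode != -signal.SIGKILL:
            return result.returncode  # normal pass or genuine failure
        if attempt < retries:
            time.sleep(delay)  # wait for memory pressure to subside
    return result.returncode
```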