Explicitly close some file handles

atoomic commented 3 years ago

After updating to v1.000038 from v1.000000 we can start noticing some Too many open files issues. I did a quick review of the change from v1.000000..v1.000038 and check all new introduced ~open and add an extra explicit close for them

exodist commented 3 years ago

Can you explain to me why this is necessary, with exception of $wfh in one of the files, most of these look like the scope should end the filehandle anyway, so they seem unnecessary. Am I wrong about when filehandles are cleaed up, is it at sub exit instead of scope exit?

atoomic commented 3 years ago

@exodist IMO the fh should also close when exiting the scope. So yes most of them are not necessary. As updated in the description, after updating to v1.000038 from v1.000000 we can start noticing some Too many open files issues. I did a quick pass to check all added open during these iterations and ensure they are close as soon as possible.

I agree some close are not necessary, but I also realized that the existing code was doing a pretty good job of using explicit close when needed, so I decided to add them too.

exodist commented 3 years ago

ah, that makes sense. Did this patch solve your "too many handles" issue?

atoomic commented 3 years ago

I'm currently working on this right now, the answer is not obvious as this is a flapping issue... that we start noticing since the update... so might need a few extra days to confirm if it fixes it or not

exodist commented 3 years ago

ok, either way I have no issues with this patch, I think it should be a no-op, explicit close is fine though too. But if you are still working on the problem I will keep it open for any additional patches you may want to attach.

atoomic commented 3 years ago

this is not good enough, I can still see some random issues while running our test suite need to keep investigating

toddr commented 3 years ago

I wonder if there's a refcount issue introduced recently that might be the source of the problem. is something keeping track of dead handles in a way that's maybe not getting freed?

FTR we have about 4000 test files and are running at -j12

toddr commented 3 years ago

just got:

19:31:42  Could not open events dir: Too many open files at Test2/Harness/Collector/JobDir.pm line 436, <$fh> line 1.

toddr commented 3 years ago

Actually the problem happened on 1.000000. So this may go back to a problem between 0.999009 and 1.000000

exodist commented 3 years ago

Are open files limited by user, process, or system?

toddr commented 3 years ago

On most Linux systems, the default for compiled binaries is a max of 1024 file handles per process.

toddr commented 3 years ago

Don't worry about this for now. I have a way to find the root cause on monday. it should be easy to detect. I just need to check /proc/$$/fd and find out which parent yath process is hoarding file descriptors since they're being passed to the children on fork.

toddr commented 3 years ago

I assume if I can tell you what files aren't closing, then it'll be easy to determine the root cause?

exodist commented 3 years ago

Potentially. I would focus on the collector and the runner, they are the ones that need the most filehandles. The runner creates the handles for the children when it spawns them, but should let them go right after. The collector opens all the files produced by tests, but should let them go when a test completes.

My hunch is that either the collector is not letting go of file handles when it is done with them (unexpected references kept?), Or with 12 processes running there is a combination of 12 of your tests that combined with what yath uses result in too many files. Maybe even 1 or 2 filehandle heavy tests?

atoomic commented 3 years ago

Thanks @exodist for these details. This is an issue from the collector when we run too many tests / too fast... The collector, just try to open every files without any limits...

We could limit the number of jobs opened to avoid(/fix) the issue Note that when using retry the collector is also keeping open FH... this is probably something we can ignore for now...

Making sure sub jobs stop polling from the queue when having for example 200 jobs fixes the issue. IMO this is fine to have that limit hard coded to 200 in the collector. This is not making the test run slower, just limiting the number of tests we can read at the same time... and as it s for human eyes... I m not sure I can catch up with 200 jobs lol

atoomic commented 3 years ago

The real fix is #203, this is just good to have

Test-More / Test2-Harness

Explicitly close some file handles #201