dpark01 closed this issue 6 years ago
i've seen this, have fixes, will check in soon.
Is it a race condition of multiple threads trying to remove the same shared object file? Guessing this call is to blame? https://github.com/broadinstitute/viral-ngs/blob/master/tools/kraken.py#L168
Oh... maybe the Picard pipes aren't being closed properly before end-of-method and the test fixture blows it away before Picard was done?
I think it has to do with the tmpdir_function and tmpdir_module fixtures in conftest.py. Reading the docs of tmpdir_factory, that way of creating temp dirs is less robust than mkdtemp. Also, the rmtree calls don't check whether the tree still exists before trying to remove it.
Maybe we should add a .poll() on each of the fastq_pipes after this line: https://github.com/broadinstitute/viral-ngs/blob/master/tools/kraken.py#L173
maybe there are two separate issues here :)
Yes, it could be all of the above. The reason I'm thinking @tomkinsc's highlighted line of code might be at play is that if you google around for libgkl_compression.so, it's all Picard/GATK-related. So I think the shutil.rmtree is happening concurrently with Picard trying to use some files in the directory tree being blown away.
The simplest fix is to just check whether the tree exists before shutil.rmtree(new_tempdir), but I think it's better to rewrite it to use mkdtemp.
I bet both bits of code have issues that only really present as noticeable problems in combination.
I like using mkdtemp, but I think it's still good to rmtree at the end of a test fixture because, as the test suite increases in size, you run out of local tmp space if you don't clean up after big tests. Can't we just run shutil.rmtree(new_tempdir, ignore_errors=True)? (Equivalent to rm -rf.)
BTW, I'm not sure if the problem is that the directory doesn't exist, vs. a file that was originally in the directory tree (when rmtree started walking the tree) disappearing (because it was a Picard temp file) before rmtree got around to unlinking that particular file. So rm -rf would make the test succeed, but it's probably still a good idea to make sure kraken doesn't return from its method invocation before all its pipes are closed out and cleaned up.
Maybe util.file.fifo() should poll the pipes after yield, before unlinking them? And/or add a small time delay before unlinking the pipes and returning.
Actually, I just realized that the kraken code was using util.file.fifo, which makes things a bit tricky. These aren't really pipes; they're named pipes (fifos) that behave like files. There's no way to poll or close these manually that I know of. Because our Python code never opened them, we just passed the strings (filenames) to other programs that handled all that.
The Picard side of the pipe can happily play with standard python pipe-fitting (it can interleave the fastqs and is often used that way to pipe to bwa). It's the kraken side that has issues. Maybe we can revert the code here to use temp files again instead of fifos until we eventually tackle the larger question of how to clean up the pipe-fitting while also adding unpaired read support (#820). Then again, I'm not quite sure why reverting to temp files would solve the race condition... hm.
Or time delay...
Adding a 2s delay at the end of metagenomics.diamond() does fix one transient bug in my experiments on AWS.
Wow, 2s is much longer than I would've tried (maybe 100ms).
Oh wait! We're calling picard.execute(background=True) but ignoring the return value. That return value is a Popen object that we can do the proper if picard_ps.poll(): raise subprocess.CalledProcessError(...) thing with. The fact that we're ignoring it means there's probably nothing really forcing it to close properly...
I’ve had nondeterministic test failures in the past but never this specific one. I think it’s highly likely to be related to some combination of xdist, temp directory removal, and having background processes. I don’t think the named fifos should be causing problems because they are essentially real files to the program as long as they don’t try to seek or tell.
I’ve tried in the past to add ps.wait commands to finish up all the background processes but it didn’t seem to have any effect. Regardless, adding it in should not cause any harm.
Mainly trying to glue together std streams of subprocesses through python seemed to cause the most issues, so that’s something that I try to avoid and use named fifos instead.
Sometimes Travis fails, sometimes it succeeds, on this particular part of the test suite. Probably some race condition as xdist parallelizes the tests.
Mysteries include:
- libgkl_compression2251932993111688424.so: why is the filename non-deterministic?
- Why is os.unlink trying to use a compression library to delete files?