broadinstitute / viral-ngs

Viral genomics analysis pipelines

Non-deterministic failure of test_kraken_multi teardown #839

Closed · dpark01 closed 6 years ago

dpark01 commented 6 years ago

Sometimes Travis fails, sometimes it succeeds, on this particular part of the test suite. Probably some race condition as xdist parallelizes the tests.

Mysteries include:

_______ ERROR at teardown of test_kraken_multi[TestMetagenomicsViralMix] _______
[gw1] linux -- Python 3.6.5 /home/travis/build/broadinstitute/viral-ngs/tools/conda-tools/default/bin/python
def reset():
>       shutil.rmtree(new_tempdir)
conftest.py:64: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tools/conda-tools/default/lib/python3.6/shutil.py:480: in rmtree
    _rmtree_safe_fd(fd, path, onerror)
tools/conda-tools/default/lib/python3.6/shutil.py:438: in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
topfd = 11, path = '/tmp/pytest-of-travis/pytest-0/popen-gw1/test-function0'
onerror = <function rmtree.<locals>.onerror at 0x7f004cc5f1e0>
    def _rmtree_safe_fd(topfd, path, onerror):
        names = []
        try:
            names = os.listdir(topfd)
        except OSError as err:
            err.filename = path
            onerror(os.listdir, path, sys.exc_info())
        for name in names:
            fullname = os.path.join(path, name)
            try:
                orig_st = os.stat(name, dir_fd=topfd, follow_symlinks=False)
                mode = orig_st.st_mode
            except OSError:
                mode = 0
            if stat.S_ISDIR(mode):
                try:
                    dirfd = os.open(name, os.O_RDONLY, dir_fd=topfd)
                except OSError:
                    onerror(os.open, fullname, sys.exc_info())
                else:
                    try:
                        if os.path.samestat(orig_st, os.fstat(dirfd)):
                            _rmtree_safe_fd(dirfd, fullname, onerror)
                            try:
                                os.rmdir(name, dir_fd=topfd)
                            except OSError:
                                onerror(os.rmdir, fullname, sys.exc_info())
                        else:
                            try:
                                # This can only happen if someone replaces
                                # a directory with a symlink after the call to
                                # stat.S_ISDIR above.
                                raise OSError("Cannot call rmtree on a symbolic "
                                              "link")
                            except OSError:
                                onerror(os.path.islink, fullname, sys.exc_info())
                    finally:
                        os.close(dirfd)
            else:
                try:
>                   os.unlink(name, dir_fd=topfd)
E                   FileNotFoundError: [Errno 2] No such file or directory: 'libgkl_compression2251932993111688424.so'
tools/conda-tools/default/lib/python3.6/shutil.py:436: FileNotFoundError
notestaff commented 6 years ago

I've seen this, have fixes, will check in soon.

tomkinsc commented 6 years ago

Is it a race condition of multiple threads trying to remove the same shared object file? Guessing this call is to blame? https://github.com/broadinstitute/viral-ngs/blob/master/tools/kraken.py#L168

dpark01 commented 6 years ago

Oh... maybe the Picard pipes aren't being closed properly before end-of-method, and the test fixture blows the tempdir away before Picard is done?

notestaff commented 6 years ago

I think it has to do with the tmpdir_function and tmpdir_module fixtures in conftest.py. Reading the docs of tmpdir_factory, that way of creating tempdirs is less robust than mkdtemp. Also, the rmtree calls don't check whether the tree still exists before trying to remove it.

dpark01 commented 6 years ago

Maybe we should add a .poll() on each of the fastq_pipes after this line: https://github.com/broadinstitute/viral-ngs/blob/master/tools/kraken.py#L173

notestaff commented 6 years ago

Maybe there are two separate issues here :)

dpark01 commented 6 years ago

Yes, it could be all of the above. The reason I think @tomkinsc's highlighted line of code might be at play is that if you google around for libgkl_compression.so, everything that comes up is Picard/GATK-related. So I think the shutil.rmtree is happening concurrently with Picard still using files in the directory tree being blown away.

notestaff commented 6 years ago

The simplest fix is to just check whether the tree exists before shutil.rmtree(new_tempdir), but I think it's better to rewrite this to use mkdtemp.
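A minimal sketch of what that rewrite could look like, assuming a plain yield-style pytest fixture (the fixture name tmpdir_function comes from the comment above; the prefix and everything else here is illustrative, not the actual conftest.py code):

import os
import shutil
import tempfile

import pytest

@pytest.fixture
def tmpdir_function():
    # Create the per-test directory with mkdtemp instead of tmpdir_factory.
    new_tempdir = tempfile.mkdtemp(prefix='test-function-')
    yield new_tempdir
    # Teardown: only remove the tree if something else hasn't already removed it.
    if os.path.isdir(new_tempdir):
        shutil.rmtree(new_tempdir)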

dpark01 commented 6 years ago

I bet both bits of code have issues that only really present as noticeable problems in combination.

I like using mkdtemp, but I think it's still good to rmtree at the end of a test fixture: as the test suite grows, you run out of local tmp space if you don't clean up after the big tests. Can't we just run shutil.rmtree(new_tempdir, ignore_errors=True) (the equivalent of rm -rf)?

BTW, I'm not sure whether the problem is that the directory doesn't exist, or that a file that was in the directory tree when rmtree started walking it disappeared (because it was a Picard temp file) before rmtree got around to unlinking it. So rm -rf semantics would make the test succeed, but it's probably still a good idea to make sure kraken doesn't return from its method invocation before all of its pipes are closed out and cleaned up.
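For reference, that minimal change to the teardown would look something like this (sketch only; reset and new_tempdir are the names shown in the conftest.py traceback above):

def reset():
    # rm -rf semantics: ignore files and dirs that vanish while the tree is being walked
    shutil.rmtree(new_tempdir, ignore_errors=True)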

notestaff commented 6 years ago

Maybe util.file.fifo() should poll the pipes after the yield, before unlinking them?

notestaff commented 6 years ago

And/or add a small time delay before unlinking the pipes and returning.

dpark01 commented 6 years ago

Actually, I just realized that the kraken code is using util.file.fifo, which makes things a bit tricky. These aren't really pipes; they're named pipes / fifos that behave like files. There's no way to poll or close them manually that I know of, because our Python code never opened them; we just passed the strings (filenames) to other programs that handled all of that.

The Picard side of the pipe can happily play with standard Python pipe-fitting (it can interleave the fastqs and is often used that way to pipe to bwa); it's the kraken side that has issues. Maybe we can revert the code here to use temp files again instead of fifos until we eventually tackle the larger question of how to clean up the pipe-fitting while also adding unpaired read support (#820). Then again, I'm not quite sure why reverting to temp files would solve the race condition... hm.

Or time delay...
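For context, util.file.fifo is roughly a context manager along these lines (a simplified sketch, not the actual implementation): it creates named pipes, yields their paths as strings, and removes them on exit, so the Python process itself never holds an open handle that could be polled or closed.

import contextlib
import os
import shutil
import tempfile

@contextlib.contextmanager
def fifo(num_pipes=1):
    # Sketch of a util.file.fifo-style helper: make named pipes in a private temp
    # dir, hand back their paths, and clean up afterwards. Python never opens the
    # fifos, so there is nothing here to poll() or close() -- only the external
    # writer (Picard) and reader (kraken) hold them open.
    pipedir = tempfile.mkdtemp()
    try:
        paths = [os.path.join(pipedir, 'pipe{}.fifo'.format(i)) for i in range(num_pipes)]
        for p in paths:
            os.mkfifo(p)
        yield paths if num_pipes > 1 else paths[0]
    finally:
        shutil.rmtree(pipedir, ignore_errors=True)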

notestaff commented 6 years ago

Adding a 2s delay at the end of metagenomics.diamond() does fix one transient bug in my experiments on AWS.

dpark01 commented 6 years ago

Wow, 2s is much longer than I would've tried (maybe 100ms)...

Oh wait! We're calling picard.execute(background=True) but ignoring the return value. That return value is a Popen object that we could use for the proper if picard_ps.poll(): raise subprocess.CalledProcessError(...) check. The fact that we're ignoring it means there's probably nothing actually forcing it to close properly...
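Something along these lines, perhaps (a hedged sketch, not the actual viral-ngs code; picard_cmd is a hypothetical placeholder for whatever command line picard.execute() builds):

import subprocess

# Keep the handle for the backgrounded Picard process instead of discarding it,
# then make sure it actually exited cleanly before any temp-dir cleanup runs.
picard_ps = subprocess.Popen(picard_cmd)   # roughly what background=True amounts to

# ... run kraken against the fifos while Picard writes into them ...

retcode = picard_ps.wait()                 # block until Picard has really finished
if retcode != 0:
    raise subprocess.CalledProcessError(retcode, picard_cmd)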

yesimon commented 6 years ago

I've had nondeterministic test failures in the past, but never this specific one. I think it's highly likely to be related to some combination of xdist, temp directory removal, and having background processes. I don't think the named fifos should be causing problems, because they look like real files to the consuming program as long as it doesn't try to seek or tell.

I've tried in the past to add ps.wait() calls to finish up all the background processes, but it didn't seem to have any effect. Regardless, adding them in should not cause any harm.

Mainly, gluing together the std streams of subprocesses through Python seemed to cause the most issues, so that's something I try to avoid, using named fifos instead.