baidu-research / persistent-rnn

Fast Recurrent Networks Library
Apache License 2.0
574 stars 87 forks

Potential bug in Synchronizer.get_failed_() #15

Open eamartin opened 7 years ago

eamartin commented 7 years ago

I'm considering adding TensorFlow bindings to persistent-rnn, and I've been reading the codebase for my own understanding.

In https://github.com/baidu-research/persistent-rnn/blob/master/include/prnn/detail/rnn/synchronizer.h#L86

    bool get_failed_() const {
        index_t failed = 0;

        prnn::parallel::CudaRuntimeLibrary::cudaMemcpyAsync(&failed,
            get_barrier_failed_flag(), sizeof(index_t),
            prnn::parallel::CudaRuntimeLibrary::cudaMemcpyDefault, stream_);

        return failed != 0;
    }

It seems to me that this has a race condition between the function returning (`failed` going out of scope) and the `cudaMemcpyAsync` completing. Is this the case? If not, what synchronizes the memcpy before the function returns? There are several other similar examples in Synchronizer.

Thanks for any comment on this "maybe not an issue, just a question about the code" issue :)