pthread_create_blocking alternate api

gperciva commented 3 years ago

Alternate API (instead of #244). I like it much better; easier to reason about it.

I'm still not 100% confident about the internal implementation, though... it feels like it could be refactored with one or two more functions to make it more readable. OTOH, the last two times that I looked at it, I couldn't see any other "architectural" improvements.

gperciva commented 3 years ago

Updated with REBASE commits.

I'm inclined to just say that if pthread functions are breaking, we're probably going to end up with deadlocks sooner or later so we might as well just accept it.

I'm not certain that I follow. The whole point is that if we find that pthread functions are breaking, we don't run the provided routine, and pthead_create_blocking() returns with an error. In theory, the polling loop should discover that something's very wrong within 1 second, and quit.

Granted, I expect that in such a case, the calling code would just head to an exit(1).

Commit 8f9e8ed removes that error-handling code. I think it's fairly harmless and worth keeping, but up to you.

gperciva commented 3 years ago

I'm leaning towards reinstating the provided function being required to call

init_func(init_cookie);

The current code in this PR does what it says: it blocks until ${thread} has started. But that's different from blocking until ${start_routine} has started, and I can fairly reliably (maybe 50% of the time) have the pthread_create_blocking return before the ${start_thread} has begun:

0.00000000  1   main_start
0.00003700  1   thr_start
0.00004200  1   thr_stop
0.00003700  1   main_after_block
0.00007900  1   main_after_join
0.00000000  2   main_start
0.00008200  2   thr_start                    <---------- here
0.00008300  2   thr_stop
0.00007800  2   main_after_block             <---------- here
0.00009100  2   main_after_join
test_pthread_create_blocking: timing.c, 195
test_pthread_create_blocking: info2.thr_start is after info2.main_after_block

("<--- here" editorial marks)

Now, we could argue that ${start_routine} is very likely to start very soon after ${thread} has begun, so that's enough for pthread_create_blocking(). But that seems like a weak argument to me. If the caller can't rely on ${start_routine} to have begun, then why bother with pthread_create_blocking() at all?

cperciva commented 3 years ago

I thought the point of this was just to make sure that the thread ID was actually valid when the create function returns?

gperciva commented 3 years ago

... I can't believe that I forgot that. Huh.

cperciva commented 3 years ago

Why have a cleanup routine? The only difference between doing it here and having it in start_routine is in the case where some very basic pthread routines break, and that's basically a situation of "the OS is broken and there's nothing we can do to recover" anyway.

cperciva commented 3 years ago

... I mean, if pthread_mutex_* or pthread_cond_* aren't working, what's the odds that pthread_cleanup_* will work anyway?

gperciva commented 3 years ago

I'm imagining the scenario of the caller doing:

if (pthread_create_blocking(thr, ...) != 0)
    goto err0;
pthread_cancel(thr);

If pthread_cleanup is part of start_routine, then it may or may not be set up, so we're left with indeterminately-cleaned-up resources.

(not that it really matters, but I figure that if we're going to name something after pthread_create, it should be as bullletproof as possible)

gperciva commented 3 years ago

... on a related note, it occurred to me that pthread_ is probably a reserved identifier. (I haven't tracked down the exact reference yet.)

Should we call it something else like create_thread_blocking? I can't think of any good alternate name right now, but I'm reluctant to use a technically-invalid name.

gperciva commented 3 years ago

should be as bullletproof as possible

Just to elaborate on this... you've mused about other groups potentially using some libcperciva files (I think it was parsenum.h in particular). I'm thinking of pthread_create_blocking in a similar way, i.e. trying to make sure it works (as well as possible) in all situations, not just what we want for spiped in particular.

cperciva commented 3 years ago

I'm imagining the scenario of the caller doing:
if (pthread_create_blocking(thr, ...) != 0)
    goto err0;
pthread_cancel(thr);
If pthread_cleanup is part of start_routine, then it may or may not be set up, so we're left with indeterminately-cleaned-up resources.

Meh, if you call pthread_cancel it's up to you to ensure that you know what you're cancelling. In particular, the thread should set up its cleanup handlers appropriately with respect to creating resources; when the thread is first created there's nothing to clean up yet, of course.

cperciva commented 3 years ago

... on a related note, it occurred to me that pthread_ is probably a reserved identifier. (I haven't tracked down the exact reference yet.)

Should we call it something else like create_thread_blocking? I can't think of any good alternate name right now, but I'm reluctant to use a technically-invalid name.

Meh. Yes, it's reserved; but I'm not too concerned about this. Maybe call it pthread_create_blocking_np though; the _np suffix is widely used to denote non-POSIX extensions to pthreads.

gperciva commented 3 years ago

Ok, I'll remove the cleanup stuff again. I forgot that pthread_cancel only takes effect at a cancellation point, so the provided function can safely ensure that it sets up its own cleanup (if it wants one) before any cancellation points.

Can I rebase this branch? It's such a mess now, keeping REBASE commits doesn't make it easier to see what's happening.

gperciva commented 3 years ago

when the thread is first created there's nothing to clean up yet, of course.

As a quibble, that wouldn't allow a thread to take ownership of resources.

For example, in spipe/pushbits.c we have:

main: pushbits() allocates struct push * P.
thread: workthread() sets up workthread_cleanup, which does free(P).

But that's ok with the current design, because the ${start_routine} can set up its own cleanup function before any cancellation points.

cperciva commented 3 years ago

Yes, feel free to clean up the history. I'm only looking at the final version of the file right now anyway.

cperciva commented 3 years ago

when the thread is first created there's nothing to clean up yet, of course.

As a quibble, that wouldn't allow a thread to take ownership of resources.

It could, but I agree that it's more work. (e.g. by locking a shared mutex and setting a thread_has_taken_ownership variable.)

gperciva commented 3 years ago

Rebased branch, but included one MAYBE commit.

8 days ago, you said:

I'm inclined to just say that if pthread functions are breaking, we're probably going to end up with deadlocks sooner or later so we might as well just accept it. If you're really worried, mark the mutex as robust so that pthread_cond_wait will return if the spawned thread dies.

I don't think that I understand. As far as I can tell, as long as a breaking pthread function returns a non-zero value, we can at least ensure that pthread_create_blocking_np() returns non-zero.

I know that so far you've been looking at the overall "files changed", but please take a glance at this single commit: 071e164 MAYBE add polling check for signalling error

Without this, if there's a problem with pthread_lock or pthread_cond_signal, pthread_create_blocking_np() can block indefinitely because U->cond was never signalled.

With this commit, the worst case is that pthread_create_blocking_np() will return an error within a second of such a failure.

cperciva commented 3 years ago

I don't like the MAYBE. On the positive side, it would avoid a deadlock if... err... very low-level pthreads functions are broken? But on the downside, it could introduce failures if a system is so heavily loaded that it takes more than a second before a new thread starts running.

gperciva commented 3 years ago

Ok, I've reverted the MAYBE, and added REBASE commits for everything you noted.

gperciva commented 3 years ago

But on the downside, it could introduce failures if a system is so heavily loaded that it takes more than a second before a new thread starts running.

Wait, I don't understand this part.

If pthread_cond_timedwait in 071e164 ends with a ETIMEDOUT, we ignore the error and cond_timedwait returns with a 0 so there's no goto err5. U->rsync is also still a 0, so there's also no goto err5. U->running is still 0, so we start another 1-second timed wait. We keep on looping -- checking if there's an error every second -- until either the thread has started, or we've found an error.

(I've already reverted the commit in this branch; I'm asking for general pthread knowledge.)

gperciva commented 3 years ago

Added a MAYBE comment in c5900e43 to explain why we're not using polling.

cperciva commented 3 years ago

But on the downside, it could introduce failures if a system is so heavily loaded that it takes more than a second before a new thread starts running.

Wait, I don't understand this part.

If pthread_cond_timedwait in 071e164 ends with a ETIMEDOUT, we ignore the error and cond_timedwait returns with a 0 so there's no goto err5. U->rsync is also still a 0, so there's also no goto err5. U->running is still 0, so we start another 1-second timed wait. We keep on looping -- checking if there's an error every second -- until either the thread has started, or we've found an error.

(I've already reverted the commit in this branch; I'm asking for general pthread knowledge.)

Never mind, I was misreading the code. Yeah, not dangerous; just useless. In fact, even more useless than I realized at first, since pthread_cond_timedwait isn't guaranteed to return after the specified time; rather, after the specified time it tries to reacquire the mutex and blocks until that succeeds. So the case of the pthread library being broken and is likely to result in this polling not actually polling anyway.

cperciva commented 3 years ago

I think the MAYBE comment will add more confusion than it removes, at least among people familiar with pthreads. But the rest looks good; please rebase.

Maybe worth s/main thread/parent thread/ everywhere? Since conceivably any thread could spawn a new one. Up to you; both options are clear, I think.

gperciva commented 3 years ago

Rebased with "parent thread" instead of "main thread". Potentially ready for merge.

gperciva commented 3 years ago

Wait, I forgot to update "main -> parent" in the test file timing.c.

gperciva commented 3 years ago

Fixed, potentially ready for merge.

Tarsnap / spiped

pthread_create_blocking alternate api #327