Closed gperciva closed 3 years ago
Updated with REBASE commits.
I'm inclined to just say that if pthread functions are breaking, we're probably going to end up with deadlocks sooner or later so we might as well just accept it.
I'm not certain that I follow. The whole point is that if we find that pthread functions are breaking, we don't run the provided routine, and pthead_create_blocking()
returns with an error. In theory, the polling loop should discover that something's very wrong within 1 second, and quit.
Granted, I expect that in such a case, the calling code would just head to an exit(1)
.
Commit 8f9e8ed removes that error-handling code. I think it's fairly harmless and worth keeping, but up to you.
I'm leaning towards reinstating the provided function being required to call
init_func(init_cookie);
The current code in this PR does what it says: it blocks until ${thread} has started. But that's different from blocking until ${start_routine} has started, and I can fairly reliably (maybe 50% of the time) have the pthread_create_blocking
return before the ${start_thread} has begun:
0.00000000 1 main_start
0.00003700 1 thr_start
0.00004200 1 thr_stop
0.00003700 1 main_after_block
0.00007900 1 main_after_join
0.00000000 2 main_start
0.00008200 2 thr_start <---------- here
0.00008300 2 thr_stop
0.00007800 2 main_after_block <---------- here
0.00009100 2 main_after_join
test_pthread_create_blocking: timing.c, 195
test_pthread_create_blocking: info2.thr_start is after info2.main_after_block
("<--- here" editorial marks)
Now, we could argue that ${start_routine} is very likely to start very soon after ${thread} has begun, so that's enough for pthread_create_blocking()
. But that seems like a weak argument to me. If the caller can't rely on ${start_routine} to have begun, then why bother with pthread_create_blocking()
at all?
I thought the point of this was just to make sure that the thread ID was actually valid when the create function returns?
... I can't believe that I forgot that. Huh.
Why have a cleanup routine? The only difference between doing it here and having it in start_routine
is in the case where some very basic pthread routines break, and that's basically a situation of "the OS is broken and there's nothing we can do to recover" anyway.
... I mean, if pthread_mutex_*
or pthread_cond_*
aren't working, what's the odds that pthread_cleanup_*
will work anyway?
I'm imagining the scenario of the caller doing:
if (pthread_create_blocking(thr, ...) != 0)
goto err0;
pthread_cancel(thr);
If pthread_cleanup
is part of start_routine
, then it may or may not be set up, so we're left with indeterminately-cleaned-up resources.
(not that it really matters, but I figure that if we're going to name something after pthread_create
, it should be as bullletproof as possible)
... on a related note, it occurred to me that pthread_
is probably a reserved identifier. (I haven't tracked down the exact reference yet.)
Should we call it something else like create_thread_blocking
? I can't think of any good alternate name right now, but I'm reluctant to use a technically-invalid name.
should be as bullletproof as possible
Just to elaborate on this... you've mused about other groups potentially using some libcperciva files (I think it was parsenum.h in particular). I'm thinking of pthread_create_blocking
in a similar way, i.e. trying to make sure it works (as well as possible) in all situations, not just what we want for spiped in particular.
I'm imagining the scenario of the caller doing:
if (pthread_create_blocking(thr, ...) != 0) goto err0; pthread_cancel(thr);
If
pthread_cleanup
is part ofstart_routine
, then it may or may not be set up, so we're left with indeterminately-cleaned-up resources.
Meh, if you call pthread_cancel
it's up to you to ensure that you know what you're cancelling. In particular, the thread should set up its cleanup handlers appropriately with respect to creating resources; when the thread is first created there's nothing to clean up yet, of course.
... on a related note, it occurred to me that
pthread_
is probably a reserved identifier. (I haven't tracked down the exact reference yet.)Should we call it something else like
create_thread_blocking
? I can't think of any good alternate name right now, but I'm reluctant to use a technically-invalid name.
Meh. Yes, it's reserved; but I'm not too concerned about this. Maybe call it pthread_create_blocking_np
though; the _np
suffix is widely used to denote non-POSIX extensions to pthreads.
Ok, I'll remove the cleanup
stuff again. I forgot that pthread_cancel
only takes effect at a cancellation point, so the provided function can safely ensure that it sets up its own cleanup (if it wants one) before any cancellation points.
Can I rebase this branch? It's such a mess now, keeping REBASE commits doesn't make it easier to see what's happening.
when the thread is first created there's nothing to clean up yet, of course.
As a quibble, that wouldn't allow a thread to take ownership of resources.
For example, in spipe/pushbits.c
we have:
pushbits()
allocates struct push * P
.workthread()
sets up workthread_cleanup
, which does free(P)
.But that's ok with the current design, because the ${start_routine}
can set up its own cleanup function before any cancellation points.
Yes, feel free to clean up the history. I'm only looking at the final version of the file right now anyway.
when the thread is first created there's nothing to clean up yet, of course.
As a quibble, that wouldn't allow a thread to take ownership of resources.
It could, but I agree that it's more work. (e.g. by locking a shared mutex and setting a thread_has_taken_ownership
variable.)
Rebased branch, but included one MAYBE commit.
8 days ago, you said:
I'm inclined to just say that if pthread functions are breaking, we're probably going to end up with deadlocks sooner or later so we might as well just accept it. If you're really worried, mark the mutex as robust so that pthread_cond_wait will return if the spawned thread dies.
I don't think that I understand. As far as I can tell, as long as a breaking pthread function returns a non-zero value, we can at least ensure that pthread_create_blocking_np()
returns non-zero.
I know that so far you've been looking at the overall "files changed", but please take a glance at this single commit: 071e164 MAYBE add polling check for signalling error
Without this, if there's a problem with pthread_lock
or pthread_cond_signal
, pthread_create_blocking_np()
can block indefinitely because U->cond
was never signalled.
With this commit, the worst case is that pthread_create_blocking_np()
will return an error within a second of such a failure.
I don't like the MAYBE. On the positive side, it would avoid a deadlock if... err... very low-level pthreads functions are broken? But on the downside, it could introduce failures if a system is so heavily loaded that it takes more than a second before a new thread starts running.
Ok, I've reverted the MAYBE, and added REBASE commits for everything you noted.
But on the downside, it could introduce failures if a system is so heavily loaded that it takes more than a second before a new thread starts running.
Wait, I don't understand this part.
If pthread_cond_timedwait
in 071e164 ends with a ETIMEDOUT, we ignore the error and cond_timedwait
returns with a 0 so there's no goto err5
. U->rsync
is also still a 0, so there's also no goto err5
. U->running
is still 0, so we start another 1-second timed wait. We keep on looping -- checking if there's an error every second -- until either the thread has started, or we've found an error.
(I've already reverted the commit in this branch; I'm asking for general pthread knowledge.)
Added a MAYBE comment in c5900e43 to explain why we're not using polling.
But on the downside, it could introduce failures if a system is so heavily loaded that it takes more than a second before a new thread starts running.
Wait, I don't understand this part.
If
pthread_cond_timedwait
in 071e164 ends with a ETIMEDOUT, we ignore the error andcond_timedwait
returns with a 0 so there's nogoto err5
.U->rsync
is also still a 0, so there's also nogoto err5
.U->running
is still 0, so we start another 1-second timed wait. We keep on looping -- checking if there's an error every second -- until either the thread has started, or we've found an error.(I've already reverted the commit in this branch; I'm asking for general pthread knowledge.)
Never mind, I was misreading the code. Yeah, not dangerous; just useless. In fact, even more useless than I realized at first, since pthread_cond_timedwait isn't guaranteed to return after the specified time; rather, after the specified time it tries to reacquire the mutex and blocks until that succeeds. So the case of the pthread library being broken and is likely to result in this polling not actually polling anyway.
I think the MAYBE comment will add more confusion than it removes, at least among people familiar with pthreads. But the rest looks good; please rebase.
Maybe worth s/main thread/parent thread/ everywhere? Since conceivably any thread could spawn a new one. Up to you; both options are clear, I think.
Rebased with "parent thread" instead of "main thread". Potentially ready for merge.
Wait, I forgot to update "main -> parent" in the test file timing.c
.
Fixed, potentially ready for merge.
Alternate API (instead of #244). I like it much better; easier to reason about it.
I'm still not 100% confident about the internal implementation, though... it feels like it could be refactored with one or two more functions to make it more readable. OTOH, the last two times that I looked at it, I couldn't see any other "architectural" improvements.