Many GPU algorithms are formulated with a lot of mutation, which Harlan doesn't currently handle with the greatest performance. We should consider some restricted forms of mutation; some of them may not even be hard to implement, so we could experiment with them.
Here's an example of one that we were talking about earlier today:
    (kernel-update! ((x xs) <- (y ys))
      (* x y))
This is sort of like (set! xs (kernel ((x xs) (y ys)) (* x y))), but it would overwrite the contents of xs with the result of this kernel (perhaps with a check to make sure we don't write the exact same value), rather than making xs point to a new vector.
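To make the difference concrete, here's a rough CUDA sketch of what the two forms might compile to. The kernel names and the simple 1-D launch are illustrative assumptions on my part, not what Harlan's backend actually emits:

    // Lowering for (set! xs (kernel ((x xs) (y ys)) (* x y))):
    // write into a freshly allocated buffer, then repoint xs at it.
    __global__ void elementwise_fresh(const float* xs, const float* ys,
                                      float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = xs[i] * ys[i];
    }

    // Lowering for (kernel-update! ((x xs) <- (y ys)) (* x y)):
    // overwrite xs in place; each thread writes only its own slot,
    // and no new buffer is allocated.
    __global__ void elementwise_update(float* xs, const float* ys, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) xs[i] = xs[i] * ys[i];
    }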
There are a couple of issues to discuss with a form like this.
I think at the very least we need to keep type safety, meaning that if you read a value that's supposed to have a certain type, the value you actually get has that type. As long as all the updates are atomic, I don't think this will be a problem.
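For intuition on why atomicity is what matters, here's a CUDA sketch, assuming (my assumption, for illustration) elements that are pairs of 32-bit ints. Writing the two halves with separate stores could let a racing reader observe a half-updated pair, which is not a valid value of the pair type; one packed 64-bit atomic write keeps every observable value well-formed:

    // A pair of 32-bit ints stored as one 64-bit word. A single
    // atomicExch makes the write indivisible, so a racing reader sees
    // either the complete old pair or the complete new one.
    __global__ void update_pairs(unsigned long long* xs, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            unsigned long long p = xs[i];           // this thread's own slot
            unsigned fst = (unsigned)(p >> 32) + 1; // update both halves
            unsigned snd = (unsigned)p + 1;
            atomicExch(&xs[i], ((unsigned long long)fst << 32) | snd);
        }
    }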
The next issue is how much determinism we need. So far Harlan is basically deterministic (modulo I/O, etc.), and this is a nice property to have. However, some nondeterminism can be worth it: EigenCFA, for example, tolerates monotonic races to get much better performance.
One hopefully-not-too-difficult-to-enforce option would be that kernel-update! cannot do arbitrary reads from its destination. I think it's safe for each thread to get the previous value (which is why the example binds x to an element in xs), and that this should not break determinism. This seems safe because we implicitly synchronize at the end of a kernel-update!.
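Here's a hedged CUDA sketch of that restriction (the names are made up, and cudaDeviceSynchronize stands in for the implicit barrier at the end of kernel-update!). Because thread i reads only its own previous value dest[i], no thread ever reads a slot another thread writes, so the result is deterministic:

    // The one read we allow is a thread's own slot, so all reads and
    // writes are to disjoint indices.
    __global__ void scale_in_place(float* dest, const float* src, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float prev = dest[i];     // previous value of this slot only
            dest[i] = prev * src[i];
        }
    }

    // Host side: the explicit sync plays the role of the implicit
    // barrier at the end of a kernel-update! -- nothing observes dest
    // mid-update.
    //   scale_in_place<<<blocks, threads>>>(d_dest, d_src, n);
    //   cudaDeviceSynchronize();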
Another option would be to allow monotonic reads/updates. This seems way harder to enforce, but doing so would allow us to relax synchronization some for possibly even better performance. For example, consider a BFS traversal like this:
    (kernel-update! ((_ depths) <- (v vertices))
      (+ 1 (reduce min (kernel ((i (incoming-edges v)))
                         (vector-ref depths i)))))
This version could race, but since we'd use it as part of a fixpoint algorithm, we should get the same answer eventually anyway. This works because the body of the kernel is monotonically decreasing (a vertex's depth only ever shrinks toward its true value, and it's bounded below), so repeated passes converge to the same fixpoint regardless of how the races resolve. I feel like proving that the body is monotonically decreasing would be really hard, though.
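As a sketch of how the monotonic version might run on the GPU (the CSR graph layout, the names, and the host-driven fixpoint loop are all my assumptions, not Harlan's output), atomicMin guarantees each write only ever decreases a depth, so races can delay convergence but never change the fixpoint:

    #include <climits>

    // depths[v] starts at INT_MAX except the source vertex (0).
    // offsets/edges form a CSR listing of each vertex's incoming
    // neighbors. Reading a stale neighbor depth is harmless: the
    // atomicMin write is still a decrease, so every update moves
    // monotonically toward the fixpoint.
    __global__ void relax_depths(int* depths, const int* offsets,
                                 const int* edges, int n, int* changed) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n) return;
        int best = depths[v];
        for (int e = offsets[v]; e < offsets[v + 1]; ++e) {
            int d = depths[edges[e]];                   // racy, possibly stale
            if (d != INT_MAX) best = min(best, d + 1);  // guard overflow
        }
        if (best < atomicMin(&depths[v], best))
            *changed = 1;  // made progress; host reruns until a quiet round
    }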
Finally, we need to consider how much this impacts other optimizations. We want to avoid more chances to introduce bugs like #56.