Problem with multithreaded iteration over non-T-type array using lparallel

phantomics commented 2 years ago

It seems that Clasp (tested latest version) has a problem with multithreaded iteration over non-T-type numeric arrays. Here is example code using April:

(ql:quickload 'april) (in-package :april) ;; load and enter the April package

(april "1+1 2 3") ;; evaluate an April expression to spawn an lparallel kernel

;; then run a multithreaded array operation
(let ((output (make-array '(10 6) :element-type '(unsigned-byte 4) :initial-element 0)))
  (xdotimes output (i (size output))
    (multiple-value-bind (oseg remainder) (floor i 6)
      (let ((dx (loop :for d :across #(1 3 6) :for di :from 0
                      :when (> d remainder) :return di)))
        (if (< 0 (aref #(1 -2 3) dx))
            (setf (row-major-aref output i)
                  (row-major-aref (april "10 3⍴⍳5")
                                  (+ dx (* oseg 3))))))))
  output)

;; this code comes from the (expand) function, here is an invocation that does the same thing
(expand #(1 -2 3) (april "3 ⎕DT 10 3⍴⍳5") 1 :compress-mode t)

#2A((1 0 0 3 3 3) ;; usually you get this, the correct output
    (4 0 0 1 1 1)
    (2 0 0 4 4 4)
    (5 0 0 2 2 2)
    (3 0 0 5 5 5)
    (1 0 0 3 3 3)
    (4 0 0 1 1 1)
    (2 0 0 4 4 4)
    (5 0 0 2 2 2)
    (3 0 0 5 5 5))

#2A((1 0 0 3 3 3) ;; but sometimes, you get something like this
    (4 0 0 1 1 1)
    (0 0 0 4 4 4)
    (5 0 0 0 0 0) ;; ← note the difference
    (3 0 0 5 5 5)
    (1 0 0 3 3 3)
    (4 0 0 1 1 1)
    (0 0 0 4 4 4)
    (5 0 0 0 0 0)
    (3 0 0 5 5 5))

This is happening because two threads are attempting to write to array elements that are being held in the same register at the same time, and one write attempt clobbers the other. Note that I create the input array using April above. This is because:

* (type-of (april "10 3⍴⍳5"))

(SIMPLE-ARRAY EXT:BYTE4 (10 3))

* (type-of #2A((1 2 3)(4 5 1)(2 3 4)(5 1 2)(3 4 5)(1 2 3)(4 5 1)(2 3 4)(5 1 2)(3 4 5)))

(SIMPLE-ARRAY T (10 3))

April generates integer arrays of the smallest type necessary to represent the numbers inside. If you try:

(expand #(1 -2 3) #2A((1 2 3)(4 5 1)(2 3 4)(5 1 2)(3 4 5)(1 2 3)(4 5 1)(2 3 4)(5 1 2)(3 4 5)) 1 :compress-mode t)

You'll find that it always works. Multithreaded iteration over a T-type array doesn't cause problems. I dealt with these issues earlier while getting April to work multithreaded. The consistent pattern in other CLs I found was that 8-bit integers and larger are stored individually in registers when doing array processing, while integers smaller than 8 bits are stored in blocks in 64-bit registers. This means that in order to do multithreaded operations on sub-8-bit arrays, I need to operate on the elements in 64-bit chunks, so for a 4-bit array April will process the elements in blocks of 16 elements in each thread.

However, forcing 8-bit arrays in Clasp still poses a problem. To test with an 8-bit integer array you can do this:

(expand #(1 -2 3) (april "3 ⎕DT 10 3⍴⍳5") 1 :compress-mode t)

And you'll find that the output is still sometimes wrong. Wrong output seems to be rarer with 8-bit arrays though.

The multithreaded operation is done using the (xdotimes) macro. It wraps lparallel's pdotimes in a way that's convenient for use with arrays. See its code here.

This bug affects many April functions, all of which work in basically the same way, iterating over an output array in row-major order, finding the needed element from the input array and assigning it to the output position. In this case the assignments to the output array are clobbering each other, so a first step would be to check how arrays are modeled and what kind of registers are used to hold elements in typed integer arrays compared to T-type arrays.

Let me know if more information would help.

Bike commented 2 years ago

Are you saying that you have a non-T array (specialized on (unsigned-byte 8) or something), and you have multiple threads writing to different positions in that array concurrently? And that writes in one thread are clobbering writes in another because the underlying storage operations write entire words at a time?

phantomics commented 2 years ago

Writing entire words at a time occurs only on arrays with elements smaller than 8 bits. At 8 bits or larger, write operations assign one element's worth of data at a time. At least, that's the case in other CLs I've tested.

I discovered the word-at-a-time problem before and wrote a workaround that splits arrays into word-sized chunks for each thread. But now the clobbering is happening in Clasp, and it happens whether the number is 8-bit or smaller. Another data point: I just tried many iterations with a 16-bit array, and there were no errors.

If Clasp is writing words at a time, it should be fine because April has logic in place to handle that, at least for sub-8-bit elements. Something else must cause the clobbering.

Note: I updated the repro method above, if it gave you problems before it should work now.

Bike commented 2 years ago

OK, so a few things here.

As you probably know, most machines cannot actually do writes of less than a byte at a time. As such, it is not possible to write in a nibble without writing in an adjacent nibble as well. That's why you see interference on other implementations for less than 8 bits - to fix it the runtime would need to give each nibble its own byte, which kind of defeats the point of the specialization.

Now, even that aside, you are kind of in the undefined behavior zone here. Lisps support threading but don't generally define detailed semantics for it much at all, or if they do, they're pretty vague past the "shared objects are only touched in locks" level. If you aren't aware, compilers and machines do absolutely bonkers optimizations that make it very hard to analyze lockless concurrent programs, and sometimes those optimizations can't be suppressed without going out of your way in the code generator, so this makes the kind of code you're writing here unfortunately problematic. Most Lisp implementations do not do such optimizations, but in Clasp we leverage LLVM which does. (Even in, like, SBCL, it might be sometimes required to put in fences manually, that kind of thing; they do so internally.)

I've done some provisional work on defining semantics. What it boils down to is that I would, in general, like code like you have here to have to specify that reads and writes need to be atomic - something like (setf (atomic (aref ...)) ...) - so that the compiler can apply optimizations in the non atomic case while not applying them in the atomic case. And so that it can issue a warning if you ask it to do something it can't handle, like atomic sub-byte accesses. Clasp has a bit of an extension in that direction, but it's nothing like complete.

That said, it would be nice if your code worked regardless. I think I have some idea of why it doesn't. Clasp's arrays are defined in C++, and for now at least the access functions are as well. In order to perform atomic accesses in C++, you need to mark types as std::atomic. We do this in a few places, so e.g. standard-instance-access actually is atomic, but we don't do it for arrays. (At the runtime level, this is GCArray_moveable versus GCArray_atomic, both defined in gcarray.h; you can see that the atomic version is used in instances here, but vectors use the non atomic version here.)

Because of this, our C++ compiler is technically allowed to do optimizations on the assumption that the code you're writing doesn't happen and the arrays are only accessed from one thread at a time. Sometimes it apparently does so and things end badly, as for the byte arrays, and sometimes it doesn't and things work somewhat by coincidence, as for the T arrays. But in either case the situation is not as stable as it ought to be. This is the case for all arrays, not just byte arrays or anything, so the fact that Clasp does happen to store byte arrays as sequences of bytes rather than words might not help. What could be happening is that the compiler or processor (e.g. cache) decides to work with words or half words instead of bytes for whatever reason, which for a single threaded program would not be a problem.

So for a start it would be nice if Clasp used atomic arrays at least in most situations. Unfortunately, we have some legacy code that grabs pointers interior to our arrays, which makes that difficult. Fixing that could take some time. Secondarily, as the compiler gets smarter we will have to be sure not to allow especially wild optimizations.

It's possible something else is going on. I haven't analyzed what your code is doing in detail. I don't know APL beyond the cursory level, and it's hard to, e.g., navigate the 180 line expand function, especially without knowing the context of how April or APL work or what the function is supposed to be doing. (Like, I have no idea what "Expand an input array as per a vector of degrees" means. I could try to figure it out, but it's a bit of a distraction from the actual problem, which is after all in Clasp and not April.) If you have the time and resources, it would be helpful to get a reproduction of the problem independent of April, for example using lparallel and maybe your xdotimes or ydotimes macros. Or even better, stripped back to using bordeaux threads, since that's just a portability layer. That's the kind of thing we have in mind for example code in the ideal case. (ETA: I tried this a bit myself but was unable to find the issue.) If you don't have time, that's fine too, and the problem I outlined above is probably it anyway.

phantomics commented 2 years ago

Thanks for the explanation. The example expression at the top is already mostly independent of April, here's a completely independent version that only uses lparallel:

(let* ((input (make-array '(10 3) :element-type '(unsigned-byte 4) :initial-contents 
                          '((1 2 3)(4 5 1)(2 3 4)(5 1 2)(3 4 5)(1 2 3)(4 5 1)(2 3 4)(5 1 2)(3 4 5))))
       (output (make-array '(10 6) :element-type '(unsigned-byte 4) :initial-element 0))
       (otype (element-type output)))
  (let ((iterations (if (eq 'bit otype)
                        64 (/ 64 (second otype)))))
    (multiple-value-bind (dividend remainder)
        (ceiling (size output) iterations)
      (pdotimes (x dividend)
        (dotimes (y (+ iterations (if (< x (1- dividend))
                                      0 remainder)))
          (let ((i (+ y (* x iterations))))
            (multiple-value-bind (oseg remainder)
                (floor i 6)
              (let ((dx (loop :for d :across #(1 3 6) :for di :from 0
                              :when (> d remainder) :return di)))
                (if (< 0 (aref #(1 -2 3) dx))
                    (setf (row-major-aref output i)
                          (row-major-aref input (+ dx (* oseg 3)))))))))))
    output))

I haven't worked directly with bordeaux-threads so I can't be of much help there. I could look into it but it'd take some time.

It makes sense that advanced optimizations done by the compiler are interfering with April's optimizations; array programming languages have unique potentials for optimization that aren't found in other types of languages, so compilers for other languages that do different kinds of deep optimization can be expected to conflict.

The quick and dirty way to fix this is simply to disable multithreading on array operations with small integer types within Clasp. I could also make all generated arrays T-type in Clasp but I would have to watch out for corner cases affected by such a change.

The expand function basically stretches an array over one of its dimensions. You can see documentation for it here if it interests you. The important thing to know is that all of these APL array manipulation functions work in a similar way. They take an input array and some parameters, figure out what the dimensions of the output array will be, and create the blank output array. They then iterate over the output array, take the row-major index of each element and do arithmetic on that index to find the corresponding index from the input array. Finally they assign the value of the corresponding index from the input to the proper position in the output. So basically you have many different ways of shuffling array elements around. This task is a natural fit for multithreading with no data needing to be shared between threads.

Let me know if any other info would help.

phantomics commented 2 years ago

Trying the code above, I just realized something. The integer element types in Clasp are not expressed as lists like (unsigned-byte 4), instead they're symbols like ext:byte4. April expects integer types to be expressed as lists so that may be a cause for the problem here.

Bike commented 2 years ago

Yeah, that's kind of a weird thing. They should be type equivalent at least, like (subtypep 'ext:byte4 '(unsigned-byte 4)) and so on.

Thank you for the independent example. just doing it in terms of lparallel should be fine. Is there still a problem in the example with larger element types? Because multithreaded clobbering for sub-byte arrays is pretty much a WONTFIX situation - there's no practical way to avoid it that I am aware of, unless one counts not having sub-byte array specializations at all.

phantomics commented 2 years ago

The way to avoid sub-byte clobbering should just be to break the array into word-sized chunks, right? Then each thread assigns the array a word at a time and no clobbering occurs, at least not on other CLs.

Bike commented 2 years ago

Eh? No, that's exactly why there is clobbering. Say we have an (unsigned-byte 4) vector 16 long, initially all zero. Thread one does (setf (aref vec 0) 15) and thread two does (setf (aref vec 1) 15). Under the hood, that means thread one does (setf (%word vec 0) #xf000000000000000) and thread two does (setf (%word vec 0) #x0f00000000000000), i.e. they write to the same storage word. So once the dust has cleared, either the word is #xf000000000000000 or #x0f00000000000000, and not what it should be, #xff00000000000000.

I am confused. I thought you said that on all implementations there was clobbering on (unsigned-byte 4) arrays, so you opted not to work with those directly, and now you were seeing clobbering on (unsigned-byte 8) arrays which ought not to be subjected to this problem

Bike commented 2 years ago

I can get clobbering on SBCL with (unsigned-byte 4), though of course it's not common:

(loop with arr = (make-array 2 :element-type '(unsigned-byte 4))
      repeat 1000000
      do (fill arr 0)
         (let* ((t1 (sb-thread:make-thread
                     (lambda () (setf (aref arr 0) 15))))
                (t2 (sb-thread:make-thread
                     (lambda () (setf (aref arr 1) 15)))))
           (sb-thread:join-thread t1)
           (sb-thread:join-thread t2)
           (unless (equalp arr #(15 15)) (return arr))))
=> #(0 15)

phantomics commented 2 years ago

You're correct on how the array storage works. The way I compensate for that is to divide the array into word-sized chunks for the threads. Let's say I have a 100-element vector of 4-bit ints. I assign the first 16 elements (64 bits) in thread 1, the next 16 in thread 2, etc. Thus, no clobbering. This should happen in Clasp but it seems not to. The other strange thing is that clobbering happens with 8-bit ints, which should be safe to assign using one element per thread, but there may be an LLVM optimization interfering as you said. I'm looking into the byte4 symbol issue to see if that's the cause; perhaps in Clasp I should always assign array elements in word-sized chunks.

P.S. is there a simple way to get the element size as a number for a given array in Clasp? For example, a function that given an array of 16-bit ints (signed or unsigned) would return 16.

Bike commented 2 years ago

Okay, so when I said

multithreaded clobbering for sub-byte arrays is pretty much a WONTFIX situation - there's no practical way to avoid it that I am aware of, unless one counts not having sub-byte array specializations at all

what I meant was, in Clasp, code like the SBCL example above is always going to be broken and there's nothing Clasp can do to avoid it, so I would call it WONTFIX from Clasp's perspective. If you want to rearrange your code to avoid the problem, that's something else.

I think I understand your code better now. I will try cooking up an example like yours that's showing bad behavior despite you working around the sub-byte clobbering.

P.S. is there a simple way to get the element size as a number for a given array in Clasp? For example, a function that given an array of 16-bit ints (signed or unsigned) would return 16.

Not really. You could look for the ext:byte4 etc., but we're probably going to expand the set of upgraded element types in the future, so that would be fragile. The conforming and reliable way would be to do a bunch of (subtypep (array-element-type arr) (unsigned-byte n)) type checks, but of course that's not simple. We could add a special retrieval function as an extension, but that wouldn't be portable.

Bike commented 2 years ago

first off, i rewrote my sbcl example there to work in clasp, and to not cons in the loop. clobbering with (unsigned-byte 4) element type is easily observed. i have not seen it with (unsigned-byte 8). i will try the lparallel chunking code in a bit.

phantomics commented 2 years ago

You can see some sample chunking code in my non-April example above, in the (let ((iterations ... form. The iterations value is the chunk size, i.e. the number of iterations to perform in a given thread.

Bike commented 2 years ago

What is size? array-total-size?

phantomics commented 2 years ago

Yes, it's a shorthand from the array-operations package.

phantomics commented 2 years ago

Also, you may want to hold off for the moment, I'm getting April to recognize Clasp's symbol-based integer types, which may solve the problem.

Bike commented 2 years ago

Assumed it's array-total-size, also fixed up the element types, and tried the following reproducer that does not use anything outside of Clasp: test.txt

with runner

(loop with s = (test-serial)
      for p = (test-parallel)
      repeat 1000
      unless (equalp p s) return p)

After several tens of thousand of iterations I got a hit

#2A((1 0 0 3 3 3)
    (4 0 0 1 1 1)
    (2 0 0 4 0 0)
    (0 0 0 0 0 0)
    (0 0 0 0 0 0)
    (0 0 0 0 0 0)
    (0 0 0 0 0 0)
    (0 0 0 0 0 0)
    (0 0 0 0 0 0)
    (0 0 0 0 0 0))

which seems close enough to the failure you showed earlier.

So, looking at this. First thing is, I don't think we are ever going to be able to guarantee a lack of elementwise interference on (unsigned-byte 4) arrays, in any circumstances. Chunking might happen to work sometimes, but it depends on a bunch of stuff like how large machine words are, how large the storage Clasp chooses is, how Clasp and clang actually lay out the storage in memory, etc. (I mean, maybe we don't choose to do it row-majorly, for one.) If you want to safely access an array without using locks, you'll probably need an element type of at least eight bits, at minimum.

Secondly, from my test here there does genuinely seem to be some kind of interference between the 64-bit storage words. That's bad, but only kind of bad, because again, I don't think Clasp can sanely guarantee that ub4 access will work.

If you do find interference/clobbering with ub8 or higher arrays, that will definitely be a bug, though.

phantomics commented 2 years ago

The cause of the problem I was having was that April expected integer arrays to have lists as their element types, (unsigned-byte X) and so forth. I added a Clasp-specific clause to check for the byteX and integerX symbols and now they are passed to the chunking branch of the code which wasn't happening before, hence the clobbering. Now that chunking is active I haven't seen problems across many tests.

By the way, the error you're quoting above doesn't appear to be a collision issue to me. Something caused only one thread to operate. Notice that the last nonzero element is the 16th (row-major). The first thread completed the first chunk, but for some reason the subsequent threads didn't do anything, assuming this is an (unsigned-byte 4) array and the chunks are 16 elements. Collisions result in random zeroes peppered in the array, where two threads tried to write the same thing at once. All zeroes past a certain point means that no copying happened there. It would be worth looking into this, could be a problem handling threads.

If you've tried running many iterations and haven't found any zero-peppered arrays, it seems the chunking might work after all. I can do some more testing and see if I find any problems, in that case I can just make operations on sub-8-bit arrays singlethreaded.

clasp-developers / clasp

Problem with multithreaded iteration over non-T-type array using lparallel #1249