Intrepid / upc-specification

Automatically exported from code.google.com/p/upc-specification

Clarification: data tearing and read/write ordering #61

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What guarantees should/can the UPC language specification offer with respect to 
"data tearing", or the reading/writing of data that may be implemented as more 
than one aligned read/write to main memory?

(The following notes are transcribed from some suggestions made by Steve 
Watanabe [Boostpro].)

- a normal scalar access must resolve to a single
memory operation.
- an unaligned scalar access may create multiple
memory operations.
- a bitfield access may create multiple memory operations.
- a bitfield write may read and write adjacent bitfields.
- an aggregate is accessed member-wise.
- operators that both read and write the same scalar value,
such as the increment operator, create both a read
and a write memory operation.

The aggregate case might be relaxed further:

- The order in which the members are accessed
is unspecified and need not be consistent
across threads.
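
For illustration, here is a minimal UPC sketch of where these rules would
apply (the declarations are hypothetical, not part of the proposal):

    #include <upc.h>

    struct pair { int x; int y; };

    shared int counter;        /* normal scalar: a single memory operation      */
    shared struct pair p;      /* aggregate: accessed member-wise               */

    void example(void)
    {
        counter++;             /* read-modify-write: one read plus one write,
                                  not a single atomic memory operation          */
        struct pair tmp = p;   /* p.x and p.y are read as separate operations;
                                  under the relaxed rule, the order of the two
                                  reads is unspecified                          */
        (void)tmp;             /* silence unused-variable warnings              */
    }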

The problem case is shared scalars whose size
is greater than what the underlying hardware
supports, e.g., __int128_t on a 64-bit system
or long long on a 32-bit system. From a language
consistency point of view I'd like it to be 
a single memory operation. Having something 
like the rules for signal handling in C would
be a real nightmare. (Only variables of type
volatile sig_atomic_t are guaranteed to be
valid in a signal handler.)

On the other hand, we'd also like to avoid the overhead
of implementing atomic operations for large
scalar types. At the very least, some base
set of arithmetic types needs to have atomic
load/store guaranteed.
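
A hedged sketch of the problem case (the type and names are illustrative): on
a 32-bit target, a single 64-bit store written at the language level may be
issued as two 32-bit stores, so a concurrent reader can observe a torn value.

    #include <upc.h>

    shared long long big;              /* 64-bit object on a 32-bit platform    */

    void writer(void) {
        big = 0x1111111122222222LL;    /* may be issued as two 32-bit stores    */
    }

    void reader(void) {
        long long v = big;     /* without a tearing guarantee, a race with
                                  writer() can yield a mix of the old and new
                                  32-bit halves                                 */
        (void)v;
    }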

Original issue reported on code.google.com by gary.funck on 17 Jul 2012 at 5:52

GoogleCodeExporter commented 9 years ago
As an implementer I loathe the idea of, for instance, making access to 64-bit 
"double" atomic on a 32-bit platform.  I also agree, however, that the C signal 
handling idea that ONLY one specific type is atomic is pretty much useless for 
any concurrent programming, including not just UPC but also pthreads, etc.

So, I am fine with the proposal *IF* the first bullet is changed from
 - a normal scalar access must resolve to a single memory operation.
to
 - a scalar access up to an implementation-defined size and with implementation-defined alignment must resolve to a single memory operation.

Note 1: "implementation-defined" means the implementation is required to 
DOCUMENT the size and alignment restrictions

Note 2: there are "broken" ABIs, such as for PPC64 on AIX, where the CPU word 
size is 64 bits, but 64-bit "double" and "long long" are given only 4-byte 
alignment!  This is a platform where the "implementation-defined alignment" 
would be used to state what might otherwise not be obvious to the user.
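
As a sketch of how the "implementation-defined" wording might surface to users,
an implementation could document its limits via macros; the macro names below
are hypothetical and not part of any UPC specification:

    /* Hypothetical, implementation-documented limits (names invented here):
     *   UPC_IMPL_SINGLE_ACCESS_SIZE  - largest scalar read/written in one operation
     *   UPC_IMPL_SINGLE_ACCESS_ALIGN - alignment required for that guarantee
     */
    #if defined(UPC_IMPL_SINGLE_ACCESS_SIZE) && (UPC_IMPL_SINGLE_ACCESS_SIZE >= 8)
      /* 64-bit shared scalars such as double are accessed in one operation     */
    #else
      /* fall back to explicit synchronization (locks or AMOs) for 64-bit data  */
    #endif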

Original comment by phhargr...@lbl.gov on 17 Jul 2012 at 7:36

GoogleCodeExporter commented 9 years ago
Set default Consensus to "Low".

Original comment by gary.funck on 19 Aug 2012 at 11:26

GoogleCodeExporter commented 9 years ago
Change Status to New: Requires review.

Original comment by gary.funck on 19 Aug 2012 at 11:37

GoogleCodeExporter commented 9 years ago
I will retain ownership of this issue.

Original comment by gary.funck on 19 Sep 2012 at 5:04

GoogleCodeExporter commented 9 years ago
Note that bit-fields are technically scalars (like all integer types), so it'd 
be nice to qualify that a bit more.  I don't know that it's reasonable to 
require bit-field updates to be tear-free.
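
A small illustration of why this is hard to require (hypothetical struct, not
from the comment): updating one bit-field typically reads and rewrites the
whole storage unit, so a racing update of a neighboring field can be lost.

    struct flags {
        unsigned a : 4;
        unsigned b : 4;             /* shares a storage unit with 'a'            */
    };

    shared struct flags f;

    void thread0(void) { f.a = 1; } /* read unit, modify bits of 'a', write back */
    void thread1(void) { f.b = 2; } /* thread0()'s write-back can overwrite this
                                       update if the two stores race             */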

Original comment by sdvor...@cray.com on 21 Sep 2012 at 7:59

GoogleCodeExporter commented 9 years ago
"The problem case is shared scalars whose size is greater than what the 
underlying hardware supports."

The problem is actually worse than stated in comment 0. There are also 
architectures that can data tear in the opposite direction. Specifically, when 
performing a write of size SMALLER than the hardware word size, they do a 
read-modify-write of a larger size (word or even cache line) and the writeback 
can therefore clobber concurrent writes to the word data surrounding the small 
write performed at the language level. This affects bitfield writes on almost 
every architecture, but can also affect byte writes on certain systems. Most 
architectures include a byte mask in the writeback so the memory controller 
only writes the actual dirty bytes, but I'm not sure we should assume that's 
universally available.
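
A sketch of this "opposite direction" case (the array name is hypothetical):
each thread stores only its own byte, yet hardware that performs a word- or
line-sized read-modify-write can write back stale neighboring bytes and lose
another thread's concurrent store.

    #include <upc.h>

    shared [] char done[8];    /* adjacent bytes, all with affinity to thread 0,
                                  likely sharing a single machine word           */

    void mark(void) {
        if (MYTHREAD < 8)
            done[MYTHREAD] = 1;    /* a byte store at the language level; a wider
                                      read-modify-write underneath can clobber a
                                      neighboring thread's simultaneous store    */
    }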

Because of these competing tensions, some architectures may support atomic, 
tear-free writes of only a single data size, and only when aligned. 
This is why C99 only requires implementations to provide tear-free updates of a 
single type (sig_atomic_t sec 7.14). UPC technically inherits sig_atomic_t, but 
C99 explicitly allows this type to be volatile-qualified (read "completely 
unoptimized"). Also there is no guarantee on the range of values this type can 
hold (read "portability problem"), and in any case it's definitely an integer 
type, which rules out floating-point values. Overall, this is probably not a 
type we should be teaching HPC users to use for their main data structures.
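
For reference, the C99 guarantee under discussion amounts to no more than the
following (plain C, no UPC extensions):

    #include <signal.h>

    volatile sig_atomic_t ready;   /* the one type C99 (sec 7.14) singles out as
                                      safe to access as an atomic entity; it is an
                                      integer type of unspecified range, and the
                                      volatile qualifier blocks most optimization */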

I agree with Paul that we should not attempt to provide a universal 
guarantee of tear-free memory operations - such a guarantee could make UPC 
unimplementable on many architectures of interest. I think the best we can 
universally require is a single "implementation-defined" type that will be 
tear-free - but this basically brings us back to sig_atomic_t, which is already 
available.

Overall I prefer the model of encouraging users to write programs that are 
properly synchronized (without data races that can expose tearing). 
Alternatively, if they insist upon including data races in their program, then 
encourage them to use the AMO interface, where the effects of tearing can be 
prevented by handling concurrent accesses in a principled manner within the 
library. This seems far preferable to specifying something about all concurrent 
accesses anywhere in the program (even of a certain size), which seems likely 
to imply new implementation headaches, subtle implementation bugs, and possibly 
global negative performance impacts. I suspect the standardization and wide 
availability of an AMO library will help to reduce the importance of this issue 
for many users.
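
As a sketch of the intended alternative, a racy update could instead go through
an AMO-style interface. The calls below follow the atomics interface later
standardized for UPC 1.3 (upc_all_atomicdomain_alloc / upc_atomic_strict); the
header name and exact signatures are assumed here, so consult the implementation.

    #include <stdint.h>
    #include <upc.h>
    #include <upc_atomic.h>              /* header name assumed                  */

    shared int64_t counter;

    void add_one(void)
    {
        /* Collective: every thread allocates the same domain for UPC_ADD       */
        upc_atomicdomain_t *dom =
            upc_all_atomicdomain_alloc(UPC_INT64, UPC_ADD, 0);
        int64_t one = 1;

        /* Tear-free even under races: conflicting AMOs are handled in the
           library rather than exposed to the hardware write path               */
        upc_atomic_strict(dom, NULL, UPC_ADD, &counter, &one, NULL);

        upc_all_atomicdomain_free(dom);  /* collective                          */
    }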

I move that we postpone this issue to 1.4 or later, and reconsider it once the 
standardized AMO library reaches widespread acceptance.

Original comment by danbonachea on 25 Sep 2012 at 12:08

GoogleCodeExporter commented 9 years ago
deferred to 1.4 at the 11/29 telecon

Original comment by danbonachea on 29 Nov 2012 at 7:35