chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.79k stars 420 forks source link

CopyAggregation on 32 bit platforms #24148

Open jabraham17 opened 10 months ago

jabraham17 commented 10 months ago

Recently, I updated a custom atomic aggregator in our tests to include some of the optimizations already present in the CopyAggregation package. This ended up causing a correctness issue with 32 bit platforms. This was surprising since it was nearly the same code as the CopyAggregation. One of the pieces missing was an opt-out for CHPL_COMM=none. The aggregators in CopyAggregation don't actually do any aggregation when CHPL_COMM=none. This lead me to the realization those aggregators may actually not work correctly on 32 bit platforms, since that code actually isn't being exercised in our 32 bit nightly testing.

The problem arises from a call to CTypes.allocate, where we pass an int(64) to a CTypes.c_size_t. On 32-bit platforms, this fails to compile. This is only an issue with 32 bit platforms with CHPL_COMM!=none.

A simple band-aid here is to put a nicer error message along the lines of "Aggregators are not supported on multi-locale 32-bit platforms".

mppf commented 10 months ago

What I'd like to see here is that we have more Chapel-native types that we can use instead of falling back on CTypes to optimize certain things. In this case, that would be a Chapel way to create a low-level buffer. Perhaps, we'd want wide pointer and local pointer variants. IMO, this is a long-standing TODO. See also:

e-kayrakli commented 10 months ago

A simple band-aid here is to put a nicer error message along the lines of "Aggregators are not supported on multi-locale 32-bit platforms".

A small addition could be to issue a warning when --auto-aggregation is used to say that that flag is ignored with this configuration.

ronawho commented 9 months ago

This code used to use c_malloc and friends, which took an integral and cast to a c_size_t. I would guess the copy aggregators could be updated to just do the cast directly. I don't know of any fundamental issue that would prevent them from working on 32-bit platforms.

FWIW I agree with the desire for a better low-level buffer. This code uses c_ptr for two reasons -- wanted to align allocations to avoid false-sharing and know that some things are only used locally and wanted to prevent the compiler from using wide-pointers. May also have been a minor benefit from the faster c_ptr indexing, but alignment and avoiding wide-pointers were the big ones. The last comment in https://github.com/Cray/chapel-private/issues/5787 also has a little discussion about data structures that would simplify writing new aggregators.