alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration :llama:
https://alpaka.readthedocs.io
Mozilla Public License 2.0

simplification of the user interface #944

Open psychocoderHPC opened 4 years ago

psychocoderHPC commented 4 years ago

Motivation

Alpaka gives users a high degree of flexibility and freedom in their implementations. Alpaka is explicit everywhere and can therefore be controlled at a fine granularity. However, for (IMO) 90% of users, simple usage - at least to get started with a library - is very important.

Proposal

From a user's point of view, alpaka is currently very hard to use. Therefore I would like to take a first step and propose a version of our vector-add example written against a pseudo API called simple.

Example

#include <cstddef>   // std::size_t
#include <cstdint>   // std::uint32_t
#include <iostream>  // std::cerr

using namespace alpaka;

// Define the accelerator
// - GpuCuda
// - CpuThreads
// - CpuFibers
// - CpuOmp2Threads
// - CpuOmp2Blocks
// - CpuOmp4
// - CpuTbbBlocks
// - CpuSerial
// - GpuHip
constexpr size_t dim = 1;
auto numDevices = simple::acc::count(simple::acc::GpuCuda);
// an accelerator instance is bound to a device (selected here by index)
auto acc = simple::Acc<dim>(simple::acc::CpuSerial, numDevices % 1);
auto queue = simple::Queue(acc, simple::queue::NonBlocking{});

// Define the work division
size_t const elementsPerThread(3u);
// index type and number of elements used throughout this example
using Idx = std::size_t;
Idx const numElements(123456u);
// create an array of extent indices based on the accelerator dimension
auto const extent = simple::mem::Array(acc, numElements);

// Define the buffer element type
using Data = std::uint32_t;

// Get the host device for allocating memory on the host.
auto const accHost = simple::Acc<dim>(simple::acc::CpuSerial, numDevices % 1);

// Allocate 3 host memory buffers
auto bufHostA = simple::mem::Vector(accHost, extent);
auto bufHostB = simple::mem::Vector(accHost, extent);
auto bufHostC = simple::mem::Vector(accHost, extent);

// Random generator for uniformly distributed numbers in {1,..,42}
auto engine = simple::rng::Engine(accHost, simple::rng::engine::MersenneTwister{}, 42u);
auto rng = simple::rng::distribution::Uniform<Data>(1, 42);

// Initialize the host input vectors A and B and zero the result vector C
for (Idx i(0); i < numElements; ++i)
{
    bufHostA[i] = rng(engine);
    bufHostB[i] = rng(engine);
    bufHostC[i] = 0;
}

// Allocate 3 buffers on the accelerator
auto bufAccA = simple::mem::Vector(acc, extent);
auto bufAccB = simple::mem::Vector(acc, extent);
auto bufAccC = simple::mem::Vector(acc, extent);

// Copy Host -> Acc
simple::mem::copy(queue, bufAccA, bufHostA); // copy full buffer
// also copies the full array, but shows the possibility to pass a slice with an offset
simple::mem::copy(queue, bufAccB, bufHostB, simple::mem::slice(0,extent));
simple::mem::copy(queue, bufAccC, bufHostC, extent);

// Let alpaka calculate good block and grid sizes given our full problem extent
auto workDiv = simple::workdiv::makeValid(
    acc, extent, elementsPerThread, false, simple::workdiv::restriction::None{});

// Create the kernel execution task.
auto const kernel = simple::kernel(
    acc, VectorAddKernel{}, workDiv,
    bufAccA.iterator(acc), bufAccB.iterator(acc), bufAccC.iterator(acc),
    numElements);

// Enqueue the kernel execution task
simple::queue::enqueue(queue, kernel);

// Copy back the result
simple::mem::copy(queue, bufHostC, bufAccC);
simple::sync::wait(queue);

bool resultCorrect(true);
for(Idx i(0u); i < numElements; ++i)
{
    Data const & val(bufHostC[i]);
    Data const correctResult(bufHostA[i] + bufHostB[i]);
    if(val != correctResult)
    {
        std::cerr << "C[" << i << "] == " << val << " != " << correctResult << std::endl;
        resultCorrect = false;
    }
}

Additional option

Even though I show here a self-assembled interface based on the current usage of alpaka, we should think about deriving an interface based on the SYCL standard. I have also thought a lot about creating a SYCL frontend with alpaka as a backend, but the SYCL API relies a lot on runtime polymorphism (which is maybe removed by the SYCL compiler).
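
As a concrete illustration of that runtime polymorphism (a sketch against the SYCL 1.2.1 API, not alpaka code; my_selector is a made-up name): device selection goes through a virtual call on a device_selector.

#include <CL/sycl.hpp>

// SYCL 1.2.1 picks a device through a runtime-polymorphic selector:
// operator() is virtual and is called once per available device.
struct my_selector : cl::sycl::device_selector
{
    int operator()(cl::sycl::device const& dev) const override
    {
        return dev.is_gpu() ? 1 : 0; // prefer GPUs, accept anything else
    }
};

int main()
{
    cl::sycl::queue myQueue{my_selector{}};
    // ... submit command groups to myQueue
}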

CC-ing: @ComputationalRadiationPhysics/alpaka-developers @ComputationalRadiationPhysics/alpaka-maintainers

tdd11235813 commented 4 years ago

Thanks a lot for the inspiring concept work. A simpler frontend would help not only newcomers; it also makes coding more productive and readable. Code examples would then hopefully fit on a few slides ;-)

Alpaka buffer: Perhaps it is sufficient to pass the buffer instead of the iterator, because the buffer already has all the information. But I see the idea that you can manage the accelerator-dependent access methods that way. I am not sure how the kernel interface should look; here is an example of how SYCL deals with it:

    buffer mybuffer_d(mybuffer_h);
    queue myQueue;
    command_group(myQueue, [&]()
    {
      // Data accessors
      auto a = mybuffer_d.get_access<access::read>();
      // Kernel
      parallel_for(count, kernel_functor([=](id<> item) {
        int i = item.get_global(0);
        // ... do something with a[i]
      }));
    });

For the record, SYCL is fully C++-compliant, and with a plain C++ library you could map SYCL code to an OpenMP backend using an ordinary C++ compiler (IIRC Codeplay had an OpenMP version, but at a very early stage). The question is whether we can simply move the relevant runtime polymorphism to compile time, and how the API would look then.
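
A minimal sketch of what moving that runtime polymorphism to compile time could look like (all names here are made up, this is neither alpaka's nor SYCL's API): the backend becomes a tag type, so overload resolution replaces virtual dispatch and an ordinary C++ compiler sees the concrete backend.

#include <cstddef>

// made-up backend tags, standing in for a runtime-polymorphic device selector
struct TagCpuSerial {};
struct TagCpuOmp2Blocks {};

// the backend is chosen at compile time via the tag type; no virtual calls
template<typename TKernel>
void submit(TagCpuSerial, TKernel const& kernel, std::size_t extent)
{
    for(std::size_t i = 0; i < extent; ++i)
        kernel(i);
}

template<typename TKernel>
void submit(TagCpuOmp2Blocks, TKernel const& kernel, std::size_t extent)
{
    #pragma omp parallel for
    for(std::size_t i = 0; i < extent; ++i)
        kernel(i);
}

int main()
{
    submit(TagCpuSerial{}, [](std::size_t /* i */) { /* ... */ }, 16);
}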

(Btw, besides the integration of the buffer concept, we also have to consider multiple platform levels, because the backend can be heterogeneous itself. This means that you traverse not only the devices but also the platforms, as OpenCL and SYCL do. This is a separate issue though.)
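
For reference, this is the platform/device hierarchy meant here, sketched with the SYCL 1.2.1 API (illustration only, not a proposal for alpaka's interface):

#include <CL/sycl.hpp>
#include <iostream>

int main()
{
    // enumerate the platforms first, then the devices of each platform
    for(auto const& platform : cl::sycl::platform::get_platforms())
    {
        std::cout << platform.get_info<cl::sycl::info::platform::name>() << '\n';
        for(auto const& device : platform.get_devices())
            std::cout << "  " << device.get_info<cl::sycl::info::device::name>() << '\n';
    }
}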

First, the accelerator and memory objects/types should be tackled, I guess. If implicitness is added, the question is whether the simpler API should be a separate layer on top of alpaka rather than only a refactoring of the existing code. A separate layer would also keep legacy code working as long as possible, but it creates more code. There are changes, such as the buffer or kernel interface, that involve refactoring the alpaka core though.

We probably have to evolve through multiple designs, but I would love to see this come alive, because especially with C++17 features it should be doable for alpaka to become a modern, productive interface like SYCL (or an even better and more performant one).

So how would you like to proceed? I guess we first need the whole design and its pitfalls before we can implement the actual thing.

sbastrakov commented 4 years ago

My 2 cents on this topic.

My understanding is that alpaka deliberately exposes its API as free functions rather than in an object-oriented style, so that e.g. putting a task into a queue looks like simple::queue::enqueue(queue, kernel); and not like queue.enqueue(kernel); (using the simple API from this issue, but the same holds for existing alpaka). Compared to the object-oriented style, the free-function API makes it harder for a user to discover the Queue interface, and it involves more typing and more mistakes with wrong namespaces.

I think the free-function style has a potential advantage: it can substitute default parameters more easily. Theoretically, there could be a default queue for each device, CUDA-style, and if no queue is specified the default one would be used. This would make life easier for new users and for simple examples where only one queue is needed. However, alpaka currently does not make use of this at all, and I guess it does not fit its explicit specify-each-detail style of interface. I assume there were discussions on this during the original development that I am simply not aware of, and that the matter is more complicated than I just wrote; I am just sharing my thoughts.
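
To illustrate the default-parameter point, a toy sketch only (none of these names exist in alpaka): a free-function enqueue can offer an overload that falls back to a per-process default queue, CUDA-style, while the explicit form stays available.

#include <iostream>

struct Queue { int id; }; // toy queue type, not alpaka's

Queue& defaultQueue()
{
    static Queue q{0}; // one implicit default queue, CUDA-style
    return q;
}

template<typename TTask>
void enqueue(Queue& queue, TTask const& task) // explicit queue
{
    std::cout << "task on queue " << queue.id << '\n';
    task();
}

template<typename TTask>
void enqueue(TTask const& task) // queue omitted -> use the default queue
{
    enqueue(defaultQueue(), task);
}

int main()
{
    Queue q{1};
    enqueue(q, [] { /* ... */ }); // explicit
    enqueue([] { /* ... */ });    // implicit default
}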

j-stephan commented 4 years ago

If (big if) we are basing this on the SYCL API, is there any reason not to just turn alpaka into an actual implementation of SYCL? Minus the OpenCL interoperability, because that would require a working OpenCL implementation on the executing system.

ax3l commented 4 years ago

We discussed this before in an issue and it's definitely possible, just a matter of priorities of the project and available resources.

bernhardmgruber commented 4 years ago

On simplifying namespaces in alpaka: https://github.com/alpaka-group/alpaka/issues/1034

bernhardmgruber commented 4 years ago

On the Idx type: https://github.com/alpaka-group/alpaka/issues/1035

frobnitzem commented 3 years ago

+1 for using object-oriented style. Especially:

This morning I wasted an hour trying to write a class method that creates a kernel and enqueues it. classInstance.runKernel(queue) can't be defined, because classInstance needs to know Acc and not just alpaka::Dev<Acc>. Apparently there is no way to get Acc or Dev from an alpaka::Queue queue, since alpaka::Queue doesn't exist. Honestly, why not define alpaka::Queue for each accelerator by template specialization? Then at least type matching on the function argument would work. The examples don't document the expected idioms very well.

sbastrakov commented 3 years ago

I agree with the general sentiment, and judging by this topic I guess most alpaka contributors do.

I think the main difference is not between the queue.enqueue(kernel) and enqueue(queue, kernel) styles. If Queue were a concrete type (not a template taking Acc and property types), those would not be that different; in the sense of the Interface Principle, both would be part of that imaginary class Queue.

I feel a larger issue is that there is no concrete Queue class. It is kind of a concept, but not really, and we are on C++14. Most of alpaka's abstraction classes are in this state. Since almost nothing in alpaka is a concrete class, alpaka pushes user code interacting with it to take one alpaka type as a template parameter (e.g. Acc or Queue) and derive the rest from it when necessary, just as our examples start with the Acc type definition and then derive the rest from there.

I believe there is a way to convert between Acc, Dev and Queue types via the existing traits. Dev / DevType should work fine with Queue types as input, and Acc / AccType gives the Acc type for a given Dev type. If some combination does not work, I think that would be a bug rather than a lack of support in principle.
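
A minimal sketch of that idiom, assuming a recent alpaka version where the alpaka::Dev and alpaka::Queue alias templates resolve as described (Simulation and runKernel are made-up names): take the accelerator type as the single template parameter and derive the device and queue types from it.

#include <alpaka/alpaka.hpp>

// TAcc would be e.g. alpaka::AccCpuSerial<alpaka::DimInt<1u>, std::size_t>
template<typename TAcc>
struct Simulation
{
    using Dev = alpaka::Dev<TAcc>;                          // device type of the accelerator
    using Queue = alpaka::Queue<TAcc, alpaka::NonBlocking>; // queue bound to such a device

    void runKernel(Queue& queue)
    {
        // TAcc is known here, so a work division and an execution task for
        // this accelerator could be created and enqueued, e.g. via
        // alpaka::exec<TAcc>(queue, workDiv, kernel, args...).
        alpaka::wait(queue); // placeholder: this sketch only waits
    }
};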

j-stephan commented 3 years ago

With C++20 concepts on the horizon I'm now slightly in favour of the current API design. Once those can be used in alpaka I believe we can remove a lot of internal code without hurting alpaka's feature set.
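
For illustration, a rough sketch (made-up names, not alpaka code) of how a C++20 concept could state a queue requirement directly, instead of going through trait specializations:

#include <utility>

// made-up concept: "something a task can be enqueued to and that can be
// waited on", with enqueue/wait found via ADL as in the free-function style
template<typename TQueue, typename TTask>
concept QueueFor = requires(TQueue& queue, TTask&& task)
{
    enqueue(queue, std::forward<TTask>(task));
    wait(queue);
};

// user code constrains directly on the concept instead of deriving types
// through trait machinery
template<typename TTask, QueueFor<TTask> TQueue>
void submit(TQueue& queue, TTask&& task)
{
    enqueue(queue, std::forward<TTask>(task));
}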