Higher abstraction level?

Reneg973 commented 8 years ago

After doing some things and tests with vexcl, I am very soon in the position that I need more. Especially I'm thinking about stream processing and pipelining, overlapped transfer and more and I found there is a lot to do to make this work correctly. Searching the internet I found a good publication about an algorithmic framework called marrow (source @ bitbucket), which handles such things. But after a quick look I saw a lot of C++ drawbacks, e.g. there is no const correctness at all.

Dennis, do you think there is a way of combining these both worlds?

ddemidov commented 8 years ago

Hi Rene,

The main design goal of VexCL was to make GPGPU developement as painless as possible, but since VexCL is based on standard API(s), and does not hide the underlying objects from its users, it should be possible to use more advanced features of its backends.

stream processing and pipelining

What exactly do you have in mind? OpenCL 2.0 pipes, or organizing individual kernels/tasks into a dataflow-like structure? In the former case, I have no idea how to expose the functionality in a generic way, so for now the only choice is writing custom kernels. And in the latter one could in principle use OpenCL command queues and events to explicitly describe dependencies between tasks. There is no direct support for this in VexCL at this point either.

overlapped transfer

I think you can already do this if you dedicate one command queue for computations and the other for data transfer. vex::backend::device_vector class can enqueue read/write operations to a specific command queue.

a lot of C++ drawbacks, e.g. there is no const correctness at all.

Could you please open an issue with a specific problem?

Reneg973 commented 8 years ago

Excuse me, I was not detailed enough. I meant do you plan to Support Events somehow so that it is possible to synchronize Queues. E.g. my processing Queue enqueued some vexcl kernels. The download queue needs to wait on the last kernel event before it reads the data back to PC. The upload queue needs to wait until first kernel is finished until it is free to upload new data into the Input buffer of kernel 1. Further on kernel 1 should get signaled to start processing automatically if uploading was finished. But uploading can only start if kernel 1 is already finished and last kernel must wait as Long as the downloading Queue reads the output buffer.

Could you please open an issue with a specific problem?

No I didn't meant C++ drawbacks in vexcl, but in marrow. VexCL is very nice and I already learned a lot.

I already did some small changes like supporting pinned memory in the device vector (it's as twice as fast on my NVidia Card) and using the map() function of the opencl::device_vector.

But before modifying the opencl::kernel I want to be sure you didn't plan to modify it with synchronization events :)

ddemidov commented 8 years ago

One way I can think of making this possible, is to introduce a thin wrapper for the expression being assigned to. Say, something like vex::let(lhs_expr, event_wait_list = null, event = null). It would return a wrapper struct with an assignment operator that would delegate the work to the usual assign_expression function while providing some extra information. Then the following would work:

let(x, null, eventA) = sin(y) * cos(y); // submit the kernel, return its event in eventA
let(z, eventA, null) = x * y; // submit the kernel, make sure it starts after eventA is done

This would only make sense if the kernels are submitted to separate queues withing the same context. That is, x and z should be associated with separate queue. Or we could also provide an overload of the let() function taking a queue to submit the assignment into.

If you are willing to work on this, I can provide further implementation hints.

ddemidov commented 8 years ago

I already did some small changes like supporting pinned memory in the device vector (it's as twice as fast on my NVidia Card) and using the map() function of the opencl::device_vector.

I think this constructor of vex::vector should support allocating pinned memory directly with the appropriate mem_flags.

Reneg973 commented 8 years ago

Yes of course using pinned Memory is already possible. I just wrote a wrapper which acts like the map() function, but adds a functor parameter.

void write(const cl::CommandQueue &q, const std::function<void(T* device, size_t size)>& fnUpload)
{
  assert(q.getInfo<CL_MEM_FLAGS>() & CL_MEM_ALLOC_HOST_PTR);  // should be used with pinned memory
  T* ptr = reinterpret_cast<T*>(q.enqueueMapBuffer(buffer, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION, 0, size() * sizeof(T)));
  fnUpload(ptr, size());
  q.enqueueUnmapMemObject(buffer, ptr);
}

So I could write sth. like this: buffer.write(q, std::bind(&CData::Load, this)); and the data loaded from harddisk is directly written into device mem.

Reneg973 commented 8 years ago

It would return a wrapper struct with an assignment operator that would delegate the work to the usual assign_expression function while providing some extra information.

If I understand correctly, the wrapper than works with an user event which is set to complete after the kernel itself has been finished. And it is not possible to forward the given event to q.enqueueNDRangeKernel(K, m_offset, g_size, w_size); within void kernel::operator()(const cl::CommandQueue& q) Am I right?

ddemidov commented 8 years ago

So I could write sth. like this: buffer.write(q, std::bind(&CData::Load, this)); and the data loaded from harddisk is directly written into device mem.

Isn't it the same as

this->load(buffer.map(), buffer.size());

?

Reneg973 commented 8 years ago

lol yes, took me 2mins to think about it. Seems I'm thinking much too complicated.

ddemidov commented 8 years ago

If I understand correctly, the wrapper than works with an user event which is set to complete after the kernel itself has been finished. And it is not possible to forward the given event to

We could introduce a method for the backend::kernel class that would set the events to take into account, similar to how its possible to configure the grid size with kernel::config. And we should also add event parameters to assign_expression().

Reneg973 commented 8 years ago

This way I could also calculate a histogram with a custom kernel within the processing queue and wait for the histo_ready_event in the download queue to transfer it back to host while the processing queue is doing other things. Sounds like a good extension for me

ddemidov commented 8 years ago

After some thought I decided not to implement vex::let() assignment wrapper. It would not cover all kinds of kernels vexcl launches (vector expressions, additive expressions, multiexpressions, reductions, etc). It would also require a lot of changes in the core code, some of which would bring an overhead even for the users not using the new functionality.

Instead, I've provided thin backend-independent wrappers for events, markers, and barriers. Using those one can create a dependency graph of arbitrary complexity in a generic way. See example in tests/events.cpp. There, x is associated with queue q1, y is associated with q2, where both queues belong to the same context. Without the barrier at line 24 the test fails (with AMD platform; NVIDIA seems to serialize the kernel execution anyway). Barrier makes q2 wait for completion of x update in q1.

Same approach could be used to make transforms overlap with compute.

Reneg973 commented 8 years ago

Looks good and understandable. Is the finish() required?

ddemidov commented 8 years ago

That finish is only there to make sure x has some predictable values before the experiment.

Cheers, Denis On Mar 19, 2016 11:01, "Rene" notifications@github.com wrote:

Looks good and understandable. Is the finish() https://github.com/ddemidov/vexcl/blob/master/tests/events.cpp#L20 required?

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/ddemidov/vexcl/issues/196#issuecomment-198663153

Reneg973 commented 8 years ago

Yes of course, but "x = 2;" is a second kernel and can only start if x already was completely filled with 1 if I understood those asynchronuous kernels correctly. So I think it's superfluous to call it.

ddemidov commented 8 years ago

Y=x has no such restriction and may start before x=1 finished. On my machine x may in fact get same memory area as in the last run, which may still contain twos. So the test may accidentally pass (without the barrier).

ddemidov commented 8 years ago

f9baf22 lets the user explicitly chose what queue to submit a transfer operation into. tests/events.cpp shows an example of using several queues to overlap compute and transfer and using events to make sure the data is actually ready.

Reneg973 commented 8 years ago

Okay ;), strange I didn't miss it but it's of course useful. Afaik the command_queue vector in your copy functions can also be const, and you could merge both write/read functions by using a default empty vector

ddemidov commented 8 years ago

the command_queue vector in your copy functions can also be const

It won't work for Boost.Compute backend, which is stricter than Khronos C++ bindings re const correctness.

you could merge both write/read functions by using a default empty vector

That could be a good idea, I'll think about it. I did not want to introduce extra checks into the original implementation of the read/write methods, but the cost should be negligible w.r.t. the OpenCL/CUDA API calls.

ddemidov / vexcl

Higher abstraction level? #196