ddemidov / vexcl

VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP
http://vexcl.readthedocs.org
MIT License

Run "mba" in OpenCL #126

Open skn123 opened 10 years ago

skn123 commented 10 years ago

Currently in mba_benchmark.cpp, we see that the spline creation module runs on the CPU while the interpolation is done on the GPU. This makes sense. However, would it be possible to make this portion (i.e., the spline creation module) run on the GPU as well?

ddemidov commented 10 years ago

I don't remember all the implementation details right away, but it seems it would be possible to do the setup phase in OpenCL. It could even make sense, because all the underlying structures are regular.

It will, however, take some time, because I am a bit busy at the moment.

skn123 commented 10 years ago

Denis, the setup was done using a lot of C++11 code :) That's why I was having a hard time understanding it, but I suspect it can be done. Let me know if I can help in any way.

ddemidov commented 10 years ago

I think it should be enough to implement the control lattice structure with OpenCL. See the referenced paper for the details of the algorithm.

The biggest problem is that it needs to be done in a generic way w.r.t. the number of dimensions, so one would need to do some OpenCL code generation.
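For illustration, here is a toy sketch of what "code generation" means in this context. It is not VexCL's actual generator; index_expr and the stride naming are made up for the example:

#include <iostream>
#include <sstream>
#include <string>

// Build the linear-index expression for an ndim-dimensional lattice at
// runtime, instead of hand-writing one kernel per dimensionality. The
// resulting string would be pasted into the OpenCL kernel source.
std::string index_expr(int ndim) {
    std::ostringstream s;
    for (int d = 0; d < ndim; ++d) {
        if (d) s << " + ";
        s << "i" << d << " * stride" << d;
    }
    return s.str();
}

int main() {
    std::cout << index_expr(3) << std::endl;
    // prints: i0 * stride0 + i1 * stride1 + i2 * stride2
}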

skn123 commented 10 years ago

Why not take the FFT route? Handcraft those kernels for 1, 2, and 3 dimensions and make them work on the device, and for all other dimensions keep the current host option?

ddemidov commented 10 years ago

I don't like the idea of keeping separate (but very similar) kernels when they all may be generated from a single source.

skn123 commented 10 years ago

If you are referring to the basic BA algorithm in the paper, then there are loops that can easily be "OpenMP-fied" even in the current implementation.

ddemidov commented 10 years ago

Yes. Feel free to provide a pull request :).

ddemidov commented 10 years ago

So I looked at the mba implementation a bit closer. Now I know why I decided to stay on the CPU for the initialization.

First, and least important: VexCL supports parallel work with multiple compute devices. Since MBA may take a random set of coordinates to interpolate at, each device has to hold a complete copy of the control lattice phi. It did not seem right either to duplicate the initialization work for each device, or to do the initialization on the first device and then transfer the lattice to the rest of the devices.

Second, in the initialization loop over the data points here, the temporary arrays delta and omega are updated at effectively random positions. This could lead to data races with both OpenMP and OpenCL parallelizations and would require the use of atomic operations, which would negatively affect performance.
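To illustrate the pattern (a minimal sketch, not the actual VexCL code; idx, w, delta, and omega are hypothetical stand-ins for the real structures):

#include <cstddef>
#include <vector>

// Each data point updates the accumulators at a data-dependent position, so
// two iterations may touch the same element: a race if run in parallel.
void setup_sketch(const std::vector<std::size_t> &idx,
                  const std::vector<double> &w,
                  std::vector<double> &delta, std::vector<double> &omega)
{
    // #pragma omp parallel for  // unsafe without atomics
    for (std::size_t p = 0; p < idx.size(); ++p) {
        delta[idx[p]] += w[p] * w[p];
        omega[idx[p]] += w[p];
    }
}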

For example, take a look at c217b227409a70844b24eeabd8559f3da857aa32. Here are the results of mba_benchmark with and without OpenMP parallelization:

No OpenMP

1. Capeverde (AMD Accelerated Parallel Processing)

surf(0.5, 0.5) = -4.48714e-05
surf(0.5, 0.5) = -4.48714e-05

[Profile:             8.889 sec.] (100.00%)
[  generate data:     0.047 sec.] (  0.53%)
[  GPU:               3.711 sec.] ( 41.75%)
[    setup:           2.320 sec.] ( 26.10%)
[    interpolate:     1.385 sec.] ( 15.58%)
[  CPU:               5.131 sec.] ( 57.72%)
[    setup:           2.324 sec.] ( 26.14%)
[    interpolate:     2.806 sec.] ( 31.57%)

OpenMP

1. Capeverde (AMD Accelerated Parallel Processing)

surf(0.5, 0.5) = -4.48714e-05
surf(0.5, 0.5) = -4.48714e-05

[Profile:            17.936 sec.] (100.00%)
[  generate data:     0.047 sec.] (  0.26%)
[  GPU:              12.662 sec.] ( 70.60%)
[    setup:          11.137 sec.] ( 62.09%)
[    interpolate:     1.519 sec.] (  8.47%)
[  CPU:               5.226 sec.] ( 29.14%)
[    setup:           2.322 sec.] ( 12.95%)
[    interpolate:     2.903 sec.] ( 16.19%)

Note that setup now takes 11 seconds instead of 2.3.

So unless I did something wrong here, it seems it's better to leave the current MBA setup as it is.

ddemidov commented 10 years ago

Ok, I did do something wrong here. After replacing the critical section with an atomic in ecad92c, the mba_benchmark output looks like this:

1. Capeverde (AMD Accelerated Parallel Processing)

surf(0.5, 0.5) = -4.48714e-05
surf(0.5, 0.5) = -4.48714e-05

[Profile:             8.834 sec.] (100.00%)
[  generate data:     0.046 sec.] (  0.52%)
[  GPU:               3.642 sec.] ( 41.23%)
[    setup:           2.251 sec.] ( 25.48%)
[    interpolate:     1.386 sec.] ( 15.69%)
[  CPU:               5.145 sec.] ( 58.25%)
[    setup:           2.327 sec.] ( 26.34%)
[    interpolate:     2.818 sec.] ( 31.90%)

This is better than the serial version, but only slightly. It also introduces the requirement that the iterators to the coordinates and values of the data points point at contiguous chunks of memory. I am not sure the negligible speedup is worth it. What do you think?
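(A generic sketch of the difference between the two OpenMP constructs mentioned above; this is not the actual commit, and accumulate, j, phi, and t are placeholder names:)

#include <cstddef>
#include <vector>

void accumulate(std::vector<double> &delta, std::vector<double> &omega,
                std::size_t j, double phi, double t)
{
    // Before: a critical section serializes the whole update across threads.
    // #pragma omp critical
    // { delta[j] += t * phi; omega[j] += t; }

    // After: only the individual read-modify-write operations are serialized.
    #pragma omp atomic
    delta[j] += t * phi;
    #pragma omp atomic
    omega[j] += t;
}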

skn123 commented 10 years ago

If I look at the BA algorithm in the paper, there are three main loops. If I understand correctly, the innermost loop can be parallelized using Boost.Compute (or even Bolt), as it is potentially data parallel. The third loop, where we eventually compute \phi, is parallelizable anyway. If we assume data parallelism, then, barring the boundary points, \delta_{i+k,j+l} can be computed locally. There won't be any need for inter-data communication except at the boundary locations; \delta and \omega at those locations can be computed serially once the inner locations have been computed. Your other concern, regarding multi-device operation, is valid. Hence, a mechanism should be available to opt into this parallel version (either as an ifdef or some user-specified input). That being said, anything that utilizes any form of parallelization (CPU and GPU) is always welcome. Do you think I am correct in my assumptions?

skn123 commented 10 years ago

I have another question regarding mba_benchmark.cpp:

    vex::multivector<double, 2> C(ctx, n);
    vex::vector<double> Z(ctx, n);

    vex::copy(x, C(0));
    vex::copy(y, C(1));

    prof.tic_cl("interpolate");
    for(size_t i = 0; i < m; ++i)
        Z = surf(C(0), C(1));

Now, is this the only way to copy x and y to the device? Would it be possible to allocate values to C(0) and C(1) directly without having to first allocate them on the host?

ddemidov commented 10 years ago

There are no host-allocated structures in the snippet you provided. Both vex::multivector and vex::vector are device structures, and those are allocated directly.

If what you meant to ask is whether it's possible to initialize a newly created device vector with host vector data, then the answer is yes, it is possible. See the list of vex::vector constructors.
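For example, a minimal sketch using one of those constructors (x here is a hypothetical host-side std::vector):

#include <vector>
#include <vexcl/vexcl.hpp>

int main() {
    vex::Context ctx(vex::Filter::Any);
    const size_t n = 1024;

    std::vector<double> x(n, 1.0);  // host-side data
    vex::vector<double> X(ctx, x);  // device vector initialized from host data
}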

The mba_benchmark does look a bit ugly in this regard. 56e236a fixes that.

ddemidov commented 10 years ago

Regarding the MBA algorithm: if you have a closer look at the BA algorithm on page 4 of the paper, you'll notice that there is an outer loop over the scattered data points (for each point (x_c, y_c, z_c) in P do) which updates the accumulator arrays delta and omega at random locations on each iteration. This is the problem I was talking about earlier: it would lead to a data race, and would require the use of atomic operations, which could well kill the performance gains of parallelization.

skn123 commented 10 years ago

Yes, indeed that is the case for (for each point (x_c, y_c, z_c) in P do), and that is what I meant by "data parallelism". If I were to use a CPU as an example, I would divide the data into k parts (where k is the number of cores) and run BA on the control points within each part, except for the boundary points. This is a classic case of parallelizing region-merging algorithms using the "union-find" method. If you use atomics, it will indeed kill the performance. The advantage of spline interpolation is that it only requires points from neighbors (1st, 2nd, 3rd, etc.) and not all the regions. I don't know how to go about doing data parallelism on a GPU :)

Now, for your previous comment: yes, the mba_benchmark looks much cleaner now. However, my question still remains unanswered. Let me explain my situation. I have a structured image grid with dimensions inSize[0] * inSize[1] * K. I would like to fill up x, y, and z with the respective indices. If I were to do this using OpenMP, I would do it the following way:

#pragma omp parallel for
for (size_t i = 0; i < n; ++i) {
    x[i] = (float)(i % inSize[0]);
    y[i] = (float)(((i - inSize[0]) / inSize[0]) % inSize[1]);
    z[i] = (float)mSortedBands[((i - inSize[0]) / inSize[0] - inSize[1]) / inSize[1]];
}

where mSortedBands contains index values. Ideally, I should be able to do the whole thing in a kernel. In that case, I would not have to initialize the x, y, or z values on the host and could go directly to the device.

Any pointers on how one can do this in VexCL? Maybe this is the right time to write my first OpenCL (or should I say, VexCL) kernel :)

Also, why are you not freeing the x, y vectors in mba_benchmark? How do the device vectors get deallocated? Are they smart pointers?

ddemidov commented 10 years ago

Ok, the question about data allocation is a lot clearer now. You could do this:

vex::vector<float> x(ctx, n), y(ctx, n), z(ctx, n);
auto i = vex::element_index();
x = i % inSize[0];
y = ((i - inSize[0]) / inSize[0]) % inSize[1];

Not sure about mSortedBands, but if it was a vex::vector<int>, you could do

z = vex::permutation(((i - inSize[0]) / inSize[0] - inSize[1]) / inSize[1])(mSortedBands);

ddemidov commented 10 years ago

Regarding a k-way split of the input data: how would you do it on a CPU? Would each core skip points that do not belong to its subdomain? Or would you do a sort-by-key first, where the key is the subdomain each point belongs to? It seems that on a GPU only the second of these options would make sense, but then it has worse algorithmic complexity than the original operation.

skn123 commented 10 years ago

Perfect. Then the only thing that would be of interest in this implementation would be the aspect of data parallelization. To understand this concept, take a look at Fig. 3 in this paper: http://crd.lbl.gov/assets/pubs_presos/Harrison-EGPGV2011.pdf

ddemidov commented 10 years ago

Regarding the deallocation: vex::vectors behave the same way std::vectors do. They deallocate themselves when going out of scope. That's an example of the RAII idiom.
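A short sketch of what this means in practice:

#include <vexcl/vexcl.hpp>

int main() {
    vex::Context ctx(vex::Filter::Any);
    {
        vex::vector<double> x(ctx, 1024); // device memory allocated here
        x = 42;                           // use the vector
    } // x goes out of scope here; its device memory is released automatically
}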

skn123 commented 10 years ago

So in mba_benchmark, do I explicitly have to state p.clear(), v.clear(), and so on?

ddemidov commented 10 years ago

No, you just let them go out of scope. No memory will leak.

ddemidov commented 10 years ago

The technique described in the paper by Harrison et al. (and domain decomposition in general) is suitable for fat cluster nodes or CPU cores. This is an example of coarse-grained parallelism. GPUs, on the other hand, have fine-grained parallelism, where each thread is assigned to a single data point (e.g., a matrix or vector element). So I don't think this approach could be used here.

skn123 commented 10 years ago

Thanks for the terms :) I always wondered why we cannot use a GPU for coarse-grained parallelism. However, the parallelism I am hinting at can (??) be achieved by the process of interleaving. As indicated in Sec. 3.2, "...Since each data point in P influences a set of 4 x 4 neighboring control points...", I need to ensure that when I am running the outer loop, I am only updating those control points which are not influenced by neighbors. The only way this would be possible is to use a multi-grid mechanism. Let us assume I have points x \in [0, M-1], y \in [0, N-1]. Then:

Grid 1: x_1 \in [0 : 4 : M-1], y_1 \in [0 : 4 : N-1], ...and so on for each dimension
Grid 2: x_1 \in [1 : 4 : M-1], y_1 \in [1 : 4 : N-1], ...and so on for each dimension
Grid 3: x_1 \in [2 : 4 : M-1], y_1 \in [2 : 4 : N-1], ...and so on for each dimension
Grid 4: x_1 \in [3 : 4 : M-1], y_1 \in [3 : 4 : N-1], ...and so on for each dimension

All the other operations within the main loop are localized (i.e., no communication between processes). As per your notation, I can (??) run a fine-grained parallelization for each Grid_i and then sum them all up, because that's what is actually happening in the innermost loop. What do you think? Have I interpreted this correctly?

ddemidov commented 10 years ago

Could you provide a working OpenMP prototype of your idea? Just the main loop over the data points on page 4, with some random input.

skn123 commented 10 years ago

Denis, I had an even closer look at the paper (after all this), as I was convinced of what I mentioned. Take a look at the last paragraph of page 3 in the paper (right-hand column): "...In general, we resolve multiple assignments...". Every control point \phi depends on the data points in its 4 x 4 neighborhood. Let's take the analogy of median filtering (5 x 5), which, as we know, is embarrassingly parallel. To check this, I browsed the web and came across this codebase on GitHub: https://github.com/SINTEF-Geometry/MBA

Check the functions:

void MBA::BAalg()
inline void ijst(int m, int n, double uc, double vc, int& i, int& j, double& s, double& t)

and

inline void WKLandSum2(double s, double t, double w_kl[4][4], double& sum_w_ab2) 

I have extracted all the relevant code pertaining to the BA algorithm only. We are not assuming UNIFORM_CUBIC_C1_SPLINES.

for (int ip = 0; ip < noPoints; ip++)
{
    // Map to the half open domain Omega = [0,m) x [0,n)
    // The mapped uc and vc must be (strictly) less than m and n respectively
    double uc = (data_.U(ip) - data_.umin()) * interval_normalization_factor_u;
    double vc = (data_.V(ip) - data_.vmin()) * interval_normalization_factor_v;

    int i, j;
    double s, t;
    UCBspl::ijst(m_, n_, uc, vc, i, j, s, t);

    // compute w_kl's and SumSum w_ab^2 here:
    double w_kl[4][4];
    int k, l;
    double sum_w_ab2_inv = 0.0;
    UCBspl::WKLandSum2(s, t, w_kl, sum_w_ab2_inv);

    sum_w_ab2_inv = double(1) / sum_w_ab2_inv;

    double zc = data_.Z()[ip];

    // check p. 231: k = (i+1) - floor(xc) and l = ...
    for (k = 0; k <= 3; k++)
    {
        for (l = 0; l <= 3; l++)
        {
            // compute phi_kl with equation (3)
            double tmp = w_kl[k][l];

            // 1. Originally
            double phi_kl = tmp * zc * sum_w_ab2_inv;

            // 2. Alternatively, to let it taper off more smoothly (but more efficient if permanently)
            //double t = 0.8; double phi_kl = (1.0-t)*tmp*zc/sum_w_ab2 + t*zc;

            // 3. Alternatively, with linear term
            //double alpha = 0.01; double phi_kl = (tmp*zc + alpha*(tmp - sum_w_ab2)) / sum_w_ab2;

            // And alternatively for equation (5):
            // from |w_kl|^2 to |w_kl| to get a weighted average
            // just skip the next statement

            tmp *= tmp;

            delta_(i+k, j+l) += tmp * phi_kl;
            omega_(i+k, j+l) += tmp;
        }
    }
}

// s,t \in [0,1) (but special on gridlines m and n)
// i,j \in [-1, ???
inline void ijst(int m, int n, double uc, double vc, int& i, int& j, double& s, double& t) 
{
  //int i = std::min((int)uc - 1, m-2);
  //int j = std::min((int)vc - 1, n-2);

#ifdef  UNIFORM_CUBIC_C1_SPLINES
  i = 2*((int)uc) - 1;
  j = 2*((int)vc) - 1;
#else
  i = (int)uc - 1;
  j = (int)vc - 1;
#endif

  s = uc - floor(uc);
  t = vc - floor(vc);

  // adjust for x or y on gridlines m and n (since impl. has 0 <= x <= m and 0 <= y <= n 
#ifdef  UNIFORM_CUBIC_C1_SPLINES
  if (i == 2*m-1) {
    i-=2;
    s = 1;
  }
  if (j == 2*n-1) {
    j-=2;
    t = 1;
  }
#else
  if (i == m-1) {
    i--;
    s = 1;
  }
  if (j == n-1) {
    j--;
    t = 1;
  }
#endif
}

inline void WKLandSum2(double s, double t, double w_kl[4][4], double& sum_w_ab2) 
{
    sum_w_ab2 = 0.0;
    double Bs0 = B_0(s); double Bt0 = B_0(t);
    double Bs1 = B_1(s); double Bt1 = B_1(t);
    double Bs2 = B_2(s); double Bt2 = B_2(t);
    double Bs3 = B_3(s); double Bt3 = B_3(t);

    double tmp;

    // unrolled by Odd Andersen 15. dec. 2003, for optimization
    tmp = Bs0 * Bt0; w_kl[0][0] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs0 * Bt1; w_kl[0][1] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs0 * Bt2; w_kl[0][2] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs0 * Bt3; w_kl[0][3] = tmp; sum_w_ab2 += tmp * tmp;

    tmp = Bs1 * Bt0; w_kl[1][0] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs1 * Bt1; w_kl[1][1] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs1 * Bt2; w_kl[1][2] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs1 * Bt3; w_kl[1][3] = tmp; sum_w_ab2 += tmp * tmp;

    tmp = Bs2 * Bt0; w_kl[2][0] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs2 * Bt1; w_kl[2][1] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs2 * Bt2; w_kl[2][2] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs2 * Bt3; w_kl[2][3] = tmp; sum_w_ab2 += tmp * tmp;

    tmp = Bs3 * Bt0; w_kl[3][0] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs3 * Bt1; w_kl[3][1] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs3 * Bt2; w_kl[3][2] = tmp; sum_w_ab2 += tmp * tmp;
    tmp = Bs3 * Bt3; w_kl[3][3] = tmp; sum_w_ab2 += tmp * tmp;

//     int k, l;
//     sum_w_ab2 = 0.0;
//     for (k = 0; k <= 3; k++) {
//         for (l = 0; l <= 3; l++) {
//             double tmp = w(k, l, s, t);
//             w_kl[k][l] = tmp;
//             sum_w_ab2 += (tmp*tmp);
//         }
//     }
}

ddemidov commented 10 years ago

I was not saying that you are wrong; I merely asked for a working prototype of your parallelization idea, to see what effect it would have on performance and memory footprint. I don't see it in the code snippet you provided.

Cheers, Denis

skn123 commented 10 years ago

I will take a crack at it. I would not, however, be able to translate it to VexCL, but I think I can do it using OpenMP.

ddemidov commented 10 years ago

An OpenMP implementation should be enough. Also, you don't need to implement the full BA algorithm; a parallelization of the loop on p. 4 is enough.

skn123 commented 10 years ago

Since you do not like "debug" environments (I read somewhere else that you like to plan your code incrementally!), let us discuss this here :) I can foresee that the parallelization may lead to an increased memory load, but if it is worth it, then so be it.

The first part of the loop in my previous comment is perfectly parallelizable, as it does not depend on anything else (i.e., it is only dependent on u, v / i, j).

The problem arises with the innermost loop. So let us address that first.

Suppose I have a grid size of (m, n). How I would go about doing this would be as follows:

a.) Create empty \delta and \omega arrays.

b.) As each control point depends on only its 4 x 4 neighborhood, I will divide the outer loop into two for loops (dunno how VexCL would do it though), as:

    for (int i_idx = 0 + offset1; i_idx < m; i_idx += 4) // offset1 = 0, 1, 2, 3
        for (int j_idx = 0 + offset2; j_idx < n; j_idx += 4) // offset2 = 0, 1, 2, 3
        {
            // run the entire pipeline on p. 4 (including the inner loop) and
            // update all the corresponding locations of the \delta and \omega
            // matrices for all points in parallel
        }

Here we are only updating 1/16 of the original grid. As we have already allocated memory for \delta and \omega, we can safely update them concurrently without any race conditions, since the grid points being updated at index location (i, j) are completely outside the influence zone of the other corresponding grid locations.

c.) Now, if this can be ported to a kernel, we need an outer loop that addresses all 16 offsets that arise. This would then be very similar to how you write your interpolation loop, calling the kernel 100 times in mba_benchmark.cpp.

So, pseudocode for this would be (if I assume integer grid locations for now):

    // offsets: 0 : (0,0), 1 : (0,1), ..., 15 : (3,3)
    std::map<int, std::pair<int, int> > offsets;
    // explicitly isolate the indices on the host, or do it implicitly on the
    // device; store the indices in a std::map<> myIndices
    // allocate memory for \delta and \omega
    for (i = 0; i < offsets.size(); ++i)
    {
        // OpenCL kernel operation
        BAAlg(myIndices[i], \delta, \omega, \grid_data); // if done explicitly
        // or
        BAAlg(i, \delta, \omega, \grid_data);            // if done implicitly
    }

I have written this as a C++-style function. I presume \grid_data will be copied to the device prior to calling this function, and so will \delta and \omega.
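A minimal OpenMP sketch of this offset idea (hypothetical: it assumes delta and omega have m * n elements, and the stand-in update below replaces the real BA pipeline):

#include <cstddef>
#include <vector>

// Process a 2D lattice of size m x n in 16 "colors". Cells with the same
// (offset1, offset2) are at least 4 apart in each dimension, so their 4x4
// influence zones are disjoint and delta/omega can be updated without races.
void ba_colored(std::vector<double> &delta, std::vector<double> &omega,
                int m, int n)
{
    for (int offset1 = 0; offset1 < 4; ++offset1)
        for (int offset2 = 0; offset2 < 4; ++offset2) {
            #pragma omp parallel for collapse(2)
            for (int i = offset1; i < m; i += 4)
                for (int j = offset2; j < n; j += 4) {
                    // Stand-in for the real BA update, which would touch the
                    // whole 4x4 neighborhood of (i, j); the neighborhoods are
                    // still disjoint between any two cells of the same color.
                    std::size_t c = (std::size_t)i * n + j;
                    delta[c] += 1.0;
                    omega[c] += 1.0;
                }
        }
}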

Does this make sense?

ddemidov commented 10 years ago

So you will need to run the strided loop in (b) 4^ndim times, where ndim is the number of dimensions (2 in your case). I suspect that each run would take as much time as a non-strided one due to caching issues, so the whole thing could possibly run 4^ndim times slower than the serial one.

On the other hand, you should now be able to run the strided loop in parallel on each core of your CPU. This would not give you a linear speedup though, because the loop seems to be memory bound, and memory bandwidth does not scale as well as arithmetic throughput. For example, from my experience with SpMV, OpenMP runs twice as fast as serial code regardless of the number of cores used.

So in total I expect it to be slower than the serial code, but one can only be sure after some experimenting. That's why I would like to see some compilable, runnable, and measurable code.

ddemidov commented 10 years ago

Also, could you please enclose the code snippets with code markers as in

    int foo = 1;

This would increase readability and look like this:

int foo = 1;

skn123 commented 10 years ago

Will do so; I am not used to posting here.

skn123 commented 10 years ago

Why do you say the strided loop method will run slower? As I have stated, I segregate the indices on which I would like the pipeline to run, and since they are "mutually exclusive", without any race conditions in the innermost loops, I should see (some form of) amortized speedup. Of course, if I am using OpenMP, I will need to store the indices so that I have the index locations in memory. I put those loops there only for illustrative purposes; in the pseudocode I explicitly talk about passing an offset value indicating which set of indices will be examined at the current offset level.

Besides, if I OpenMP-fy the snippet, I can only OpenMP-fy each strided loop. The offset movement will still have to be serial, unless we have 16 copies of \delta and \omega, in which case we can also parallelize that and do a summation at the end.

Can you explain why you suggested that it would be slower? I would like to understand this myself.

ddemidov commented 10 years ago

When you do strided access to memory, the cache loads the whole cache line anyway, so a strided loop has to read the same amount of memory as a full one. And you would have to do that several times. A simple example:

#include <iostream>
#include <vector>
#include <boost/timer.hpp>

int main() {
    const size_t n = 16 * 1024 * 1024;

    std::vector<double> x(n, 1), y(n, 0);

    // Full loop
    {
        boost::timer t;
        for(size_t i = 0; i < n; ++i)
            y[i] += 42 * x[i];
        std::cout << "Full loop: " << t.elapsed() << std::endl;
    }

    // Strided loop
    {
        const size_t stride = 4;
        boost::timer t;
        for(size_t j = 0; j < stride; ++j)
            for(size_t i = j; i < n; i += stride)
                y[i] += 42 * x[i];
        std::cout << "Strided loop: " << t.elapsed() << std::endl;
    }
}

This, when compiled with g++ -o hello hello.cpp -O3, outputs:

Full loop: 0.037049
Strided loop: 0.118635

So the strided loop is 3.2 times slower than the full one.

skn123 commented 10 years ago

In that case, would it be prudent to increase the memory footprint (in this case, 16 \omega and \delta matrices) and also run them all in parallel? I guess this would be what you call coarse-grained parallelism. So the situation would be that the outer loop runs via OpenMP while the inner loop runs on the device! (I think that's not possible, but correct me anyway :) )

ddemidov commented 10 years ago

Memory is usually the scarcest resource on a GPU. Having to allocate 16 times more memory than necessary (in 3D, that's 64 times!) would greatly reduce your maximum problem size.

skn123 commented 10 years ago

Do you see any way around the caching problem? Sorry, I am not a computer science expert, so please excuse the dumb question. Let us assume the simplest case, where the data to be analyzed is available as a contiguous block in memory and is copied to the GPU before the outer loop begins.

ddemidov commented 10 years ago

Well, the only thing I can think of is using atomic operations for the accumulator arrays (delta and omega). That could be faster than either strided access or writing to several copies of the arrays, because either of those results in increased memory I/O. But I have not tried this.

What is your use case for the MBA algorithm? In my case I set it up once (on a CPU) and then do many interpolation operations (on a GPU). Thus, the setup cost is amortized and is not important anymore.

skn123 commented 10 years ago

It is mainly to do with scattered data interpolation where the scattering changes after a few intervals.

ddemidov commented 10 years ago

Did you measure the actual times spent in setup and interpolation?

skn123 commented 10 years ago

BTW, I did a check with your code snippet. The times you have shown match. However, if I interchange the loops, the times become equal (i.e., strided ~= serial).

ddemidov commented 10 years ago

Well, with the interchanged loops the access pattern is the same in both cases, so that is to be expected.

skn123 commented 10 years ago

So, can this be translated to the BA module?

ddemidov commented 10 years ago

I don't think so. When you interchange the loops, then writes to omega and delta are no longer independent.

skn123 commented 10 years ago

But I don't understand a fundamental aspect here. The writes to \omega and \delta are (going to be) done on the GPU, so won't there be an advantage there? In fact, the entire BA pipeline will be done on the GPU, but using strides. The example that you have shown illustrates the CPU side, and I understand that there will be caching problems there. However, the entire data set has been transferred to the GPU at the onset, and all \omega_i and \phi_i will be updated in parallel on the GPU (but within a stride).

ddemidov commented 10 years ago

Usually GPUs have about 6-10 times more memory bandwidth than CPUs. With strided access you will do 4^ndim times more memory I/O operations than necessary, so that could be counterproductive.

I think I'll try an implementation with atomic operations when I have more time.

skn123 commented 10 years ago

Meanwhile let me try this with OpenMP

skn123 commented 10 years ago

Here is the timing when using the current version of mba_benchmark:

E:\Binaries_MinGW\vexcl\examples>mba_benchmark.exe
size : 33554432

  1. Cayman (AMD Accelerated Parallel Processing)
  2. Intel(R) Xeon(R) CPU E5-1603 0 @ 2.80GHz (AMD Accelerated Parallel Processing)

surf(0.5, 0.5, 0.5) = -0.000214423

[Profile:          1105.916 sec.]
[  generate data:     3.840 sec.]
[  GPU:            1102.076 sec.]
[    setup:        1098.016 sec.]
[    interpolate:     3.530 sec.]

The data size = 16 * 2048 * 1024.

skn123 commented 10 years ago

So it would be interesting to see if the setup time can be reduced. I am working on the OpenMP implementation, but it will be different from your C++11-based code.

ddemidov commented 10 years ago

But that's the timing for VexCL's benchmark. I was more interested in timings for your own problem (I was secretly hoping that setup takes a negligible fraction of time in your use case).

skn123 commented 10 years ago

My problem is a little bit more involved. First of all, a couple of questions:

a.) In the original paper, we have the variables s, t, which I presume lie in [0, 1).
b.) If that is the case, then we see that they are computed from x and y as s = x_c - \floor(x_c). How would that be possible if we have integer grid points? I can get away with it for i, j (in \phi_{i,j}), but for s, t I would need something else. That is what I am trying to figure out. Any thoughts?
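(A note grounded in the SINTEF code quoted above: s and t are computed from the normalized coordinates uc and vc, not from the raw input values, so they can be fractional even for integer grid points. A condensed sketch of that mapping, with hypothetical parameter names:)

#include <cmath>

// From the quoted BAalg()/ijst(): the raw coordinate is first mapped onto the
// half-open lattice domain [0, m); s is the fractional part of the mapped
// value, so it lies in [0, 1) even when x_c itself is an integer, unless the
// normalization maps integers exactly onto lattice lines.
double fractional_part(double x_c, double u_min, double normalization)
{
    double uc = (x_c - u_min) * normalization;
    return uc - std::floor(uc); // this is s
}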