bryancatanzaro / inplace

CUDA and OpenMP implementations of C2R/R2C inplace transposition
GNU General Public License v2.0

Change your API #2

Open vmarkovtsev opened 7 years ago

vmarkovtsev commented 7 years ago

Guys, somebody must tell you the truth...

First of all, I am the author of srcd/kmcuda, before that I wrote kernels in samsung/veles, and I am heavily engaged in CUDA development. I really think that your library is unique and brilliant in terms of performance and hardcore CUDA work, but it suffers from a disease common to many CUDA libraries.

The thing is, your API is poorly designed. Let's start right from transpose.h.

#include "introspect.h"
#include "index.h"
#include "equations.h"

None of these includes is really needed if the user simply wants to transpose a matrix. Create a separate umbrella header and keep transpose.h focused on its job.

What is c2r? What is r2c? To the user, transposition is transposition; there are no variants! And there is not a single comment explaining what these are for. OK, never mind the namespaces, let's go on.

void transpose(bool row_major, float* data, int m, int n);

row_major? What's it needed for?

if (!row_major) {
    std::swap(m, n);
}

Please remove this the hell out and let the user decide whether to do the swapping of... wait... What are m and n?! Is m the number of rows and n the number of columns? Or vice versa? There are no comments in the code, so I have to guess, and row_major makes the guessing especially fun. OK, let's go on.

The type of m and n is int, i.e. good ol' 32-bit signed on sane platforms. This means I cannot transpose a matrix of more than 2G rows, e.g. 2G rows by 2 columns (16 GB of floats). A Titan X has 12 GB and the fresh P40 has 24 GB, so this size limit is actually reachable and becomes the real bottleneck. OK, let's go on.

Let's look inside transpose_fn().

int* temp_int;
cudaMalloc(&temp_int, sizeof(int) * m);
detail::scatter_permute(detail::c2r::scatter_postpermuter(m, n, c), m, n, data, temp_int);
cudaFree(temp_int);

This is a memory allocation in a library that names itself "inplace". I do understand that it may be impossible to live without it, but you must state such things very clearly everywhere you can. Otherwise a user can fill the entire 12 GB with a giant matrix and then wonder why "inplace" transposition fails with an out-of-memory error. The buffer should be part of the very signature of transpose(), so that an existing buffer can be reused in memory-constrained programs (a common situation with CUDA). Internal allocations also prevent your code from running in streaming mode across multiple GPU cards.

The allocation error is not checked (nor are the rest of the errors), and the function returns void. If something goes wrong (a bug inside an internal kernel?), the user is left to guess.

Finally, we see a leftover debug artifact:

//std::cout << "Doing C2R transpose of " << m << ", " << n << std::endl;

Your function should be able to log its progress. That is why one usually adds a verbosity parameter to the function's signature and prints to std::cout only after checking that parameter.

Summary

Ideal transpose.h:

#pragma once
#include <cstdint>

enum InplaceTranspositionStatus {
  inplaceTranspositionStatusSuccess,
  inplaceTranspositionStatusInvalidArgument,
  inplaceTranspositionStatusAllocationError,
  inplaceTranspositionStatusInternalError
};

enum MatrixStorageKind {
  matrixStorageKindRowMajor,
  matrixStorageKindColumnMajor
};

InplaceTranspositionStatus inplace_transpose(
   MatrixStorageKind input_storage, uint32_t rows, uint32_t cols, int verbosity,
   float *data, void *buffer);

and the implementation:

if (verbosity > 1) {
  std::cout << "Doing C2R transpose of " << rows << ", " << cols << std::endl;
}
if (data == nullptr) {
  if (verbosity > 0) {
    std::cout << "blah-blah nullptr data" << std::endl;
  }
  return inplaceTranspositionStatusInvalidArgument;
}
if ((cols < 32 && input_storage == ...) || (rows < 32 && input_storage == ...)) {
  return skinny_transpose(...);
}
bool allocated = false;
if (buffer == nullptr) {
  cudaError_t status = cudaMalloc(&buffer, sizeof(uint32_t) * rows);
  if (status != cudaSuccess) {
     if (verbosity > 0) {
       std::cout << "blah-blah CUDA allocation failure with size "
                 << sizeof(uint32_t) * rows << ": "
                 << cudaGetErrorString(status) << std::endl;
     }
     return inplaceTranspositionStatusAllocationError;
  }
  allocated = true;
}
// do stuff
if (allocated) {
  cudaFree(buffer);
}
return inplaceTranspositionStatusSuccess;

bryancatanzaro commented 7 years ago

Hi Vadim - Thanks for the constructive feedback. I agree with your criticism. This is research-quality code intended to accompany our PPoPP 2014 paper detailing the new algorithm it implements. I've never bothered to turn it into real software that could be incorporated into other projects, because the intent of open-sourcing it was primarily to enable scientific reproducibility.

However, I'm open to your suggestions. Could you make a pull request that implements these proposed changes?

vmarkovtsev commented 7 years ago

Sure, I will do my best to finish it this week.