amazon-archives / amazon-dsstne

Deep Scalable Sparse Tensor Network Engine (DSSTNE) is an Amazon-developed library for building deep learning (DL) machine learning (ML) models.
Apache License 2.0

Evaluate new topK function #33

Closed: rybakov closed this issue 7 years ago

scottlegrand commented 8 years ago

???

yefanhust commented 8 years ago

Hi, I'm Fan from NVIDIA. I'll share our latest work on topk problem via this issue.

scottlegrand commented 8 years ago

Awesome! That is fantastic. What is the algorithm here?

yefanhust commented 8 years ago

I designed our own top-k algorithm based on a tournament comparison. From our tests on a batch of random arrays, the performance is 8x (batch=1024) to 11x (batch=128) better than the current dsstne implementation.

rybakov commented 8 years ago

Can you share the unit test which produces the comparison metrics?

yefanhust commented 8 years ago

Sure. Please wait a moment; there's something I have to confirm with my management first.

scottlegrand commented 8 years ago

Will this be source code or binary?

yefanhust commented 8 years ago

This is work I've been doing for our CUDA libraries roadmap, and it will likely be part of a future release. So it would probably not be appropriate to share the source code at this moment.

scottlegrand commented 8 years ago

Interesting, so is the Amazon code slower because of the bitonic sort, or because not enough threads are invoked? Also:

  1. Does this return both key and value?
  2. What is the highest K value you support?
yefanhust commented 8 years ago

I designed a different algorithm which makes it easier to saturate the bandwidth (we almost reached peak bandwidth). My original work did not include key-value pair support, but I extended and optimized my algorithm after I realized that this is required in dsstne. Now it does return both key and value. In general, we can support any reasonable k value (0 < k <= width). Since dsstne only requires k <= 128, I implemented a particular version for this case, but it shouldn't differ much from the general one in terms of performance (of course, a very large k would hurt performance anyway).

scottlegrand commented 8 years ago

So actually, we'd like k to be as high as 512. We know we need 256 ASAP, and 512 would provide breathing room. Did you characterize the performance bottleneck in DSSTNE?

yefanhust commented 8 years ago

Yes, we can do that. In fact, I first started this project without realizing that dsstne existed, so I didn't look too closely into dsstne's top k algorithm. I just plugged it into my tester to get performance numbers.

yefanhust commented 8 years ago

There is a difference between the dsstne code and mine, though. Let's say the input set resides in a contiguous piece of GPU memory of size batch * width. The output size in dsstne is batch * k. In my algorithm, I would like the output to be of size batch * width as well. The top k of each batch will be written to the first k locations of each batch, and the rest of the memory will be used as working memory. Or would you prefer a solution where you allocate temporary memory of size batch * width and I write to an output of size batch * k? There should be no performance difference between these two options.

scottlegrand commented 8 years ago

Width can be pretty big. We've run networks with 10 billion+ parameters across multiple GPUs. Therefore I would guess the former, but it was nice to be non-destructive to the outputs without allocating temporary memory. Memory is at a premium for inference, where you can fan it out to many cheap processors with small memory footprints.

yefanhust commented 8 years ago

Good. The former solution of keeping an output of size batch x width in GPU memory is our current setting. And yes it's non-destructive to the outputs.

scottlegrand commented 8 years ago

So you need an extra buffer of batch * width for this? That's potentially sub-optimal. That would incline me to investigate the bottleneck in my own code and address that instead.

yefanhust commented 8 years ago

No. The whole GPU memory I need is of size batch x width; there is no need for an extra buffer. This memory serves as both the working memory and the output. As I said before, the top k elements of each batch will be written into the first k locations of each batch.

yefanhust commented 8 years ago

To be clear, the entire memory usage is batch x width for the input + batch x width for the output. That's all.

yefanhust commented 8 years ago

If width=1M and batch=128, then the memory usage is only ~0.5G; twice that is ~1G, which is affordable. But this won't be the case if batch=1024. In fact, I can do better with my algorithm. If I say I will reduce the output size to batch * 2 * k, would you still think it's too costly? What's more, if you really care about the memory footprint, can I touch the input GPU memory and use it as working memory? That may change its contents, but if you don't need it afterwards then it should be fine.

2016-06-03 7:36 GMT-07:00 scottlegrand notifications@github.com:

So that's an issue. I haven't looked at my code lately, but I believe the extra memory required is only batch * x. Width can be >1M per GPU. So again, I end up inclined to improve the performance of the existing code rather than incur the extra memory cost if possible. Since the current top K uses an in-register bitonic merge sort, I suspect it's not hard to drop something better in its place. A low memory footprint is a primary design goal of this framework. That said, I suspect your tournament sort in shared memory might be a real winner here.


scottlegrand commented 8 years ago

That's the thing. There is no input GPU memory for sparse data. Look at the code. Direct sparse matrix multiply from the indices themselves. Width can be over 1M. 1M * 1024 * 4 is big. Too big. 2 * K however is awesome.

Also, consider that you need a way to do top K of top n * K where n is the number of GPUs in use.

yefanhust commented 8 years ago

I thought your sparse data is already in the output array. I can offer you a solution that uses the input array as working memory, which changes the content of the input array. Performance guaranteed for this solution. Alternatively, I can leave the input array intact, and only use ~batch * 2k (less than 5MB for k=512,batch=1024) as working memory. Performance guaranteed for most cases, but not for unfriendly input.

scottlegrand commented 8 years ago

Yes, the data is in the output array. I have no problem with the output array getting corrupted.

However, I have a problem with significant additional memory overhead. And even if the input into the network existed as a batch * width array, not all networks are autoencoders and we cannot enforce that as a constraint.

Any replacement for the existing code needs to come with a distributed version or it's not a sufficient replacement. In the current code, we perform topK on each segment of the output layer, merge them to the master process and then do a final top K of n * K pass.

Finally, the existing code uses a bitonic merge sort in register space. Is there an open-source full shared memory sort out there we could use instead? I can imagine that taking the sort down from O(n log n * log n) might address most of the issues here and remain OSS.

yefanhust commented 8 years ago

I'm a bit confused. My question was whether you are ok with the input array getting corrupted.

I can provide you with a library function that performs top k on a batch of arrays. Since you care about the memory footprint, I won't ask for additional memory overhead but will instead use the input array as my temporary buffer.

For now I can provide you with a better replacement for your kCalculateTopK_kernel. The rest of the distributed MPI code you already have.

In general, an efficient sort algorithm for N elements needs an O(N) temporary memory buffer. You can check out NVIDIA's cub library. In fact, for small sorts, like sorting 256 elements in your case, bitonic merge sort is a good candidate on the GPU regardless of its complexity.

scottlegrand commented 8 years ago

We don't need to sort N. Sorting 2 * K is more than enough. And with each sort, saving the top K and then only accepting elements smaller than element K ends up mostly running once over the entire list. I am confused as to why your approach is so much faster if bitonic merge sort is a good choice here. It's either a bad sort or I am just not using enough threads (32 per sort). If it's not the sort itself, there's probably a fairly easy fix in the existing code to close most of the gap. That's why I was asking if you had profiled the existing Top K code in any way.


yefanhust commented 8 years ago

Bitonic merge sort is not bad when there's an optimized implementation. What's more, I believe we have a better way of approaching the top k problem, and that's where the speed comes from.

scottlegrand commented 8 years ago

So why is your method better? How can you beat an O(width) + m * O(K log K * log K) approach, where m is the number of times I call the sort, unless either the sort is inefficient or we're calling it too often?

Or I'm just not invoking enough threads (easy fix)...

This is why I keep asking you to profile the thing. I'd do it myself, but my free time for DSSTNE is limited.

skyw commented 7 years ago

@scottlegrand Are you gonna revisit this? See if there is an "easy fix"?

slegrandA9 commented 7 years ago

There's nothing broken here. NVIDIA has potentially faster code, but they never gave it to us. That was over a year ago. Closing.