arrayfire / arrayfire-rust

Rust wrapper for ArrayFire
BSD 3-Clause "New" or "Revised" License

Add code examples #52

Open 9prady9 opened 8 years ago

polarathene commented 7 years ago

Could an example or two be added for those new to programming for GPUs? I've done a bit of OpenCL work adding an algorithm to the Hashcat project, but that gave more freedom compared to ArrayFire, which seems to restrict you to working with arrays?

// FNV-1a (32-bit) hashing algorithm
fn main() {
    let key = [0x68, 0x65, 0x6c, 0x6c, 0x6f]; // "hello"
    let mut hash: u32 = 2166136261; // Hash init value defined by algorithm

    for byte in key.iter() {
        hash ^= *byte as u32;
        hash = hash.wrapping_mul(16777619); // Another magic number defined by the algorithm; wrapping avoids a debug-build overflow panic
    }

    println!("Hash is: {:x}", hash); // 4f9f2cab
}

Really simple code: take a string key as bytes, XOR each into the hash, and multiply by a fixed value iteratively. With a large set of keys, how do you approach this in ArrayFire? Do you treat each key as a 1D array (column or row?) with rows of keys? If so, does this mean I can only work on fixed string key lengths? Would hash be used as a constant to fill another array for the operations?

This example makes me think that I should have an array for each letter position (5 in this case, plus 1 for the hash)? I'm also curious about setting up the arrays and data optimally; no idea if the 3rd/4th dimensions are helpful for this, or what row/column sizes perform best. My actual algorithm does about 32 billion hashes a second with OpenCL and Hashcat, and I'm wanting to see how ArrayFire compares.
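A plain-Rust sketch (not ArrayFire code; the names and layout are my own illustration) of the array-per-character-position idea described above. Each pass of the outer loop corresponds to one element-wise XOR + multiply over the whole batch, which is the shape an ArrayFire port would take:

```rust
// Hypothetical sketch (plain Rust, not ArrayFire) of the array-per-position
// layout: position_arrays[j][i] holds byte j of key i. Each outer iteration
// is one element-wise pass over the whole batch of keys.
fn fnv1a_batch(position_arrays: &[Vec<u8>]) -> Vec<u32> {
    let n_keys = position_arrays[0].len();
    // One running hash per key, all starting from the FNV-1a offset basis.
    let mut hashes = vec![2166136261u32; n_keys];
    for column in position_arrays {
        for (h, &byte) in hashes.iter_mut().zip(column) {
            *h ^= byte as u32;
            *h = h.wrapping_mul(16777619);
        }
    }
    hashes
}

fn main() {
    // Two fixed-length keys, "hello" and "world", stored column-wise.
    let keys: [&[u8; 5]; 2] = [b"hello", b"world"];
    let position_arrays: Vec<Vec<u8>> =
        (0..5).map(|j| keys.iter().map(|k| k[j]).collect()).collect();

    let hashes = fnv1a_batch(&position_arrays);
    assert_eq!(hashes[0], 0x4f9f2cab); // FNV-1a 32-bit of "hello"
    println!("{:x?}", hashes);
}
```

Note this only works for fixed-length keys; variable lengths would need padding plus a length array.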


I have a slightly more complicated hash algorithm to implement which starts with a while(length >= 24) operating on 24 bytes of the string key at a time for 3 64-bit unsigned integers. After that it handles the remaining <23 bytes a little differently. Both stages end with a call to an expensive function involving the 3 64-bit values being "mixed" with subtraction, XORing and bitshifting(both directions).

I can share the C/OpenCL/Rust implementations I've done, if it'd be worthwhile to show how a fairly straightforward algorithm with loops, conditionals, and arithmetic diffs/ports to ArrayFire?

polarathene commented 7 years ago

I've had a try at porting the simple algorithm, wasn't too difficult with the one fixed key. Tried to add some dynamic strings and ran into memory issues somewhere between 625 and 3125 elements spread over 5 arrays. Code and discussion can be found on this Rust reddit post.

Any tips on what I'm doing wrong and could try? Or would ArrayFire not be the appropriate choice for this type of work on the GPU? I'm thinking I need to do the permutations via ArrayFire in GPU memory via some batching/loop?

9prady9 commented 7 years ago

I think the variable hashes of type Array is incorrect. You are passing in a slice of one value and passing in dims which seems to be a vector.


9prady9 commented 7 years ago

A small clarification, I was referring to the variable hashes from the code stub you posted on Reddit.


polarathene commented 7 years ago
let dims = Dim4::new(&[1, v0.len() as u64, 1, 1]);

// FNV32-1a defined starting hash value
let hash: [u32;1] = [2166136261];

// Not sure but I think I need to fill Array of the same shape for these two to compute with
let mut hashes  = Array::new(&hash,    dims);

dims which seems to be a vector.

From what I've read, I thought I could only pass in slices/arrays, not vectors? dims is defined like in any other example; I am taking the length of a vector so it matches the length of each vector's as_slice() output.

I think the variable hashes of type Array is incorrect. You are passing in a slice of one value

Good catch :) I actually meant to populate the entire array with that value and forgot to update it. A bit odd that it was happy to take 625-length slices everywhere else and didn't throw an error despite the 1 element here? I'll fix this and get back to you.

Can you advise whether it's a bad idea to generate the permutations of string keys on the host side; should I be handling this in ArrayFire instead somehow? 36^4, for example, is roughly 1.7 million permutations, and I'm wanting to target around 36^10+.

9prady9 commented 7 years ago

@polarathene My bad; what I meant was that the dims seem to indicate the data has a vector shape (not a Rust vector type) - I have corrected the earlier message to reflect this.

The data pointed to by the slices is forwarded to the C-API as a pointer. So if you say the dims are of such-and-such length but actually pass in a slice that has a single element, it is going to cause a segmentation fault.

If you intend to fill an Array of the shape specified by dims with a single value, then you should use the constant function.

The common rule of thumb is to keep the data on the GPU as much as possible, to reduce the cost of data transfers between the host and the GPU. So if you can generate the permutations on the GPU, that will eliminate the cost of the initial transfer of permutations to GPU memory. ArrayFire doesn't have any native functions that can permute data as of now. You may be able to implement it on top of ArrayFire, although I can't tell you how that may be done, because I am not aware of the algorithms required to do so.
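One way to avoid the host-side transfer, assuming all keys share one charset and length, is to derive the i-th permutation directly from its index (a mixed-radix decode). Here is a hypothetical plain-Rust sketch, not an ArrayFire API; since each output character depends only on (index, position), the same divisions and modulos map to element-wise GPU operations over a whole range of indices:

```rust
// Hypothetical sketch: decode permutation number `index` of `charset` into a
// key of length `len`, counting in base charset.len() with the rightmost
// position varying fastest. No sequential state is needed, so on a GPU this
// becomes one element-wise div/mod/lookup per position over an index array.
fn permutation(charset: &[u8], len: u32, index: u64) -> Vec<u8> {
    let base = charset.len() as u64;
    (0..len)
        .map(|j| {
            let divisor = base.pow(len - 1 - j);
            charset[((index / divisor) % base) as usize]
        })
        .collect()
}

fn main() {
    let charset = b"abcd";
    assert_eq!(permutation(charset, 5, 0), b"aaaaa".to_vec());
    assert_eq!(permutation(charset, 5, 1), b"aaaab".to_vec());
    assert_eq!(permutation(charset, 5, 4), b"aaaba".to_vec());
    // 4^5 = 1024 total permutations for this charset and length.
    assert_eq!(permutation(charset, 5, 1023), b"ddddd".to_vec());
}
```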

polarathene commented 7 years ago

@9prady9 Is the approach of one array per character in the string correct? Or should I be doing it all in one array somehow?

The string data is just bytes/numbers; I should be able to iterate through those values with some loops, assuming I can tell AF to do so from a Rust loop?

How do I handle the hash comparison? After the calculations I'd want to compare all the computed values against another collection/array of hashes to look for matches. Should I have a 2-dimensional array filled with constants? E.g. with 625 elements, that would be 625 constant-filled hash values * the number of hashes, then iterate through those to get boolean (0/1) results.

The last problem is how to return that data from AF to Rust. Regardless of whether I can filter it out, AF expects me to provide a fixed-size array for it to store the data in? In this example, do I need to manually create 625 elements to make the compiler happy?

If there are good examples you know of already in the wild using Rust that show how these problems are tackled I'd be happy to look through them.

The example algorithm is very simple without AF, I'd love to know how to properly use it with AF at scale.

9prady9 commented 7 years ago

Having a single array per character is probably not an efficient way to do it. I would think of a way to handle a bunch of chars in one single array.

I didn't understand your confusion about loops. ArrayFire runs computation on GPUs, which work on data in parallel, thereby avoiding the need to iterate in loops. Maybe if you show a small piece of code that illustrates what you are trying to explain, I can understand better.

The Rust wrapper has functions such as eq that let you compare data in two different Arrays.

I am kind of confused as to what you mean by "ArrayFire expects a fixed size array"? That is not the case, unless I am failing to understand what you are trying to explain.

Sure, we are working on adding more examples.

polarathene commented 7 years ago

TL;DR: Given the first code example in Rust, how do I convert this simple algorithm to work in ArrayFire at 32 billion calculations a second or more? (That is the limit of the algorithm I'm actually trying to port, Bob Jenkins' hash aka lookup8.c/lookup2; this basic algorithm should easily surpass 32 billion/sec.)

  1. Port algorithm to basic AF example
  2. Handling of string permutations in large quantities, could possibly substitute with randu method as mock data
  3. Comparing calculations to another collection of thousands of hashes and returning the matches(string key and matching hash pair).

Having a single character for one array is probably not an efficient way to do it. I would think of a way to handle bunch of chars in one single array.

I might not have been clear with what I'm trying to communicate. I'm exhausting the keyspace of a given charset (in the example code posted at the reddit link, a, b, c, d was used due to the memory issue I was having; I'd intend to use a much larger one such as a-z, 0-9, which is 36 characters total). Secondly, the example used strings/permutations of length 5: 4^5 is 1024 permutations, while 36^5 is over 60 million. I was saying that each array would represent one index of that string (5 total); I took all the permutation strings and split them up into individual characters, one AF array per position. That makes the most logical sense to me, but perhaps isn't the right way to handle it? (It should be able to handle 32 billion hashes or more a second, like the OpenCL code can.)

I didn't understand your confusion about loops. ArrayFire runs computation on GPUs which work on data in parallel, thereby avoiding the need to iterate in loops.

I understand that, but it's not clear to me how to translate code that uses a loop/iterations to ArrayFire for GPU parallelization.

May be if you show a small code that illustrates what you are trying to explain, i can understand better.

I did earlier, here it is again:

// FNV-1a (32-bit) hashing algorithm
fn main() {
    let key = [0x68, 0x65, 0x6c, 0x6c, 0x6f]; // "hello"
    let mut hash: u32 = 2166136261; // Hash init value defined by algorithm

    for byte in key.iter() {
        hash ^= *byte as u32;
        hash = hash.wrapping_mul(16777619); // Another magic number defined by the algorithm; wrapping avoids a debug-build overflow panic
    }

    println!("Hash is: {:x}", hash); // 4f9f2cab
}

That's it; it's very simple. It gets more complicated when I want to process much more than a single string with AF, and when comparing whether a calculated hash matches any of the ones I am searching against (not shown). Both tasks are easy to do outside of ArrayFire; I would like an example that shows how to scale this code and benefit from ArrayFire on the GPU instead of writing an OpenCL kernel myself.

All this code does is take string input as bytes and iterate through it: starting from a fixed hash value, XOR in each byte, then multiply by another fixed value defined by the algorithm. After the loop is complete, the hash would be compared against an array or dictionary/hashmap of known hashes for a match (unlike this example, the strings are randomly generated, and I am hashing them until I find the string key that produced a given hash).

Rust wrapper has functions such as eq that lets you compare data in two different Arrays.

Yes, I used this in the Reddit example... I will share it on this issue as well. We discussed that the memory error might have been due to hashes being a single value in an AF array instead of constant(), which it should be, as it's the array all calculations are done on from a base value. It is then compared against hash_list, where constant() may not be right? In a real scenario, hash_list would have 4,000 or more hashes to compare against every element of the results; do I make 4k constant arrays, or can I handle this better with one array containing all 4k hashes?

extern crate arrayfire as af;
use af::*;

extern crate permutate;
use permutate::Permutator;

fn main() {
    // Defining the keyspace, keys of 5 characters long, with an unfortunately small charset?
    // Perhaps I should skip the crate and find a way to generate these through ArrayFire?
    let alphabet = ["a","b","c","d"];//,"e","f","g","h"];
    let lists = [
        &alphabet[..],
        &alphabet[..],
        &alphabet[..],
        &alphabet[..],
        &alphabet[..],
    ];
    let permutator = Permutator::new(&lists[..]);

    let mut v0: Vec<u8> = Vec::new();
    let mut v1: Vec<u8> = Vec::new();
    let mut v2: Vec<u8> = Vec::new();
    let mut v3: Vec<u8> = Vec::new();
    let mut v4: Vec<u8> = Vec::new();

    for permutation in permutator {
        let x = permutation.concat();
        // Not sure if I'm structuring the data out right here, or if the next two lines are the right way to approach it?
        let mut y: [u8;5] = [0,0,0,0,0];
        y.copy_from_slice(x.as_bytes());
        v0.push(y[0]);
        v1.push(y[1]);
        v2.push(y[2]);
        v3.push(y[3]);
        v4.push(y[4]);
    }

    // The way I understand it, for each letter in a string to iterate I should have separate Arrays to calculate in parallel?
    let dims = Dim4::new(&[1, v0.len() as u64, 1, 1]);
    let af0 = Array::new(v0.as_slice(), dims);
    let af1 = Array::new(v1.as_slice(), dims);
    let af2 = Array::new(v2.as_slice(), dims);
    let af3 = Array::new(v3.as_slice(), dims);
    let af4 = Array::new(v4.as_slice(), dims);
    let key = vec![af0, af1, af2, af3, af4];

    // FNV32-1a defined starting hash value, filled across the whole shape.
    // (An earlier version passed a 1-element slice to Array::new with these
    //  dims, which can segfault; constant() is the right tool here.)
    let mut hashes = constant(2166136261u32, dims);
    let hashmul = constant(16777619u32, dims);

    // Fold each character column into the running hashes (element-wise over
    // the whole batch). Helper names like bitxor/mul/cast are from the
    // arrayfire crate; exact signatures may differ between versions.
    for af_chars in key.iter() {
        hashes = bitxor(&hashes, &af_chars.cast::<u32>(), false);
        hashes = mul(&hashes, &hashmul, false);
    }

    // The expected hash for "hello"; note it won't match with the current charset.
    // I intend to use a large array of hashes to compare calculations against,
    // not quite sure how to approach that (iterate through a 2D array of constants?)
    let hash_list = constant(0x4f9f2cabu32, dims);

    // Returns an array of 0's & 1's based on matches.
    let result = eq(&hashes, &hash_list, false);
    // host() takes a mutable slice, so a Vec of the right length works;
    // no need to write out a fixed-size array of 625 zeros by hand.
    let mut r_data = vec![0u32; 625];
    result.host(&mut r_data);
    println!("Hash is: {:?}", r_data);
}

I am kind of confused as to what you mean by ArrayFire expects a fixed size array ?

AF does not need a fixed-size array to do its work; it's great with slices. My problem is with the host method on an AF Array (see the bottom of the code example): r_data holds the results in a native Rust collection afterwards. Can I convert to a Vec or slice instead?

polarathene commented 7 years ago

A question that could be included in a docs FAQ (not sure if that's the Rust API docs), for developers wanting to adopt ArrayFire who are not familiar with OpenCL/CUDA or programming with parallelism in mind.

When processing large datasets where you could process millions/billions at a time (a workload large enough to keep the hardware fully busy), if the algorithm permitted working on 8 individual u8 values (bytes) vs 1 single u64 value, would it make a difference?

I know that on a smaller dataset it would allow for more parallelism, but if that's not an issue due to the large size, perhaps there is no advantage, or it would perform better on u64 values. As the end of the dataset is reached, switching to individual bytes might at some point keep the speed/performance up where it would otherwise decrease? I think I read somewhere about my GPU effectively having 32 threads?


Example from existing OpenCL algorithm

I am looking at porting an OpenCL algorithm I wrote over to ArrayFire. The original C algorithm was for the CPU and processed each byte individually; I noticed that it effectively worked in chunks of 3 u64 values at a time and altered the algorithm, assuming that was an optimization. Each string is processed 24 bytes at a time, with the remaining <24 bytes handled a little differently.

Here are some snippets of calculations that are done:

// The function that does this calculation has 4 repetitions, but the integers at the end change.
// Each line and repetition depends on the prior calculations; I don't think this can be improved.
abc.x ^= (abc.x -= abc.y, abc.x -= abc.z, abc.z >> 43);
abc.y ^= (abc.y -= abc.x, abc.y -= abc.z, abc.x << 9);
abc.z ^= (abc.z -= abc.x, abc.z -= abc.y, abc.y >> 8);

// This is the same without the vector usage and a slightly different order; as far as I know they're effectively the same, a/b/c are equivalent to x/y/z above
a -= b; a -= c; a ^= c >> 43;
b -= a; b -= c; b ^= a << 9;
c -= a; c -= b; c ^= b >> 8;

// Another way to express it that looks more readable; my notes say it did not perform as well.
// Perhaps ArrayFire would allow this readable version but still optimize it to perform better
// than my poor understanding of coding OpenCL for performance.
__constant u32 BIT_SHIFT[12] = {
    43, 9, 8,
    38, 23, 5,
    35, 49, 11,
    12, 18, 22
};
for (int i = 0; i < 12; i +=3)
{
     a = (a-(b + c)) ^ (c >> BIT_SHIFT[i    ]);
     b = (b-(a + c)) ^ (a << BIT_SHIFT[i + 1]);
     c = (c-(a + b)) ^ (b >> BIT_SHIFT[i + 2]);
};

The abc vector or non-vector a/b/c values were created from 24 bytes like this:

// The program (hashcat) that I wrote this for broke string data into 32-bit values (4-letter chunks).
// With ArrayFire I have a bit more control and could use either u8 values with the correct offset
// (bitshift) or u64 values without combining two 32-bit words like below. No idea if it matters for
// parallelism when the number of strings to process is already high, so u8 values may not benefit vs u64?

// w[] is an array of 32-bit values, W64 is a vector of 3 64-bit values
W64.x = w[offset    ] + (w[offset + 1] << 32);
W64.y = w[offset + 2] + (w[offset + 3] << 32);
W64.z = w[offset + 4] + (w[offset + 5] << 32); 
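Under a little-endian assumption, the same packing can be sketched in plain Rust (a hypothetical helper, not from the original code); folding 24 bytes into three u64 values is equivalent to combining pairs of little-endian 32-bit words with w[n] + (w[n + 1] << 32) as above:

```rust
// Hypothetical sketch: pack a 24-byte little-endian block into three u64
// values, matching the w[n] + (w[n + 1] << 32) combination in the OpenCL
// snippet when the w[] words are themselves little-endian.
fn pack_three_u64(block: &[u8; 24]) -> (u64, u64, u64) {
    let word = |i: usize| {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&block[i * 8..i * 8 + 8]);
        u64::from_le_bytes(bytes)
    };
    (word(0), word(1), word(2))
}

fn main() {
    let mut block = [0u8; 24];
    block[0] = 0x01; // lowest byte of x
    block[8] = 0x02; // lowest byte of y
    block[23] = 0x03; // highest byte of z
    let (x, y, z) = pack_three_u64(&block);
    assert_eq!(x, 0x01);
    assert_eq!(y, 0x02);
    assert_eq!(z, 0x03u64 << 56);
}
```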

The final <24 bytes were handled with a conditional:

// Last <24 bytes were assigned like above then added, original was a much larger switch statement with fallthrough and lots of bitshifting
abc.z += pw_len; // The first byte encodes the length of the original string
abc.z += (len > 16) ? (W64.z <<  8) : 0;
abc.y += (len >  8) ? W64.y         : 0;
abc.x += (len >  1) ? W64.x         : 0;

I think my OpenCL port and improvements should port over well to ArrayFire, looking quite similar; I'm not sure what the x/y/z vector would look like. Input would be rows (strings) with columns of bytes (the string the row represents). The computation on those bytes (x/y/z or a/b/c in the snippets) would be a separate array, I guess, where 3 columns, one per component, might make sense; only the last one (z/c) is required as the final result.
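The readable loop version above ports to Rust with wrapping arithmetic standing in for OpenCL's modular u64 operations. A hypothetical sketch; in a batched ArrayFire port, a/b/c would each be an Array holding one value per input string, and every line would become an element-wise operation:

```rust
// Hypothetical plain-Rust port of the readable mix loop quoted above,
// using wrapping ops for the modular behaviour OpenCL u64 arithmetic has
// implicitly. Each step is invertible, so the whole mix is a bijection on
// (a, b, c): distinct inputs always produce distinct outputs.
const BIT_SHIFT: [u32; 12] = [
    43, 9, 8,
    38, 23, 5,
    35, 49, 11,
    12, 18, 22,
];

fn mix(mut a: u64, mut b: u64, mut c: u64) -> (u64, u64, u64) {
    for i in (0..12).step_by(3) {
        a = a.wrapping_sub(b.wrapping_add(c)) ^ (c >> BIT_SHIFT[i]);
        b = b.wrapping_sub(a.wrapping_add(c)) ^ (a << BIT_SHIFT[i + 1]);
        c = c.wrapping_sub(a.wrapping_add(b)) ^ (b >> BIT_SHIFT[i + 2]);
    }
    (a, b, c)
}

fn main() {
    // Deterministic, and distinct inputs diverge (the mix is invertible).
    assert_eq!(mix(1, 2, 3), mix(1, 2, 3));
    assert_ne!(mix(1, 2, 3), mix(1, 2, 4));
    println!("{:x?}", mix(1, 2, 3));
}
```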

polarathene commented 7 years ago

FAQ questions:

  1. I noticed I can do array_a = array_a * 2, but some operators like shift fail: array_a = array_b << 8 requires using the shiftl() method. Show a table of supported operators and whether they allow arithmetic against numbers or only AF Arrays.

  2. Does batched being true or false matter when supplying a number; does it make a difference?

  3. array_a += 42 vs array_a = array_a + 42: does the JIT compiler optimize for this, or will the GPU allocate memory for a new array in the 2nd example instead of updating the values in place as the 1st presumably does (does it?).

  4. Is it possible to index elements in an array and update them with arithmetic, like native Rust arrays and the C++ version of AF support? Current AF index methods return a new array of what was indexed; updating those values does not update the original array values.

    • Answer?: The index used to extract values might be able to reapply them with assign_gen(); not sure how this works if you take the 2nd and 4th columns from a 5-column array (extracts two columns?) and merge back. Still possibly taking up memory with a new array?

  5. Profiling an ArrayFire application on the GPU:
    • CUDA backend -> NVidia Visual Profiler (NVVP) tool
    • OpenCL? (AMD profiling tools maybe? but these might only be for AMD?)
    • CPU, like a traditional CPU program?

  6. A general rule of thumb is NOT to use indexing (index_gen, lookup, locate, etc.) to access individual elements from GPU memory. These are very expensive operations and should be avoided.

    • These should be listed in a table somewhere. I ran into a concerning issue where it seems using locate() too frequently was the main cause, as mentioned here and upstream.

  7. How to get a true/false result for an if condition on the host/CPU side instead of an Array result.

    • Answer? Use a method like any_true_all or all_true_all; these provide a tuple of f64 values (0 or 1) and apply to all dims of the array. The non-_all variants of these methods let you specify a dim but return an array of true/false values (where elements can be > 1, even for all_true?); this works differently from the _all methods, and I think you need to run an _all method over the non-_all result if you want to check a specific dim.

9prady9 commented 3 years ago

I am sorry, it has been quite some time since the last comment was made here. Here are some updates w.r.t. the clarifications requested.

I noticed I can do array_a = array_a * 2 but some operators like shift fail array_a = array_b << 8 requiring the shiftl() method to use. Show table of supported operators and if they allow arithmetic against numbers or AF Arrays only.

That is because the Shl traits aren't implemented for integral types, which they should be. I have created a separate issue for this: https://github.com/arrayfire/arrayfire-rust/issues/232

Does batched being true or false matters when supplying a number, does it make a difference?

Created a separate issue for ^ https://github.com/arrayfire/arrayfire-rust/issues/233

array_a += 42 vs array_a = array_a + 42, does the JIT compiler optimize for this, or will the GPU array allocate memory for a new array on the 2nd example instead of updating the values like the 1st example presumably does(does it?).

No; none of the JIT operations allocate memory until their time of evaluation comes.

Is it possible to index elements in an array and update them with arithmetic like native Rust arrays and the C++ version of AF supports? Current AF index methods return a new array of what was indexed, updating those values do not update the original array values. Answer?: The index used to extract values might be able to reapply them with assign_gen(), not sure how this works if you take 2nd and 4th column from a 5 column array(extracts to two columns?) and merge back. Still possibly taking up memory with new array?

C++ syntax is uniquely positioned to give the appearance that individual elements are being indexed, whereas in the background even a single indexed element is still an arrayfire::array. Rust's Index and IndexMut traits aren't suited for intuitive indexing operations using sequences and Arrays, so I am using macros to provide some syntactic sugar for indexing. There is active work going on (as updated on https://github.com/arrayfire/arrayfire-rust/issues/197) to improve the syntax for indexing operations. Fetching a view of an existing Array can be done via the view! macro in the master branch; assignment operations are a work in progress.

Profiling an ArrayFire application on the GPU:
  • CUDA backend -> NVidia Visual Profiler (NVVP) tool
  • OpenCL? (AMD profiling tools maybe? but these might only be for AMD?)
  • CPU, like a traditional CPU program?

Yes, a CUDA program's kernels can be profiled using nvprof, Nsight Compute, etc. Unfortunately, the situation is not as good for OpenCL, as old AMD GPUs are not supported in CodeXL. Traditional tools can be used to profile the CPU backend; to accurately measure each function's timing, one can run the program with the environment variable AF_SYNCHRONOUS_CALLS set to 1 - this makes all ArrayFire API calls synchronous - a quick way to run a profiler on the CPU backend.
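For instance, a CPU-backend profiling run could look like this (the binary name and the choice of perf are placeholders, not from the original discussion):

```shell
# Make every ArrayFire call complete before returning, so per-function
# timings attribute correctly in the profiler.
# "./my_af_program" is a placeholder binary name.
AF_SYNCHRONOUS_CALLS=1 perf record -g ./my_af_program
perf report
```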

General thumb rule is NOT to use indexing(index_gen, lookup, locate etc. ) to access individual elements from GPU memory. They are very expensive operations and should be avoided. These should be listed in a table somewhere. I ran into a concerning issue where it seems using locate() too frequently is the main cause, as mentioned here and upstream.

There has been a recent addition to the Rust wrapper documentation - a tutorial book. It is not comprehensive, but we keep updating it with new info. The indexing tutorial mentions, in a note, the performance issues associated with individual element access.

How to get a true/false result for if condition on the host/CPU side instead of an Array result.

As you have found out, the reduction functions without the _all suffix run only along a given dimension, so the result of such an operation is still an Array, and the output returned is an Array. The reduction functions with the _all suffix don't take a dimension, as they reduce the entire data regardless of the Array's shape, so the output returned is always a scalar. I think the intent of the output types is clear and intuitive.

The returned scalar is f64 for a different reason, related to overflow issues with certain operations and types. To convert that result to a boolean on the Rust side, compare it with zero (e.g. result != 0.0); that is safe as long as the f64 value being converted is the result of a boolean reduction operation.

Apart from the book and other doc updates, we have also created a separate repo with information on API comparisons between ArrayFire and other libraries, API call comparisons across ArrayFire's wrappers, etc.: https://github.com/arrayfire/arrayfire-api-cheat-sheet - please check that out as well.

polarathene commented 3 years ago

@9prady9 awesome, thanks for taking the time to respond :)

I don't know when I'll have the time, but I'd still like to port the CPU Rust code I have for the FNV1a-32 hashing function with the permutation generator over to ArrayFire, with a good way to check the GPU computations for matches ("hello" input bytes hashing to 0x4f9f2cab, checked for equality against the same 0x4f9f2cab provided as a variable or array value before the computation, to search for the original string(s)).

Example to port (Rust Playground):

// Purpose, given the target hash value 0x4f9f2cab, find an input that creates it.
fn main() {
    const TARGET_KEY: u32 = 0x4f9f2cab;

    // Some loop logic, that iterates through permutations/inputs, until a
    // hash result matches the one a value is being searched for
    let not_matched = hash_algorithm(vec![0xca, 0xfe, 0xba, 0xbe]);
    let matched = hash_algorithm(permutator());

    assert_ne!(not_matched, TARGET_KEY); // No match keep going
    assert_eq!(matched, TARGET_KEY); // A match to the target is found, stop

    // At this point, the input bytes or the "hello" string would need to be
    // matched to the computed hash, as we're interested in knowing that
    // "hello" (or its bytes) is the underlying value for the hash 0x4f9f2cab.
    // If batch processing an array of inputs to compute hashes, then an index
    // should be sufficient.

    // Inform the user about the matched result:
    println!("Matched hash: {:x} to value: {:?}", matched, b"hello");
}

// Generates permutations for a keyspace and length
// eg charset "abc" len 3 -> aaa, aab, aac, aba, abb, abc, acc, and so forth
// Actual permutator logic not below, example just returning "hello" as a permutation
fn permutator() -> Vec<u8> {
    // ['h' , 'e' , 'l' , 'l' , 'o' ]
    // [0x68, 0x65, 0x6c, 0x6c, 0x6f]
    // [104 , 101 , 108 , 108 , 111 ]
    let key: Vec<u8> = "hello".bytes().collect();
    key
}

// FNV-1a (32-bit) hashing algorithm
const FNV_OFFSET: u32 = 2166136261;
const FNV_PRIME: u32 = 16777619;
fn hash_algorithm(key: Vec<u8>) -> u32 {
    // Initialize hash with seed value
    let mut hash = FNV_OFFSET;

    // Iterate through bytes folding into a u32 value (hash)
    for byte in key.iter() {
        hash ^= *byte as u32;
        hash = hash.wrapping_mul(FNV_PRIME);
    }

    hash
}

Ignoring the permutator, the hash_algorithm is pretty simple for ArrayFire. If I remember right, I had trouble doing it efficiently on the GPU; the CPU performed better due to the data transfers needed to handle the conditional checks for whether a matching hash had been found, followed by identifying the input value.

Should serve as a good example of where bottlenecks can be hit if not approached right when porting for GPU compute, and how to handle such situations?

If I recall correctly, the GPU could compute through a large set of inputs (or generate the inputs within the GPU instead of transferring them) far faster than the CPU; the problem was needing to check for a match and, if one is found, stop processing and identify the input that created the target output value.

9prady9 commented 3 years ago

@polarathene The FNV-1a algorithm is a classic for-loop algorithm where each iteration's input is dependent on the previous iteration's output, unless there is a parallelized version that I am not aware of. I doubt FNV-1a can be parallelized.

polarathene commented 3 years ago

I doubt FNV-1a can be parallelized.

I take it you're referring to parallelization over a single input, rather than over a larger array of inputs, each processed iteratively as the algorithm requires.

let inputs = [
  "aaa",
  "aab",
  "aac",
  "aba",
  "abb",
  "abc",
  // ...
];

When I opened this issue years ago, I had several GB of inputs in GPU memory that I used ArrayFire to generate for a pass. It would then run each input through the algorithm individually (in parallel with one another), and then I could send the resulting hashes back to the CPU and check for any that match the one I am searching for, to learn which value created that hash.

I could then generate the next set of inputs for another pass through the hashing algorithm of choice and compare, until I found a match (or all matches, if searching for several hashes).

ArrayFire is capable of this; the slowness for me was the post-processing part. I can only recall performance suffering from having to do a device transfer of the results back for the CPU to check. My ArrayFire program handled compute on the GPU well, just not the conditional logic, if I remember right.
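The flow described above, hash a whole batch then check the results host-side, can be sketched in plain Rust (the helper names and the stand-in for the device computation are my own, not from the original code):

```rust
// Hypothetical sketch of the batch-then-check flow: hash every candidate in
// a batch (the part that would run on the GPU), then do the host-side
// post-processing as a membership test against the target hashes, recovering
// the matching input by its index.
use std::collections::HashSet;

fn fnv1a32(key: &[u8]) -> u32 {
    key.iter()
        .fold(2166136261u32, |h, &b| (h ^ b as u32).wrapping_mul(16777619))
}

fn main() {
    // One "pass" worth of candidate inputs; in the real program these would
    // be generated permutations living in GPU memory.
    let batch: Vec<&[u8]> = vec![b"aaaaa", b"hello", b"zzzzz"];

    // Stand-in for the device computation: hash every input in the batch.
    let hashes: Vec<u32> = batch.iter().map(|k| fnv1a32(k)).collect();

    // Host-side post-processing: check each result against the target set.
    let targets: HashSet<u32> = [0x4f9f2cab].into_iter().collect();
    let matches: Vec<(usize, u32)> = hashes
        .iter()
        .copied()
        .enumerate()
        .filter(|(_, h)| targets.contains(h))
        .collect();

    assert_eq!(matches, vec![(1, 0x4f9f2cab)]); // index 1 recovers b"hello"
    for (i, h) in &matches {
        println!("hash {:x} produced by input {:?}", h, batch[*i]);
    }
}
```

Using a HashSet makes the check O(1) per result even with thousands of target hashes, rather than one comparison array per target.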

9prady9 commented 3 years ago

That is true, I was referring to parallelization of the algorithm w.r.t single input stream of bytes.

Sure, hashing a batch of inputs is trivially parallel, no doubt about that. However, at the moment there is no efficient way to do an FNV-1a hash on even one input using ArrayFire. The reason I say that is that accessing individual elements from GPU memory via indexing is very inefficient, and it would have to be individual element access because of the data dependency across iterations.

Having said that, a custom kernel is a different approach altogether and can most likely be more efficient. If you are requesting hashing support via a function in upstream ArrayFire, then the request should be moved to an issue on the upstream repository.

I am curious as to how you managed to do FNV-1a on a single input byte stream using ArrayFire efficiently. Is there a chance you have your old code archived somewhere?

polarathene commented 3 years ago

I am curious as to how you managed to do FNV-1a on a single input byte stream using ArrayFire efficiently. Is there a chance you have your old code archived somewhere?

If I remember right, I believe I did something like take my charset ("abc" in this case) and create the permutations of the charset for a given length (keyspace), where each new byte/index is a column, and then I could run the loop iterating over the columns.
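For what it's worth, that kind of layout can be sketched on the CPU. This is a hypothetical helper (the name and shape are illustrative, not my original code) that writes every fixed-length permutation of a charset into one flat buffer in column-major order, so input i occupies buffer[i*len..(i+1)*len], i.e. one column per input under an ArrayFire Dim4 of [len, count, 1, 1]:

```rust
// Hypothetical sketch: generate every fixed-length string over a charset
// directly into one flat column-major buffer (one column per input).
fn permutations_column_major(charset: &[u8], len: u32) -> Vec<u8> {
    let n = charset.len().pow(len); // count = charset_size ^ len (the keyspace)
    let mut buffer = Vec::with_capacity(n * len as usize);
    for i in 0..n {
        // The digits of i in base charset.len() select the bytes,
        // most significant position first.
        let mut rem = i;
        let mut bytes = vec![0u8; len as usize];
        for b in (0..len as usize).rev() {
            bytes[b] = charset[rem % charset.len()];
            rem /= charset.len();
        }
        buffer.extend_from_slice(&bytes);
    }
    buffer
}

fn main() {
    let perms = permutations_column_major(b"ab", 2);
    // Columns are "aa", "ab", "ba", "bb", flattened in order.
    assert_eq!(perms, b"aaabbabb");
    println!("{}", String::from_utf8(perms).unwrap()); // aaabbabb
}
```

On the GPU side this is the data that would be uploaded (or generated in place) before the hashing loop runs over it.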

I should have the old code somewhere; I'll try to locate it for you.

If you are requesting hashing support via a function in upstream - ArrayFire, then the request has to be moved an issue on upstream repository.

No feature request afaik; I just wanted to know how best to handle this type of computation, where I need to keep processing/iterating through permutations until a result being searched for is found, and to be able to identify that input. The hashing algorithm itself shouldn't matter. If I can find the code it should make more sense :)

polarathene commented 3 years ago

@9prady9 My old code was in a bit of a messy state and I had some trouble getting it to run, but I've pieced together this example that seems to work now. It's only processing 3 inputs instead of GB of permutations generated on the GPU, but it should show one of the ways I approached the hashing algorithm with ArrayFire:

use arrayfire as af;

fn main() {
    // af::set_backend(af::Backend::CUDA); // or af::Backend::OPENCL / af::Backend::DEFAULT
    af::init();
    af::info();

    /////////////////////////////////
    // Generate string inputs to hash
    /////////////////////////////////
    // Inputs are collapsed into a sequence/stream of bytes, single array dimension
    let test_strings = vec!["hello", "howdy", "hallo"];
    let test_bytes = test_strings.iter()
        .flat_map(|s| s.as_bytes())
        .cloned()
        .collect::<Vec<u8>>();
    // println!("test_strings as byte array: {:?}", test_bytes);
    // [104, 101, 108, 108, 111, 104, 111, 119, 100, 121, 104, 97, 108, 108, 111]

    ///////////////////////////
    // Initialize values for AF
    ///////////////////////////
    // Used to size AF array dimensions
    let input_len = test_strings[0].len() as u64;
    let num_inputs = test_strings.len() as u64;

    // Convert to an ArrayFire 2D array (Column Major)
    // 5x3 col(length of bytes for input) x row(number of inputs)
    let inputs_dims = af::Dim4::new(&[input_len, num_inputs, 1, 1]); // [5, 3, 1, 1]
    let inputs = af::Array::new(&test_bytes, inputs_dims);
    // Each input now has its bytes in its own column
    // print(&inputs);
    // [5 3 1 1]
    //    104        104        104 
    //    101        111         97 
    //    108        119        108 
    //    108        100        108 
    //    111        121        111 

    let fnv_dims = af::Dim4::new(&[1, num_inputs, 1, 1]); // [1, 3, 1, 1]

    ////////////////////////////
    // Compute hashes for inputs
    ////////////////////////////
    let hashes = fnv1a32_gpu(&inputs, fnv_dims);
    // print(&hashes);
    // [1 3 1 1]
    // 1335831723 3497709110 4182196071
    // println!("hashes: {:x} | {:x} | {:x}", 1_335_831_723_u32, 3_497_709_110_u32, 4_182_196_071_u32);
    // 4f9f2cab | d07ace36 | f9473f67

    ///////////////////////
    // Find matching hashes
    ///////////////////////
    // Creates an AF array filled with the same target value to match for,
    let target_hashes: af::Array<u32> = af::constant(0x4f9f2cab as u32, fnv_dims);
    // let target_hashes: af::Array<u32> = af::Array::new(&[0xf9473f67, 0x4f9f2cab], fnv_dims);
    let matches = check_for_matches(hashes, target_hashes);
    // println!("matched: {:?}", match_data);
    // matched: [1335831723]
    for matched in matches {
        println!("Matched the hash: {:x}", &matched);
    };
    // Matched the hash: 4f9f2cab
}

// FNV-1a (32-bit) hashing algorithm
const FNV_OFFSET: u32 = 2_166_136_261;
const FNV_PRIME: u32 = 16_777_619;
pub fn fnv1a32_gpu(inputs: &af::Array<u8>, fnv_dims: af::Dim4) -> af::Array<u32> {
    let input_len = inputs.dims().get()[0];

    let mut hashes = af::constant(FNV_OFFSET, fnv_dims);
    let prime = af::constant(FNV_PRIME, fnv_dims);

    // Iterates through all inputs (columns) in parallel, one byte (row) at a time
    for row_index in 0..input_len {
        hashes = (hashes ^ af::row(&inputs, row_index)) * &prime;
    }

    hashes
}

fn filter_matches(array: &af::Array<u32>, bools: &af::Array<bool>) -> af::Array<u32> {
    let indices = &af::locate(bools);
    let mut idxr = af::Indexer::default();
    idxr.set_index(indices, 0, None);
    af::index_gen(array, idxr)
}

fn check_for_matches(hashes: af::Array<u32>, target_hashes: af::Array<u32>) -> Vec<u32> {
    // If we computed the same hash value, then `eq()` will let us know which values matched
    let is_matched: af::Array<bool> = af::eq(&hashes, &target_hashes, false);

    // Only keep results that were matched
    let result = filter_matches(&hashes, &is_matched);

    // Transfer the matches to the host CPU to access
    let length = result.elements() as usize;
    let mut match_data: Vec<u32> = vec![0; length];
    result.host::<u32>(&mut match_data);

    match_data
}

As the data input to process is finite and won't take long, there's no logic for continuing or stopping the computation when a match is found. I can try to get my GPU permutator code working again and share that if helpful. I think for the long-duration processing I wanted to create a streaming iterator and was blocked by GATs (Generic Associated Types), which are still not implemented in Rust.

Perhaps the above could still be refined into a useful example for ArrayFire? I could provide a similar non-ArrayFire version if you see value in this as an example of how to think/approach writing for the GPU with ArrayFire?

9prady9 commented 3 years ago

Thank you for digging up the old code.

As I said earlier, you are doing individual accesses in the section of code below:

    // Iterates through all inputs (columns) in parallel, one byte (row) at a time
    for row_index in 0..input_len {
        hashes = (hashes ^ af::row(&inputs, row_index)) * &prime;
    }

Sure, all these operations are async, but the bytes of a single column of input are accessed individually in each iteration of the loop, which is what I said earlier would be inefficient. That is, it can be done better with a kernel written specifically for this kind of hashing algorithm to run on a batch of inputs.

polarathene commented 3 years ago

Oh? I had thought that it was taking the whole row and computing against that, rather than each byte being a separate access. I could always rework the data to use columns instead of rows, e.g. 3x5? Or is that the same problem?

Again, this particular part (the hash algorithm) wasn't the performance concern I was having trouble with; it was identifying matches, which involved transferring data to the host each time to look up the matches and decide what to do next. In the above snippet, I am only informed that there was a match, not what the input value was.

The following approach instead retains an index before removing the unmatched elements:

fn check_for_matches(hashes: af::Array<u32>, target_hashes: af::Array<u32>) -> Vec<(usize, u32)> {
    // If we computed the same hash value, then `eq()` will let us know which values matched
    let is_matched: af::Array<bool> = af::eq(&hashes, &target_hashes, false);

    // Any unmatched value becomes 0
    let result = af::mul(&hashes, &is_matched, false);

    // Transfer the matches to the host CPU to access
    let length = result.elements() as usize;
    let mut match_data: Vec<u32> = vec![0; length];
    result.host::<u32>(&mut match_data);

    let results: Vec<(usize, u32)> = match_data.into_iter()
      .enumerate()
      .filter(|&(_, val)| val != 0)
      .collect();
    // println!("results: {:?}", results);
    // results: [(0, 1335831723)]

    results
}

// ... main()
    let matches = check_for_matches(hashes, target_hashes);
    for (index, value) in matches {
        println!(
            "Matched the hash: {:x}, original input was: {:?}", 
            value, 
            test_strings[index]
        );
    };
    // Matched the hash: 4f9f2cab, original input was: "hello"

According to my notes, multiplying by the bool array was faster than the locate approach too.
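For illustration, the trade-off can be modeled entirely on the host. This hypothetical sketch mirrors the eq + mul + enumerate/filter steps on plain Vecs; it assumes 0 is never itself a valid matching hash, since masked-out misses become 0:

```rust
// Host-side model of the two match-extraction strategies discussed:
// multiply-by-mask keeps the array dense (zeros for misses), so original
// indices survive the device-to-host transfer and can identify the input;
// a locate-style compaction on the device discards them.
fn matches_via_mask(hashes: &[u32], target: u32) -> Vec<(usize, u32)> {
    hashes
        .iter()
        .map(|&h| if h == target { h } else { 0 }) // models eq() + mul()
        .enumerate()
        .filter(|&(_, h)| h != 0) // host-side filtering after the transfer
        .collect()
}

fn main() {
    let hashes = [0x4f9f2cab_u32, 0xd07ace36, 0xf9473f67];
    assert_eq!(matches_via_mask(&hashes, 0x4f9f2cab), vec![(0, 0x4f9f2cab)]);
    println!("{:x?}", matches_via_mask(&hashes, 0x4f9f2cab));
}
```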


it can done in a better way with a kernel written especially for this kind of hashing algorithm to run on a batch of inputs.

How does one go about writing these kernels? I take it that's outside of ArrayFire? Is that direct OpenCL/CUDA code? One that would be of interest is AES-CBC 256-bit decryption; I guess that's something where a custom kernel would be useful too?

9prady9 commented 3 years ago

Oh? I had thought that it was taking the whole row and computing against that, instead of each being a separate access. I could always rework the data to use columns instead of rows, eg 3x5? Or is that the same problem?

Well, it is running on the GPU, except that the memory access is not contiguous. Notice that the elements of a single input are being handled in the host-side for loop. Each such value, for every input in every iteration, is input_len apart in GPU memory. It is a valid algorithm, just not efficient in terms of GPU memory access. If you transpose the input so that the bytes of one stream lie along dimension 1, i.e. a row in ArrayFire (ArrayFire follows column-major order), then each pass that handles the xor and * operations for a single byte of all input streams accesses contiguous memory. This should give a better runtime than before.
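A quick CPU-side sketch of what that transpose means for the flat buffer (illustrative plain Rust, not ArrayFire code):

```rust
// Column-major storage: element (row r, col c) of an [R, C] array lives at
// index c*R + r. With inputs along rows (dims [num_inputs, input_len]),
// column b holds byte b of every input contiguously, so each step of the
// hash loop reads one contiguous block, i.e. coalesced access on the GPU.
fn transpose_inputs(strings: &[&str]) -> Vec<u8> {
    let num_inputs = strings.len();
    let input_len = strings[0].len(); // assumes fixed-length inputs
    let mut buf = vec![0u8; num_inputs * input_len];
    for (i, s) in strings.iter().enumerate() {
        for (b, &byte) in s.as_bytes().iter().enumerate() {
            buf[b * num_inputs + i] = byte; // column b, row i
        }
    }
    buf
}

fn main() {
    let buf = transpose_inputs(&["hello", "howdy", "hallo"]);
    // Byte 0 of all three inputs is one contiguous slice: 'h', 'h', 'h'.
    assert_eq!(buf[0..3], [104u8, 104, 104]);
    // Byte 1: 'e', 'o', 'a'.
    assert_eq!(buf[3..6], [101u8, 111, 97]);
    println!("{:?}", &buf[0..6]);
}
```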

Having said that, I still believe a custom kernel can be written for this algorithm that can be faster than my above suggestion.

What is the runtime you are getting for locate function? Also, what is the size of these Arrays, target_hashes/computed_hashes typically?

Did you time the logic that multiplies with zero, brings back the result to host, then finds the indices? How much is that time?

polarathene commented 3 years ago

Having said that, I still believe a custom kernel can be written for this algorithm that can be faster than my above suggestion.

Is there an example or documentation on approaching that?

What is the runtime you are getting for locate function?

I don't have specific time logged unfortunately, just the old comment about it.

what is the size of these Arrays, target_hashes/computed_hashes typically?

The input values would iterate through lengths, so for a given length all inputs were that fixed length; typical lengths would be 6-12 if I recall correctly. For the Jenkins hash, which is more involved, it was being used on filepaths as inputs; those would be much longer, but permutations were often a similar 6-12 length for the filename, with the paths only permuting known directories.

The actual size of the array of permutations would depend on GPU memory; the number of permutations could be very large and require many hours, if not days, to brute force. It was similar to a popular piece of software called Hashcat, commonly used for password recovery. In my code I moved permutation generation to the GPU and referred to it as "tiling" the charset; it would fit as many of those tiles as possible to cover the keyspace in GPU memory and operate on them in as few passes/iterations over the inputs as possible. I have an Nvidia GTX 1070 with 8GB of VRAM, and I think I used 6-8GB, with high or full utilization of the CUDA cores.

Did you time the logic that multiplies with zero, brings back the result to host, then finds the indices ? How much is that time ?

I timed most parts individually; there was a macro for it, possibly one for the CPU and another that was ArrayFire-specific. I would need to look through that code again. While I could probably run the code again with the timing logic added to compare for you, at present I'm only able to use the CPU backend, I think: my system is low on disk space and can't install CUDA as a result, and I'm unsure about OpenCL.

9prady9 commented 3 years ago

Is there an example or documentation on approaching that?

I would have to think it through. Off the top of my head, it would most likely start out similar to your OpenCL kernel implementation, I believe. But one thing is certain: the data would have to be laid out so that GPU global memory accesses from all threads are coalesced when the input bytes of the different streams are read. That is the most obvious part. Then it would probably involve some loop unrolling for processing a single input byte stream, if we have prior knowledge of the minimum number of bytes, etc.

The input values would iterate through lengths, so for a given length, all inputs were that fixed length each, typical lengths would be 6-12 if I recall correctly. For the Jenkins hash which is more involved, it was being used on filepaths as inputs, those would be much longer but permutations were often a similar 6-12 length of the filename with the paths only permutating known directories.

I wonder how locate is throttling the operation if a single input is so small and the actual number of such inputs is huge, given they run in parallel as independent operations. You should try rearranging the data so that access to GPU memory is coalesced, as I mentioned earlier.

I timed most parts individually, there was a macro for it, possibly one for CPU and another that was ArrayFire specific, I would need to look through that code again.

If possible, please share the timing code. ArrayFire is better timed as an average of multiple runs, given that there is a device warmup cost and kernel compilation for any functions used on their first call. Also, multiple JIT operations coalesce into a single kernel; if you time individual statements you are likely doing more work than necessary, so ArrayFire JIT operations are better timed as a whole functional unit.
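As a sketch of the kind of harness I mean (plain Rust, illustrative; the untimed warmup run absorbs the one-time costs, and with ArrayFire the closure should also synchronize the device, e.g. via af::sync, before the clock is read, since operations are asynchronous):

```rust
use std::time::Instant;

// Illustrative averaged-timing harness: one untimed warmup run absorbs
// one-time costs (device warmup, kernel compilation), then the mean of
// N timed runs is reported.
fn time_avg<F: FnMut()>(mut work: F, runs: u32) -> f64 {
    work(); // warmup, not timed
    let start = Instant::now();
    for _ in 0..runs {
        work();
    }
    start.elapsed().as_secs_f64() / runs as f64
}

fn main() {
    // Stand-in workload; in practice this closure would hold the whole
    // JIT-fused ArrayFire functional unit being measured.
    let data: Vec<u64> = (0..100_000).collect();
    let avg = time_avg(|| { let _ = data.iter().sum::<u64>(); }, 10);
    println!("avg per run: {:.6}s", avg);
}
```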

9prady9 commented 3 years ago

Also, let us move this discussion to Slack or perhaps another issue, as we moved away from the issue's original intent some time back :) which is "code examples". Feel free to raise another issue with the relevant details, or we can discuss over Slack DM.

polarathene commented 3 years ago

as we moved away from the issue's original intent some time back :) which is "code examples".

Probably would make more sense to rename this issue and recreate the "Add code examples" one as a new issue at this point :sweat_smile:

I've created a new issue for what the bulk of this thread ended up focusing on, and I'll try to find some time to meet your requests for the additional code and the data layout change. Do you want me to continue sharing code fences in the issue thread comments, or would a repo be preferred?

Feel free to raise another issue with the relevant details, or we can discuss over Slack DM.

Not a Slack user; I am fine with Discord, Facebook, email, or a forum if you have one. Otherwise the GitHub issue works well for me and might be helpful to someone else (although I guess this is a bit of a niche topic).

9prady9 commented 3 years ago

Probably would make more sense to rename this issue and recreate the "Add code examples" one as a new issue at this point :sweat_smile:

Perhaps :) I meant this issue as a placeholder to keep adding examples, both standalone ones and documentation snippets, rather than being about any one example in particular. So I basically don't expect this issue to be closed anytime soon, at least not until every API has at least one example in its documentation.

Thanks for moving the discussion to a specific issue. A GitHub issue works fine for us; I just didn't want too much specific discussion in the placeholder issue.

polarathene commented 3 years ago

I'd appreciate keeping the history of the discussion here, so if wanting to hide the lengthy discussion, please consider marking as outdated/off-topic rather than deleting :)

9prady9 commented 3 years ago

I'd appreciate keeping the history of the discussion here, so if wanting to hide the lengthy discussion, please consider marking as outdated/off-topic rather than deleting :)

:+1: I am not deleting anything; I just wanted to move it to a separate issue so that this one doesn't get too crowded with auxiliary/tangential conversation.