BallisticLA / RandBLAS

A header-only C++ library for sketching in randomized linear algebra
https://randblas.readthedocs.io/en/latest/

Generation of dense sketching operators doesn't return updated RNGState #57

Closed rileyjmurray closed 1 year ago

rileyjmurray commented 1 year ago

Ping @kaiwenhe7. It looks like your big PR a couple of weeks back introduced a bug. The current implementation (renamed to fill_dense_submat_impl a while back) can be found here https://github.com/BallisticLA/RandBLAS/blob/9debb57a34e07263f4c98356f740629cd1106121/RandBLAS/dense.hh#L203-L270

The old implementation can be found right around here: https://github.com/BallisticLA/RandBLAS/commit/29ddfb7d182f990035352885a1721c6dec9bdfc4#diff-454044b3f0993309c60438575edec1d146c1cdfd03f6ca6002c59fb8729bec8cL208

The problem with the new implementation is that the RNGState it returns is based on {c, k}, but c is never updated in the function. Compare this to the old implementation, which did update c, e.g., here: https://github.com/BallisticLA/RandBLAS/commit/29ddfb7d182f990035352885a1721c6dec9bdfc4#diff-454044b3f0993309c60438575edec1d146c1cdfd03f6ca6002c59fb8729bec8cL267

rileyjmurray commented 1 year ago

@TeachRaccooon is going to try working on this first.

kaiwenhe7 commented 1 year ago

Oh yes! In the parallel case, each thread creates its own copy of c, and I forgot to update the actual counter at the end. It should be really simple to just advance c all the way to the final state before returning.

TeachRaccooon commented 1 year ago

@kaiwenhe7 Guess my RandBLAS skills aren't strong enough; I wasn't able to fix your code. If you know the solution, please also add a basic test for it so that we can make sure this issue is fully taken care of.

kaiwenhe7 commented 1 year ago

@TeachRaccooon The solution is more like a hack. Since the main body of the code only uses a private copy of c, the actual c is left untouched during the whole process of submatrix generation. Given the position of the last entry of the submatrix (entry n_scols*n_srows), the associated counter position can be computed. So before returning the state, I was thinking of just advancing the counter all the way to that final position.

TeachRaccooon commented 1 year ago

@kaiwenhe7 do you mean just literally doing c.incr(n_srows * n_scols); return RNGState<RNG> {c, k};?

kaiwenhe7 commented 1 year ago

@TeachRaccooon Yes. But I think it would be something like c.incr((int64_t) ((ptr + n_scols + (n_srows-1)*n_cols) / RNG::ctr_type::static_size))

It's the position of the last entry of the submatrix within the larger random matrix. And since each counter is associated with an array of random numbers of size RNG::ctr_type::static_size, the start of the counter sequence corresponds to the first entry of the larger random matrix.

burlen commented 1 year ago

@rileyjmurray there's another subtle issue with the new implementation. OpenMP does not specify how loop iterations are divided among threads, and there are different ways to handle the case where the number of threads does not evenly divide the loop count. Therefore there is no portable way to calculate the per-thread start index needed to initialize the counter so that the sequence is the same independent of the number of threads (line 243 above). This can lead to the same counter values being used in different threads: the implementation might give the correct result when, for instance, the number of threads evenly divides the number of rows, but not otherwise. This will be a very subtle bug to identify and fix later on, since the behavior depends on the number of rows, the number of threads, and the OpenMP implementation. A user may not even notice that different threads are generating the same values. This is why the old implementation did not use OpenMP parallel for, and instead explicitly partitioned loop indices to threads.

rileyjmurray commented 1 year ago

@burlen thanks for those insights. We'll be adding tests to try and stress-test the cases you mentioned. If we can't get them to pass then I'd be happy for us to consider alternative implementations, such as partitioning indices as you suggest.

kaiwenhe7 commented 1 year ago

@burlen @rileyjmurray For OpenMP, is it incorrect to assume that the for loops are partitioned into contiguous blocks, where each thread handles one block? I changed the submatrix generation test so that the dimensions are prime, and it seems to pass all the tests so far.

burlen commented 1 year ago

Actually, there is a test for this: TestDenseThreads varies the number of threads and checks that the matrix is the same as with 1 thread. Since your code passes that, it must be correct. I think I was wrong about the new code duplicating values, and I think I see why. Sorry for the confusion on my part!

OpenMP has different scheduling strategies. Static scheduling with a fixed chunk size is what you're describing; to ensure that is what's used, you have to add schedule(static) to your directive. If you don't set a chunk size via schedule(static, chunk_size), there's ambiguity in how the implementation deals with the remainder of iterations when the number of threads doesn't divide the loop count evenly; but if you do set a chunk size, it may be sub-optimal. And if you don't specify a scheduling strategy explicitly, it's up to the implementation which one to use; it could choose dynamic scheduling. With dynamic scheduling the chunk size defaults to 1, and threads take a new index as soon as they finish processing the current one, so indices are processed one by one in a non-deterministic order.

At any rate, I'm seeing now that your code does not depend on how OpenMP partitions the iterations; you're using the loop index to back out the counter value. I think this is correct. My bad!

rileyjmurray commented 1 year ago

Resolved by #62.