Chunked Iterators - Githubissues

dimforge / nalgebra

Linear algebra library for Rust.

https://nalgebra.org

Apache License 2.0

3.98k stars 475 forks source link

Chunked Iterators #799

Open Turakar opened 3 years ago

Turakar commented 3 years ago

Hello, thanks for this awesome library! :heart:

Proposal: I want to propose a new function called iter_chunked(&self) and associates (_mut, into_, row, column) allowing for a chunked iteration over the array. The behaviour regarding remainders could be copied from std. The benefit of this new function over the std's chunks_exact would be type level support (i.e. you get a Matrix<N, R, C, S>) and maybe some optimization possibilities.

Use Case: Destructuring a matrix to share it among workers for parallelized processing.

What do you think of this? I am new to this library so it might easily be that I missed a function better suitable for my use case.

Andlon commented 3 years ago

Hi @Turakar! If I understand correctly, you want to be able to iterate over fixed-size blocks of a matrix? For example, for

[A B
 C D]

where A, B, C and D are same-size blocks, your iterator would give the sequence [A, B, C, D] (row-major) or [A, C, B, D] (col-major). Is this what you have in mind?

Implementing custom iterators takes some work, so it's good to have motivating use cases that are likely to benefit many users. So it's great that you already wrote something about this. However, I don't think an iterator of this kind is appropriate for parallelization, as a "standard iterator" (what you find in std) is inherently sequential, with no opportunity for parallelism. If you want to have parallel iterators, it would make sense to look to the rayon library. However, nalgebra does not have any rayon integration at all at the moment.

Perhaps if you share a little bit more about what you want to do, we can find a different way to approach your problem?

Turakar commented 3 years ago

Your proposed behaviour would suffice for my problem. My problem is as follows:

N worker threads, connected via sync::mpsc::channel() in both directions
One main thread distributing the work and collecting the results.
Each worker thread uses AVX to process 4 work packages in parallel

If a set of work packages (a (N*4) x M matrix, where M is the number of values per work package) arrives at the main thread, it should iterate through all Senders and send a 4 x M matrix to each one of them. (After that there is a second loop waiting for the results, but this does not affect my proposal.)

I had problems with rayon because I use thread-local storage accross executions and I could not figure out how to integrate this with rayon.

Andlon commented 3 years ago

Thanks for explaining your use case a bit more! I'll leave it to @sebcrozet to say if he wants to have this in nalgebra.

My intermediate reaction (and personal opinion) is that this is a bit of a niche use case, and I'm not sure if it balances out the need for tests, documentation and maintenance, given that iterating over indices and picking out blocks with e.g. fixed_slice, generic_slice, or get((.., ..)) is not too difficult and gives a lot more flexibility.

As for thread locals: I also found working with rayon and thread local variables pretty difficult until I discovered the thread-local crate, which really made it much easier to have thread-local state for each thread!

Turakar commented 3 years ago

Based on my knowledge, iteration with an intege variable (e.g. for i in 0..N) is an anti-pattern in rust due to performance. But I certainly understand that the complexity of this iterator is comparingly high. Thanks for the tip with the thread-local crate.

Andlon commented 3 years ago

You are definitely right that "indexed iteration" is often discouraged for simple iteration schemes. One reason is performance, but another reason is that it often anyway leads to simpler code. On the other hand, when dealing with matrices, it's in my opinion natural to describe algorithms in terms of (row, col) indices into the matrix, whereas iterator APIs may be ambiguous or require intimate knowledge of how the specific iterator accesses the matrix to understand exactly what's going on. Unless your blocks are very small, I suppose the overhead of indexing here should be minimal. Of course, as with everything else performance-related, you'd need to measure this first ;-)

I might clarify that I'm not against a nice block-matrix iteration API, I just think it's worthwhile to consider the cost-benefit ratio for nalgebra users and maintainers!