argon2: parallel memory view abstraction

The previous parallel implementation of Argon2 was removed in #247. There were soundness concerns about how it handled mutable aliasing, namely that it's unsound for mutable aliases to exist even if they aren't accessed.

This is a discussion issue for how to build a sound abstraction for use in a new parallel implementation of Argon2. Please make sure to read through this issue before attempting a new parallel implementation.

Argon2 memory layout

The memory layout used by Argon2 is somewhat difficult to model in Rust due to how it's arranged, though this choice of arrangement is due to its goals of sequential memory hardness.

https://datatracker.ietf.org/doc/html/rfc9106#section-3.4

3.4. Indexing

To enable parallel block computation, we further partition the memory
matrix into SL = 4 vertical slices. The intersection of a slice and a
lane is called a segment, which has a length of q/SL. Segments of the
same slice can be computed in parallel and do not reference blocks
from each other. All other blocks can be referenced.

    slice 0    slice 1    slice 2    slice 3
    ___/\___   ___/\___   ___/\___   ___/\___
   /        \ /        \ /        \ /        \
  +----------+----------+----------+----------+
  |          |          |          |          | > lane 0
  +----------+----------+----------+----------+
  |          |          |          |          | > lane 1
  +----------+----------+----------+----------+
  |          |          |          |          | > lane 2
  +----------+----------+----------+----------+
  |         ...        ...        ...         | ...
  +----------+----------+----------+----------+
  |          |          |          |          | > lane p - 1
  +----------+----------+----------+----------+
Figure 9: Single-Pass Argon2 with p Lanes and 4 Slices

(Note: since slices are an important concept in Rust, to disambiguate I'll throw scare quotes on Argon2 "slices")

The parallel implementation is described as follows in the Argon2 paper:

https://www.password-hashing.net/argon2-specs.pdf

We suggest the following solution for p cores: the entire memory is split into p lanes of l equal slices each, which can be viewed as elements of a (p×l)-matrix Q[i][j]. Consider the class of schemes given by Equation (1).

We modify it as follows: • p invocations to H run in parallel on the first column Q[∗][0] of the memory matrix. Their indexing functions refer to their own slices only; • For each column j > 0, l invocations to H continue to run in parallel, but the indexing functions now may refer not only to their own slice, but also to all jp slices of previous columns Q[∗][0], Q[∗][1], . . . , Q[∗][j−1]. • The last blocks produced in each slice of the last column are XORed.

This scheme prevents mutable aliasing because for each thread in p lanes, it has exclusive mutable access to segment j in its own lane, but has a shared view of the segments of previous lanes.

To "rustify" the above diagram, for j = 2 it would allow the following access:

    slice 0    slice 1      slice 2      slice 3
    ___/\___   ___/\___   _____/\_____   ___/\___
   /        \ /        \ /            \ /        \
  +----------+----------+--------------+----------+
  | &[Block] | &[Block] | &mut [Block] |          | > lane 0
  +----------+----------+--------------+----------+
  | &[Block] | &[Block] | &mut [Block] |          | > lane 1
  +----------+----------+--------------+----------+
  | &[Block] | &[Block] | &mut [Block] |          | > lane 2
  +----------+----------+--------------+----------+

Which is to say, each thread has exclusive &mut [Block] access to the jth segment in their lane, and shared &[Block] access to all segments in all lanes for previous "slices" prior to j.

To make this work, I would suggest having a type like struct Memory which defines iterators that can hand out slice-and-lane-specific "views" of memory while ensuring no mutable aliasing occurs.

This type can use dynamically checked borrow rules to avoid invalid accesses, e.g. if j were 2 and an immutable view of a segment in "slice" 2 were requested, this would cause an error. All of the borrows (i.e. the ones held by the "views") would have to be returned before j could be advanced, at which point a new "view" can be constructed per-lane for the next iteration of j.

RustCrypto / password-hashes

argon2: parallel memory view abstraction #380

Argon2 memory layout