This PR fixes a silly mistake in old code: the reduction tree was designed to add up "columns" of the provided memories in parallel to get intermediate sums, and to then add up the intermediate sums to get the overall sum. It was incorrectly doing all of this in sequence.
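As a rough illustration of the intended shape (not the code in this PR), here is a minimal Python sketch of a two-stage reduction. It assumes a "column" means the elements at the same index across all memories, and it uses a thread pool only as a stand-in for parallel hardware. The names `column_sum` and `reduce_memories` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def column_sum(memories, i):
    """Sum element i of every memory (one "column"). Hypothetical helper."""
    return sum(mem[i] for mem in memories)


def reduce_memories(memories):
    """Sum every entry of equally sized memories in two stages."""
    if not memories:
        return 0
    width = len(memories[0])
    if any(len(mem) != width for mem in memories):
        raise ValueError("all memories must have the same length")

    # Stage 1: no column sum depends on another column, so these
    # tasks are independent and can be scheduled in parallel.
    with ThreadPoolExecutor() as pool:
        partial_sums = list(
            pool.map(lambda i: column_sum(memories, i), range(width))
        )

    # Stage 2: combine the intermediate sums into the overall sum.
    return sum(partial_sums)
```

For example, `reduce_memories([[1, 2, 3], [4, 5, 6]])` forms the column sums 5, 7, and 9 independently and then adds them to get 21. The sequential version produces the same result; the difference lies only in how stage 1 is scheduled.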