column-major access in generic triangular solves?

The generic triangular solves (naivesub!) in 65937f43938e8b4d644f13931b7b11a9bf63fffe/base/linalg/triangular.jl#L765-L831 use row-major access. Given column-major storage is the norm in numerical linear algebra, should these methods instead use column-major access? To illustrate the performance impact, this code:

function colmajnaivesub!(A::LowerTriangular, b::AbstractVector, x::AbstractVector = b)
    n = size(A, 2)
    if !(n == length(b) == length(x))
        throw(DimensionMismatch("second dimension of left hand side A, $n, length of output x, $(length(x)), and length of right hand side b, $(length(b)), must be equal"))
    end
    for j in 1:n
        A[j,j] == zero(A[j,j]) && throw(SingularException(j))
        xj = x[j] = A[j,j] \ b[j]
        for i in j+1:n
            b[i] -= A[i,j] * xj
        end
    end
    x
end

m = 5000;
bo = rand(m); b = bo;
L = LowerTriangular(eye(m) + randn(m, m)/(2m));

using Benchmarks
Benchmarks.@benchmarkable(benchbase, copy!(b, bo), Base.LinAlg.naivesub!(L, b), nothing)
Benchmarks.@benchmarkable(benchcolmaj, copy!(b, bo), colmajnaivesub!(L, b), nothing)
println("Base meth"); Benchmarks.execute(benchbase)
println("Col-major"); Benchmarks.execute(benchcolmaj)

yields

Base meth
================ Benchmark Results ========================
     Time per evaluation: 142.07 ms [139.38 ms, 144.77 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 68
   Number of evaluations: 68
 Time spent benchmarking: 9.89 s

Col-major
================ Benchmark Results ========================
     Time per evaluation: 31.81 ms [30.55 ms, 33.06 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 3.26 s

I imagine I'm missing something important? If not and impressions are positive, I would be happy to prepare a PR covering the relevant methods. Perhaps eventually both column- and row-major storage could be handled efficiently via dispatch on an access-order trait or the like? Best!

JuliaLang / LinearAlgebra.jl

column-major access in generic triangular solves? #293