xGETC2 order of loops in finding max element of submatrix

Goddan-wq commented 4 months ago

Hello everyone

I've noticed that the order of the loops in xGETC2 is not optimal. The inner loop walks along a row of the matrix, but our matrices are stored in column-major order. Wouldn't it be better to swap the loops so that we get a more cache-friendly algorithm?

This can make a difference when the matrix contains two maximum elements of equal magnitude: we then get a different LU decomposition and different ipiv/jpiv arrays. But it seems that both of these results should be correct.

For example, see SRC/sgetc2.f, lines 178-187:

      XMAX = ZERO
      DO 20 IP = I, N
         DO 10 JP = I, N
            IF( ABS( A( IP, JP ) ).GE.XMAX ) THEN
               XMAX = ABS( A( IP, JP ) )
               IPV = IP
               JPV = JP
            END IF
   10    CONTINUE
   20 CONTINUE

Swapping the loops makes IP the contiguous (inner) index, so the cache is used more effectively.
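A minimal sketch of what the swapped search could look like, written as a self-contained free-form Fortran program rather than in the fixed-form style of SRC/sgetc2.f. The program name, the 4x4 test matrix, and the choice I = 1 are assumptions made for illustration; this is not the patch that was eventually merged.

```fortran
program pivot_search_sketch
   implicit none
   integer, parameter :: n = 4
   real :: a(n, n), xmax
   integer :: i, ip, jp, ipv, jpv

   ! Illustrative data (assumption): all ones, with the entry of
   ! largest magnitude placed at row 3, column 2.
   a = 1.0
   a(3, 2) = -7.0

   i = 1          ! current elimination step, as in xGETC2
   xmax = 0.0
   ipv = i
   jpv = i

   ! Columns in the outer loop, rows in the inner loop: consecutive
   ! inner iterations touch adjacent memory in column-major storage.
   do jp = i, n
      do ip = i, n
         if (abs(a(ip, jp)) >= xmax) then
            xmax = abs(a(ip, jp))
            ipv = ip
            jpv = jp
         end if
      end do
   end do

   ! For the data above the search selects row 3, column 2.
   print *, 'pivot row', ipv, 'pivot column', jpv, 'magnitude', xmax
end program pivot_search_sketch
```

The comparison and the updates are unchanged from the original snippet; only the nesting of the two loops differs.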

langou commented 4 months ago

These are three good points. (1) Replacing this (I,J) loop with a (J,I) loop should give better performance for column-major matrices. (2) Changing the loops from (I,J) to (J,I) might change the chosen pivot when two entries tie, and so might change the permutations. (3) These outputs, while different, are equally valid complete-pivoting factorizations P A Q = L U.
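Point (2) can be illustrated with a small sketch. Because the test uses .GE., the last entry visited among those of maximal magnitude wins, so the visiting order decides the tie. The program below is a hypothetical demonstration (the matrix and all names are assumptions, not LAPACK code) that runs both loop orders on a 3x3 matrix with two entries of magnitude 5.

```fortran
program tie_break_demo
   implicit none
   integer, parameter :: n = 3
   real :: a(n, n), xmax
   integer :: ip, jp, ipv, jpv

   ! Two entries of equal magnitude (illustrative data).
   a = 0.0
   a(1, 3) = 5.0
   a(3, 1) = -5.0

   ! (I,J) order: rows in the outer loop, as in the current code.
   xmax = 0.0
   do ip = 1, n
      do jp = 1, n
         if (abs(a(ip, jp)) >= xmax) then
            xmax = abs(a(ip, jp))
            ipv = ip
            jpv = jp
         end if
      end do
   end do
   ! Visits (1,3) before (3,1), so the pivot ends up at (3,1).
   print *, '(I,J) order picks row', ipv, 'column', jpv

   ! (J,I) order: columns in the outer loop, as proposed.
   xmax = 0.0
   do jp = 1, n
      do ip = 1, n
         if (abs(a(ip, jp)) >= xmax) then
            xmax = abs(a(ip, jp))
            ipv = ip
            jpv = jp
         end if
      end do
   end do
   ! Visits (3,1) before (1,3), so the pivot ends up at (1,3).
   print *, '(J,I) order picks row', ipv, 'column', jpv
end program tie_break_demo
```

Both choices have the same magnitude, so either is an admissible complete-pivoting pivot; only the resulting permutations differ.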

It is not clear how much performance gain there would be, if any. That being said, I feel that, whenever possible in LAPACK, we want to write our loops with column-major storage in mind, and so, for the sake of consistency, I feel it is better to have a (J,I) loop than an (I,J) loop here. It would be nice to know whether there is a gain in practice.

It is not clear how problematic a routine with a (J,I) loop would be in the current software stack. For example, would the (J,I) loop variant pass our own LAPACK test suite? And, more generally, would it be a problem for applications that expect the (I,J) loop order in case of a tie? I do not know.

My opinion: All in all, I would be fine with reversing the loops from (I,J) - current, to (J,I) - proposed. If it passes the LAPACK test suite, then I think that should be fine and we could merge this.

Goddan-wq commented 4 months ago

OK then, I will check whether the tests pass with this optimization and open a pull request. If they do not pass, I will report that here.

Actually, I forgot to mention a couple of things about this optimization. First, it is better for the cache. Second, when 'I' is the contiguous index, the loop nest is easier to vectorize. As far as I remember, the routine spends more than 50% of its time in this loop nest, so it is a hot spot. Of course, this depends on the architecture, on how well the other routines are optimized, and on the order of the input matrices. But in my experience it is a hot spot. I also have the opportunity to check performance on a couple of architectures, and I can share the results here. In any case, I think it is important to change the order of the loops here.
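One way such a measurement could be set up is sketched below. It times the search alone on a random matrix, not the full xGETC2 routine; the matrix order 2000, the use of system_clock, and all names are assumptions chosen for illustration, and no particular timings are implied. An optimizing compiler may interchange the loops by itself, so results are best compared at several optimization levels.

```fortran
program loop_order_timing
   use iso_fortran_env, only: int64
   implicit none
   integer, parameter :: n = 2000      ! illustrative matrix order
   real, allocatable :: a(:, :)
   real :: xmax
   integer :: ip, jp, ipv, jpv
   integer(int64) :: t0, t1, rate

   allocate(a(n, n))
   call random_number(a)

   ! (I,J) order: the inner loop strides across a row.
   call system_clock(t0, rate)
   xmax = 0.0
   ipv = 1
   jpv = 1
   do ip = 1, n
      do jp = 1, n
         if (abs(a(ip, jp)) >= xmax) then
            xmax = abs(a(ip, jp))
            ipv = ip
            jpv = jp
         end if
      end do
   end do
   call system_clock(t1)
   ! Printing the pivot keeps the compiler from discarding the loop.
   print *, '(I,J) seconds:', real(t1 - t0) / real(rate), 'pivot', ipv, jpv

   ! (J,I) order: the inner loop runs down a column (stride 1).
   call system_clock(t0)
   xmax = 0.0
   ipv = 1
   jpv = 1
   do jp = 1, n
      do ip = 1, n
         if (abs(a(ip, jp)) >= xmax) then
            xmax = abs(a(ip, jp))
            ipv = ip
            jpv = jp
         end if
      end do
   end do
   call system_clock(t1)
   print *, '(J,I) seconds:', real(t1 - t0) / real(rate), 'pivot', ipv, jpv
end program loop_order_timing
```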

Thanks for your answer

langou commented 4 months ago

fixed with #1023

Reference-LAPACK / lapack

xGETC2 order of loops in finding max element of submatrix #1021