[Performance issue] Temporary array not removed

kiranchandramohan commented 4 years ago

Flang inserts a temporary array in the computation of the following code in the program given below. val2(1:dim1,i) = val2(1:dim1,i)*val1(i)

There is no temporary needed here since even if val1 aliases with val2, it does not affect the evaluation. The evaluation of the array assignment expression here depends only on a single value in val1, i.e. val1(i). As long as this value is read and stored in a register/temporary location, even if val1(i) is written into during the evaluation of the array assignment it is fine and the results do not change. For val2 the lhs and rhs refer to the same section and indices, so each element can be multiplied by val1(i) and updated in place without a temporary array.

subroutine cdata(n,dim1,val2,val1)
  implicit none

  integer :: n,dim1
  real(4), pointer :: val2(:,:)
  real(4), pointer :: val1(:)
  integer :: i

  do i=1,n
    val2(1:dim1,i) = val2(1:dim1,i)*val1(i)
  end do

end subroutine cdata

kiranchandramohan commented 4 years ago

Background Fortran assignment has the syntax variable = expr. Rules permit both variable and expr to reference any portion of the variable. The requirement is that the evaluation of expr should not be affected by the assignment to variable[1]. For e.g, If the initial value of an array ARR was set as (/1, 2, 3, 4, 5/) and if there is an array assignment ARR(2:5) = ARR(1:4) then the contents of array ARR should be (/1, 1, 2, 3, 4/) and not (/1, 1, 1, 1, 1/).

R1032 assignment-stmt is variable = expr10.2.1.34 Both variable and expr may contain references to any
portion of the variable. 
NOTE 2 For example, in the character intrinsic assignment statement: STRING (2:5) = STRING (1:4)the
assignment of the first character of STRING to the second character does not affect the evaluation of
STRING (1:4). If the value of STRING prior to the assignment was ’ABCDEF’, the value following the
assignment is ’AABCDF’.

Fortran compilers typically achieve this by inserting a temporary copy of the array for the expression on the RHS and then assign the variable from this temporary copy. So the code generated might look like the following. Note the temporary array created tmp_ARR. Unlike in the example below, typically this temporary is created on the heap.

integer :: ARR(5) = (/1, 2, 3, 4, 5/)
integer :: tmp_ARR(5)
integer :: i, j
do i=1,5
  tmp_ARR(i) = ARR(i)
end do
j = 1
do i=2,5
  ARR(i) = tmp_ARR(j)
  j = j + 1
end do

The creation of the temporary on the heap and the copying involved can affect performance particularly if this happens inside a loop. So compilers try to optimize these temporaries away when they are not needed in situations like, 1) When there is no overlap between LHS and RHS. e.g: ARR(1:2) = ARR(3:5) 2) When there is full overlap between LHS and RHS. e.g: ARR(1:5) = ARR(1:5) + constant

These optimization require good dependence analysis. If the arrays involved are pointers then the dependence analysis task becomes even more difficult since pointers can alias. For the Fortran program given below gfortran does not have a temporary array and copy but Flang has.

subroutine cdata(n,dim1,val2,val1)
  implicit none
  integer :: n,dim1
  real(4), pointer :: val2(:,:)
  real(4), pointer :: val1(:)
  integer :: i
  do i=1,n
    val2(1:dim1,i) = val2(1:dim1,i)*val1(i)
  end do
end subroutine cdata

There is no temporary needed here since even if val1 aliases with val2, it does not affect the evaluation. The evaluation of the array assignment expression here depends only on a single value in val1, i.e. val1(i). As long as this value is read and stored in a register/temporary location, even if val1(i) is written into during the evaluation of the array assignment it is fine and the results do not change. For val2 the lhs and rhs refer to the same section and indices, so each element can be multiplied by val1(i) and updated in place without a temporary array.

val2(1:dim1,i) = val2(1:dim1,i)*val1(i)

The backdoor flag, -Hx,4,0x100000, can be used to not add temporary arrays (by turning of dependence analysis for array-assignments/forall). But as expected some programs which need temporary arrays will not work correctly. For e.g. the following flang unit tests will fail.

Flang :: f90_correct/lit/ka65.sh
Flang :: f90_correct/lit/ka66.sh
Flang :: f90_correct/lit/ka67.sh
Flang :: f90_correct/lit/ka68.sh
Flang :: f90_correct/lit/ka69.sh

The forall/array assignment dependence analysis should be improved to remove the requirement to add temporary arrays like in the subroutine cdata sample program given above. See code in tools/flang1/flang1exe/outconv.c.

if (!XBIT(4, 0x100000))
  forall_dependency_analyze();

michalpasztamobica commented 4 years ago

After spending a while looking at how these things are working here is my current understanding of the situation.

The crucial line in forall_dependency function is L4029 containing a call to is_dependent function. The latter consists of two steps making use of a shared global dep structure:

build_dlist() which builds a list of potentially dependent names basically just adding LHS of our do loop somehow transformed into a forall (not sure where and how, but the change makes sense to me)
expr_dependent_check() which firstly recurrently goes through the AST, starting at RHS of the forall statement until it finds an actual name and secondly - compares that name against all potentially dependent names found by build_dlist().

The core logic of the criteria behind every check are explained in a comment:

      /* can 'esptr' overlap with 'dsptr'? */
      /* yes if they are the same variable,
       * esptr is a pointer and dsptr is a pointer or target of same type,
       * esptr is a target and dsptr is a pointer of same type,
       * both variables and equivalenced (handled later)
       */

I checked that the program provided by @kiranchandramohan correctly figures out we're dealing with pointers (sptr 627 and 628) and then hit this line, effectively deciding there is an overlap and returning TRUE all the way down stack to is_dependent.

This comment is asking for work, but I think we're missing some logic somewhere else. I thought we might need something like an additional wisely-applied call to dd_array_conflict(), similar to the one in L428, but actually this function also returns a value suggesting that there's a conflict between arrays. I need to investigate its logic more.

michalpasztamobica commented 3 years ago

@kiranchandramohan , would you be able to tell me how exactly to verify if the temporary array is created or not (other than with breakpoints in the code)? I don't see any change in the qdbf dumps, even if I skip the code adding the temp and see the ka** tests failing. Is there any way to view the intermediate code with the temp array inserted, like the code you pasted above?

Looking into dd_array_conflict function I noticed it counts on the AST leaves to be of type A_ID, while in our case they are A_SUBSCR and hence the return value indicates a dependency even though there is none. I am not sure if this is the right track, but it might be that the A_SUSBCR should be taken into consideration and resolved within the check somehow.

I tried to entirely skip the is_dependent check, relying on the subscr_dependent function called from scatter_dependency in forall_lhs_indirection, but this is not sufficient to correctly generate the temp for the ka** testcases.

@kiranchandramohan , any remarks or suggestions on this will be most welcome.

kiranchandramohan commented 3 years ago

Are you looking to check various intermediate representations? If so some instructions, might be obvious but just giving here for completeness. These are in order of appearance, so only the first one might matter for you.

ILM : Will have CALL instructions to f90_alloc. CALL ...... ; f90_alloc04a_i8. Use -Mq,0,1 -Mq,4,1. ILI : Will have Jump SubRoutine instructions. JSR 322~. Use -Mq,0,1 -Mq,10,2 -Mq,10,1 LLVM IR : Calls to @f90_alloc*. Use -S -emit-llvm or -save-temps. Assembly : Calls to malloc. Use -S.

michalpasztamobica commented 3 years ago

Current procedure decides that a temporary is needed quite quickly and isn't even running the full dependency analysis tools available in Flang.

These are the exact current steps:

Check that the array's types match L369.
Check that they are both pointers, L375.

And that's all that is being checked right now. We then enter this if clause and return TRUE (e.i. there is a conflict). I think we need to have this kind of additional check:

--- a/tools/flang1/flang1exe/iterat.c
+++ b/tools/flang1/flang1exe/iterat.c
@@ -432,6 +432,11 @@ name_dependent_check(int expr)
               /* flg.depchk MORE WORK HERE */
             }
           }
+          if (overlap == 2 && dep.forall_list) {
+            if (!dd_array_conflict(A_LISTG(dep.forall_list), SRC_ARRAY, SINK_ARRAY, -1)) {
+              overlap = 0;
+            }
+          }
           if (overlap > 0) {
             int i, needmsg;
             if (dep.eavl != dep.davl) {

The problem is that I can't figure out what arguments to pass as SRC_ARRAY and SINK_ARRAY. I keep running into mkSub function returning FALSE (effectively judging that the array subscripts are non-linear). I believe this is me passing incorrect parameters to the dd_array_conflict function, rather than an error in the code. I tried a few different combinations, but keep getting the same error. The function operates on AST indices, but this peace of algorithm is written using symbol SPTRs (or rather a mixture of AST and STB).

michalpasztamobica commented 3 years ago

@kiranchandramohan , I am stuck trying to dereference the pointers and pass them to dd_array_conflict(I asked on the #debugging channel on Slack, but got no answer there...).

If there is a symbol which has stb.stg_base[s].f5 (covered by POINTERG/P(s) macros) set to 1 and effectively is a pointer - how do I figure out the SPTR or the AST that this pointer points to? Unfortunately, both ASSOC_PTRG and PTR_TARGETG macros return 0 (and the corresponding -P macros are never called). I also tried sym_get_ptr but it created a new symbol instead (does not find the existing pointer). I also found a find_pointer_target function, but this seems to be intendend (as per doxygen doc) to work only for pointer assingments.

This is probably trivial for someone more experienced with flang, but I can't figure out the right function.

kiranchandramohan commented 3 years ago

Would midnum have the information you want? MIDNUM | Contains symbol table pointer (sptr) of the variable's pointer variable

michalpasztamobica commented 3 years ago

Thanks, @kiranchandramohan . Sounds like this is what I was looking for, I will try it out.

michalpasztamobica commented 3 years ago

MIDNUM worked indeed giving me: val1$p and val2$p, but these both seem to be some temporary variables (ST_VAR, SC_LOCAL and symlk=1) and not the arrays I am looking for (arr1 and arr2), so I need to search further. Then I will also need to figure out if the subscripts checked by dd_array_conflict will be the right ones.

michalpasztamobica commented 3 years ago

@kiranchandramohan , would you be able to give me another two hints?

Is there a way to map an SPTR to an AST?
If I have a function-local pointer variable (val1$p) - is there a way to find the SPTR of the array that it actually points to (arr1)?

I am trying to run the dd_array_conflict function with correct parameters. They need to be ASTs and I believe that we may have to resolve the pointers to see where they point to and pass those ASTs to the function, to make it work right. But I can't figure out these seemingly basic operations. I'll be grateful for help.

michalpasztamobica commented 3 years ago

I tried to implement a function that would search for an AST with a given SPTR assigned. There seems to be no such solution. I failed in trying to use macros to access all existing entries in astb, as I can't figure out how to tell the loop has reached the last existing element. I am now trying to use the ast_traverse_all function for this purpose. However, I am in serious doubt if despite an extensive search I haven't missed something that is already there for this purpose. If there isn't, then perhaps what I am doing is somehow against the compiler's logic... I understand AST is normally much shorter than the symbol table, so perhaps there is no way to match AST to an SPTR (alhtough it is possible the other way round)?

kiranchandramohan commented 3 years ago

Apologies, I will be able to look into this next week only and i don't know the answer from the top of my head.

Can @gklimowicz or @mleair help here?

mleair commented 3 years ago

@kiranchandramohan , would you be able to give me another two hints?

Is there a way to map an SPTR to an AST?

If I have a function-local pointer variable (val1$p) - is there a way to find the SPTR of the array that it actually points to (arr1)?

I'm not sure how much I can help here, but I'll try to answer the two questions.

Question 1: Since you are traversing ASTs, you probably can use the following check, where ast is an AST that you are checking and sptr is the SPTR you are looking for:

if (ast_is_sym(ast) && memsym_of_ast(ast) == sptr) { // found it! }

Question 2: I'm not sure how you can recognize this particular use case...

It's a hard problem since a pointer can point to an unknown entity. For example, if the pointer is a dummy argument, then you will not necessarily know what it is pointing to. Also, you can construct examples where the pointer's target is contingent on data flow (e.g., an if/then/else statement). You also cannot apply the optimization if the pointer is an argument to a procedure.

I suppose if you limit this optimization to "local" code that is not contingent on the above cases, then you might be able to find the target array looking at (I believe) A_ASN asts in your traversal. Also, keep in mind that some pointer assignments are achieved through a call to a runtime function.

-Mark

michalpasztamobica commented 3 years ago

Thanks for your help, @mleair . Ad 1) I wonder how to traverse all available ASTs? I understand they are stored in astb.astli which is basically an array of all AST nodes. I also understand that there might be multiple unconnected ASTs in that array (is this correct)? Traversing function starting with ASTLI_HEAD just goes through two ASTs : a Unary operator and an ID, which ends the recursion possibilities. I tried iterating through the whole astb.astli array but I am unable to form the end condition - I could not find the size of the array and I also see that some nodes have the next field empty. Should I be looking at A_ALIASG instead? (I infer this from reading other functions such as sym_of_ast2 but I am not sure if this is the intention of the code)...

Ad 2) Your solution sounds promising, but I will leave it until I solve point 1 if I see that the ASTs corresponding to the local arrays are not yielding the correct result when passed to dd_array_conflict().

I am pretty stuck with the current solution, so I'll try to take step back and rethink if there is any way to figure out that we don't need optimization at an earlier stage or store the ASTs and SPTRs until we reach the relevant place in code.

mleair commented 3 years ago

How are you traversing the ASTs? Did you use the ast_traverse() function? I believe it's supposed to visit ASTs at most once.

michalpasztamobica commented 3 years ago

This is how I approached the problem:

static found_ast = 0;
static LOGICAL
_fam(int ast, int *fm_p)
{
  int sptr_to_find = *fm_p;
  if (ast_is_sym(ast) && memsym_of_ast(ast) == sptr_to_find) {
    found_ast = ast;
    return TRUE;
  }
  return FALSE;
}

....

int sptr_to_find = esptr_ptr;
ast_traverse_all(ASTLI_HEAD, _fam, NULL, &sptr_to_find);

michalpasztamobica commented 3 years ago

I approached the problem from the other end and tried to find how the function arguments are resolved when function gets called. I have set up a whole lot of breakpoints and flang's flow seems to indicate that it first performs the forall analysis, deciding that the arrays overlap and only later does it enter this code, which evaluates the actual arguments passed to the function.

I do hope I got things wrong, as at first it seemed crucial to me that the generated function's code knows exactly what its input pointer arguments point to. But on a second thought - this will all be evaluated in runtime anyway, so I can imagine that the function gets all analyzed and optimized first and only later are its arguments resolved.

If I got things right then I do not see a way for the compiler to realize anything about the arrays passed as pointers to a function, as it simply cannot know the pointees. The only way forward would be to re-do the analysis once we get the pointees right. However - what if the function is called multiple times with different arguments and sometimes they overlap and sometimes they don't? This would mean we need to create the temp array on one occasion and not on another.

I am planning to check how gfortran goes about this. No other clue how to proceed.

pawosm-arm commented 3 years ago

I'm skeptical on our ability of replicating gfortran logic behind this. As using -Hx,4,0x100000 flag prevents flang from generating calls to those heap allocation functions (resulting in gfortran-like performance on a test program), it may be worth considering a solution based on what this flag imposes: for those who are (quite correctly) concerned about using flag like that for their entire project (or entire source code file), it may be worth introducing loop-wise pragma, e.g. !dir$ nodepchk that would result in making the same assumption for such annotated loop as if -Hx,4,0x100000 flag was given to flang.

michalpasztamobica commented 3 years ago

I analyzed the gcc/fortran/dependency.c file and the different functions inside it. I think the most promising one is gfc_check_dependency() which checks if arrays are dependent on each other. I was trying to check how it runs with the reproducer from Kiran, but the gfortran binary I built has only one fortran-related file: gcc/gcc-10.2.0/gcc/fortran/gfortranspec.c (I checked after loading it to gdb with list sources). I need to figure out how gfortran works, maybe it's calling some subbinaries like flang does with flang1 and flang2.

michalpasztamobica commented 3 years ago

I managed to attach gdb to the f951 subprocess which contains the dependency.c file symbols. But it didn't break neither on gfc_check_dependency() not on any other function in that file. I am trying to figure out how exactly does gfortran go about compiling this particular case.

michalpasztamobica commented 3 years ago

Regarding the gfortran trace: I failed to trigger the functions checking array dependency. I tried with modified versions of the reproduces, but I never got to any of these calls. I reached the forall resolution, but it doesn't seem to perform any dependency checks for the arrays... I admit I still haven't come to full understanding of how this exactly works.

I tried to compare memory usage between gfortran-compiled and flang-compiled binaries, but gfortran's one allocates memory in large chunks of 1kB so it is harder to tell differences. When I tried to do larger increases, the allocs in gofrtran-compiled binaries also jumped by an order of magnitude, hiding the actual memory results. I haven't found a way to control this by any compilation flags or configurations and I am reluctant to modify the sources, as this is a side track of the main issue I am working on here.

I went back to the root of the problem and tried to investigate how flang handles the dd_array_conflict() function. As I wrote earlier, we never get to see the crucial part of the dependency analysis, because the array in the reproducer is "non-linear". The mkSub() function determines this when it reaches a SUBSCR AST instead of the expected ID or CONST, which to me indicates that a multi-dimensional array is the issue. It isn't indicated anywhere clearly, but to me this means that only linear (single-dimensional) arrays could potentially be handled by the core of the algorithm.

Then I tried the same with a simple single-dimensional array, however the earlier part of the algorithm managed to figure out that there is no dependency and the core do_subscript() function got never called, not even the dd_array_conflict() which does all the preliminary checks, like the one for array being linear or not.

I am out of ideas on how to proceed with this right now.

pawosm-arm commented 3 years ago

I'm starting to suspect that the decision this compiler would have to make cannot be made at compile time. Maybe all we need is a different approach: the compiler should generate two variants of the code: with and without use of the temporaries, and the decision which branch should be taken could be made only at the execution time using the runtime information stored in the array descriptors. AFAIR those descriptors hold enough information to verify that there's no overlap requiring the use of temporaries for avoiding unexpected overwrite. The comparison logic could be added as a function to the Flang runtime library, while compiler could emit a condition based on a call to that function before the entrance to the affected loop.

michalpasztamobica commented 3 years ago

@pawosm-arm , I agree with you to much extent. However, I read what Kiran wrote above a couple times more and I came to a conclusion that he means that we should detect situations where there is no temporary needed, no matter what arguments are passed to the function. The example Kiran has given:

  do i=1,n
    val2(1:dim1,i) = val2(1:dim1,i)*val1(i)
  end do

will need no temporary even if val2 and val1 actually point to the same array. I think this is what Kiran had on mind that we should detect - a situation where we access values in a very ordered and predictable manner.

However, the issue is that this example is multidimensional array and thus - non-linear. I find it hard to estimate the effort to rewrite the whole algorithm to take this example into consideration.

pawosm-arm commented 3 years ago

Michal, what I wanted to say is that we may need to redefine the problem to make finding a solution more realistic.

michalpasztamobica commented 3 years ago

I had a very insightful call today with @pawosm-arm . Many thanks for your time and recommendations!

The dd_array_conflict() seems to be dead end - at this stage of compilation flang is unable to decide if the operations performed in the loop will require a temporary array or not. It is only able to detect simple aliasing in indices.

Flang2 would be a more likely place to find this kind of optimizations, but I did not find any explicit array optimizations in flang2. I was looking for lines containing "array" and checking if they have any traits of conflict detection or other analysis or optimization. Perhaps I missed something, but at this point this seems to be a dead end to me.

Adding an optimization pass in flang2 that would prevent the temporary array creation is hard for me to estimate. @pawosm-arm suggested it should rather be measured in manmonths.

The most likely way forward, that @pawosm-arm suggested is to check if llvm as such is able to perform this kind of "optimization pass" for flang. To do this I need to: 1) find an optimisation pass in LLVM that would accomplish what we're trying to achieve here and 2) Check the contents of LLVM IR and see if they are sufficient for this pass.

Another possibility suggested by @pawosm-arm is to look at Polly and see if this could be of any help with the task. This however also means giving up on flang.

@kiranchandramohan, @RichBarton-Arm , is the idea of looking for the optimization in LLVM rather than flang still in line with the requirements of this task?

kiranchandramohan commented 3 years ago

Thanks @michalpasztamobica for the update. I agree that this is a hard problem in general. I tried the nvidia compiler and it is also not able to optimise away the temporary array allocation. So it might be a big task to implement in flang2 as @pawosm-arm says.

But then this bug is only talking about a specific case where in the array evaluation there is total overlap between the indices of val2 on the lhs and rhs and there is a single value, val1(i), that is read from another array. In this particular case there is no dependency, as explained in the description of this ticket. Does @pawosm-arm, @michalpasztamobica, @mleair, @bryanpkc and @shivaramaarao agree? If so, then can we think of a way to fix this particular case? If not then I think we can mark this as too difficult to fix.

  do i=1,n
    val2(1:dim1,i) = val2(1:dim1,i)*val1(i)
  end do

pawosm-arm commented 3 years ago

Hi @kiranchandramohan, The code snippet you quoted above comes from the 'Background' comment you've kindly wrote above:

There is no temporary needed here since even if val1 aliases with val2, it does not affect the evaluation. The evaluation of the array assignment expression here depends only on a single value in val1, i.e. val1(i).

This is very easy to see for a human reader how the data held in val1 and val2 are accessed from this loop. Sadly, it is not so trivial from the compiler point of view at the early stage of parsing and syntax analysis (namely, at the flang1 stage). I've mentioned Polly during today's call with @michalpasztamobica as this is the framework of which I know it is capable to provide the information we need: it builds its own model of the loop-wise data accesses basing on LLVM IR. Any attempt to do so in flang1 would require looking into what the loop does with the data, in effect it would require description of this loop in form of some kind of intermediate representation. Hence I've initially suggested to shift the focus on flang2, which works on flang's intermediate representations: ILM and ILI. Trouble is, it is much easier to find any well documented examples of the optimization passes working on LLVM-IR than anything describing similar things at ILM/ILI level. Hence I've suggested a further shift to the LLVM middle-end and either implement Polly-inspired analysis+optimization pass that could prove legality of bypassing temporaries in data accesses and simplifying them (in effect making any use of temporaties a dead code). My real concern is that even with having something that more or less works (at least with our examples) we may need to spend a lot of time on proving our solution does not do any damage to any corner cases one can imagine.

michalpasztamobica commented 3 years ago

I am not as much of an expert in compilers as @pawosm-arm , but I also find it hard to write the appropriate code in flang1, that would detect the collision.

In the comments above you can see how I tried and failed to address the issue by improving the dd_array_conflict(), which is the only place I found so far dealing with such cases. The reason this is so hard to accomplish might be that the information I am looking for is simply missing and the array conflict detection function is only able to detect simple aliasing and not deeply analyze the loop and what it is doing. Perhaps the information is available, but will require some heavy adjustments. Actually, I didn't even reach the main body of the dependency analysis. In my opinion it is not designed to work in our case.

Furthermore, I found array dependency analysis code in gfortran and it looked similar to flang's and did not trigger in the case described in this issue report.

I would be glad to also hear others opinion.

Regarding the alternative ways to address the issue outside of flang, I checked available passes in LLVM and did not find any one fitting our purpose, so we would probably have to write one from scratch. In Polly, the only promising pass I found was polly-dependences, but this is actually addressing reduction dependencies, so again - we'd have to create something from scratch.

pawosm-arm commented 3 years ago

In Polly, the only promising pass I found was polly-dependences, but this is actually addressing reduction dependencies, so again - we'd have to create something from scratch.

Polly won't offer ready to use solution, but it has mechanisms that could make working on one so much easier. Namely, it works on its own internal model of data accesses occurring within the loop, which makes them so much easier to examine,

kiranchandramohan commented 3 years ago

Thanks @pawosm-arm for the reply. Agree that it is not trivial from the compiler point of view. An LLVM pass might also require substantial work and will involve teaching it about the flang memory allocation and the alias information available might just say that the reads and writes can overlap and hence the temporary cannot be removed.

@michalpasztamobica thanks for your efforts in trying to fix this. I don't have any immediate suggestions.

flang-compiler / flang

[Performance issue] Temporary array not removed #924