Closed pawosm-arm closed 3 years ago
@mleair considering this code originates from the initial commit, I think only you may know the answer...
@pawosm-arm You might be right. Can you provide a test case that goes into an infinite loop?
Hi @mleair, unfortunately, I don't have a test case or a minimal reproducer. Trouble is, this bug did not manifest itself for decades until I tried to compile latest git master of cp2k (a code newer than the latest cp2k official release) where the problem revealed itself on a significantly large file which imported some number of modules, therefore it had a potential to put a strain on any name searching function. With a problem of that size, it is hard to track what was the unfortunate sequence of searches which resulted in entering this infinite loop.
@pawosm-arm I am not too familar with this code in the compiler. It's obviously doing some sort of search on a hash value. Maybe there is a bug in the hash function? Or maybe the bug is really somewhere else in the compiler and your change masks the real bug. I suggest you find the problematic code in cp2k. How many source files are there? Maybe you can find the problematic source file using an interative process where you compile all but one file with this change. Then test the application. If the test passes, then move onto the next file until you get a failure. Once you find the file, you could instrument your compiler to apply the change to certain functions. Or if there are not very many functions in the problematic source file, you probably can use a debugger.
@mleair I've tried to investigate this a little more, still no luck. Responding to your questions:
How many source files are there?
A lot. Really lot. And they are huge. With crazy dependencies: the largest of the files import hundreds of modules which also import tens of modules and this is a recursive process. All those places define thousands of names. Bottom line: this can really expose any flaw in any name finding routine.
Maybe you can find the problematic source file
I've prepared a following change in flang2
:
diff --git a/tools/flang2/flang2exe/llassem.cpp b/tools/flang2/flang2exe/llassem.cpp
index 50343c05..c8ddce68 100644
--- a/tools/flang2/flang2exe/llassem.cpp
+++ b/tools/flang2/flang2exe/llassem.cpp
@@ -5442,18 +5442,22 @@ find_funcptr_name(SPTR sptr)
/* Key */
sprintf(sptrnm, "%s_%d", get_llvm_name(sptr), sptr); /* Local name */
len = strlen(sptrnm);
+ fprintf(stderr, "[%s] len %d\n", sptrnm, len);
hashval = name_to_hash(sptrnm, len);
for (gblsym = fptr_local.hashtb[hashval]; gblsym;
gblsym = FPTR_HASHLK(gblsym)) {
np = sptrnm;
sp = FPTR_NAME(gblsym);
+ fprintf(stderr, "FPTR_NAME[%s], gblsym = %d, hash = %d, sp[%s], np[%s], sptrnm[%s], np - sptrnm = %d, len = %d\n", FPTR_NAME(gblsym), gblsym, FPTR_HASHLK(gblsym), sp, np, sptrnm, np - sptrnm, len);
do {
if (*np++ != *sp++)
goto Continue;
} while (*sp);
- if (np - sptrnm != len)
- continue;
+ if (np - sptrnm != len) {
+ fprintf(stderr, "stalled: FPTR_NAME[%s], gblsym = %d, hash = %d, sp[%s], np[%s], sptrnm[%s], np - sptrnm = %d, len = %d\n", FPTR_NAME(gblsym), gblsym, FPTR_HASHLK(gblsym), sp, np, sptrnm, np - sptrnm, len);
+ exit(-1);
+ }
goto Found;
Continue:
if (gblsym == FPTR_HASHLK(gblsym))
On the trouble making source file it shows the following:
[timestop_interface__3766] len 24
FPTR_NAME[timestop_interface__376], gblsym = 5, hash = 5, sp[timestop_interface__376], np[timestop_interface__3766], sptrnm[timestop_interface__3766], np - sptrnm = 0, len = 24
stalled: FPTR_NAME[timestop_interface__376], gblsym = 5, hash = 5, sp[], np[6], sptrnm[timestop_interface__3766], np - sptrnm = 23, len = 24
So the infinite loop (the unfortunate continue
branch) was entered due to a comparison between "timestop_interface__3766" and "timestop_interface__376" (one is a substring of the other). This is an extremely rare situation. Is there a requirement that for cases like this the gblsym
never equals FPTR_HASHLK(gblsym)
? If yes, then this continue
statement (instead of goto Continue
) makes sense and the problem is really somewhere else.
Unfortunately, I don't know how to fabricate a minimalistic reproducer: this situation is very fragile to changes in the code. About a week ago a change was introduced in the cp2k git repo. Not in the affected source file. Not even in the file where timestop_interface
is defined. Not even in the module which the affected source file imports which imports the module where timestop_interface
is defined (yes, this is not directly imported from the affected source file!), the change was introduced in some other module imported by some module imported by the affected source file. And it was sufficient to make this unfortunate numeric id's to be generated differently and the clash between 3766 and 376 has gone away. From one side I'm happy: I can build cp2k now, yet this is so fragile it may break again with the new commits the cp2k developers are preparing for the future. I need some advice where to look now...
Unless I'm missing somthing, it looks like the string and substring have the same hash value.
The FPTR_HASHLK field is set in the function llvm_funcptr_store() in the same file. It's set by a call to name_to_hash() which contains the hash function. I suspect you stumbled on an edge case where a string and a substring receive the same hash value. Maybe the hash function in name_to_hash() can be modified to give a unique value in this case. Maybe a change to AG_HASHSZ is all you need. If you want to try adjusting it, you'll want to use another prime number (it's currently set to 19).
Edit: So, it looks like find_funcptr_name() is called by llvm_funcptr_store(). Not sure if that matters. But my point is, I suggest that you look a little closer at the hash function. Also, I'm not sure if there is a mechanism for handling hash function collisions (i.e., when two strings map to the same hash value). If collisions are handled, then the fix may be to check the other buckets associated with the hash value (i.e., in a way similar to the HASHLKG macro in flang1).
Changing AG_HASHSZ
won't help here. I've extracted this name_to_hash()
function and have called it with some values. The problem became apparent as soon as I called it for a string longer than just one character, where the last character was repeated:
str[a]: hash size: 19 hash: 12
str[aa]: hash size: 19 hash: 11
str[aaa]: hash size: 19 hash: 11
str[aaaaaa]: hash size: 19 hash: 11
str[laaaaa]: hash size: 19 hash: 9
str[laaaaaa]: hash size: 19 hash: 9
str[a]: hash size: 997 hash: 217
str[aa]: hash size: 997 hash: 124
str[aaa]: hash size: 997 hash: 124
str[aaaaaaaa]: hash size: 997 hash: 124
str[laaaaa]: hash size: 997 hash: 189
str[laaaaaaa]: hash size: 997 hash: 189
I think different hashing function is needed.
There's no such thing as a perfect hash function in this case. Although, there may be better hash functions for what's being hashed. Maybe there is a better hash function in this case. However, if there is no mechanism for dealing with collisions, such as a list of strings associated with each hash value, then perhaps someone can add it. I'm not very familiar with this code in flang2, but after a quick look at it, I didn't see any mechanism for handling collisions. It seems to assume one string per hash value.
You can see how we handle the symbol table hash values in flang1 (i.e., with the HASHLKG macro) for an example where we have hash values that map to more than one symbol.
Hi @mleair, I've looked more carefully into llassem.cpp
file and found that collisions are indeed handled, but due to a terrible mistake this thing couldn't really work. The code behind the Continue:
label must have been added by someone who faced infinite cycle in the processed linked list. Instead of looking for the origin of this infinite cycle, someone just silenced the problem by putting return 0;
. I've compared other places where hash collisions are handled in flang, and have managed to track down a critical difference. Eventually, I've came up with a following patch:
diff --git a/tools/flang2/flang2exe/llassem.cpp b/tools/flang2/flang2exe/llassem.cpp
index 50343c05..479f3360 100644
--- a/tools/flang2/flang2exe/llassem.cpp
+++ b/tools/flang2/flang2exe/llassem.cpp
@@ -5453,11 +5453,11 @@ find_funcptr_name(SPTR sptr)
goto Continue;
} while (*sp);
if (np - sptrnm != len)
- continue;
+ goto Continue;
goto Found;
Continue:
if (gblsym == FPTR_HASHLK(gblsym))
- return 0;
+ error((error_code_t)0, ERR_Fatal, 0, "Broken hash link on:", sptrnm);
}
return 0;
@@ -5511,8 +5511,8 @@ llvm_funcptr_store(SPTR sptr, char *ag_name)
sprintf(sptrnm, "%s_%d", get_llvm_name(sptr), sptr);
hashval = name_to_hash(sptrnm, strlen(sptrnm));
- fptr_local.hashtb[hashval] = gblsym;
FPTR_HASHLK(gblsym) = fptr_local.hashtb[hashval];
+ fptr_local.hashtb[hashval] = gblsym;
FPTR_SYMLK(gblsym) = ptr_local;
nmptr = add_ag_fptr_name(sptrnm); /* fnptr_local key */
FPTR_NMPTR(gblsym) = nmptr;
The first half of this patch turns silencing of the problem into exposing it explicitly. The second half is the essence.
Basically, handling fptr_local
is the only place where those two assignment operations were swapped over. It resulted in situation as follows:
B = C
A = B
...resulting in:
A == C
B == C
...while all other places are done like that:
A = B
B = C
...resulting in:
A == old B
B == C
After applying this patch, infinite loop has gone.
Looks good. Good job tracking down this bug. Yeah, I had a hard time believing that collisions were not handled by this hashing mechanism. However, I am not as familiar with this part of the compiler as the other parts, so I could not say for sure if (or how) collisions were handled. Glad you were able to track down the bug!
My only comment is that I suggest you call the interr() function instead of calling error(). The error you are reporting is an internal compiler error. We provide interr() for reporting internal compiler errors. The error() function is intended for user code errors.
Thanks for your reply. I'll make suggested adjustments and prepare PR with this patch later this week.
Fixed.
https://github.com/flang-compiler/flang/blob/93179b56533fcf50c0687f2d59cfddaf44068be2/tools/flang2/flang2exe/llassem.cpp#L5503 Shouldn't it be
goto Continue;
instead of justcontinue;
? I've encountered a case where(np - sptrnm != len)
and(gblsym == FPTR_HASHLK(gblsym))
, the former condition is never checked due to thiscontinue
and therefore it runs into an infinite loop. It would be checked if it was replaced bygoto Continue
.