Open jnkbrachter opened 6 years ago
Can you produce a (preferably small :) ) code example which causes this to happen? It would be very helpful in trying to debug it.
Also, could you check (you might know this), is GAP turning the matrices into a more compressed format at some point -- that is, do they start out with IsPlistRep
on the matrix, and it's rows, being one value, and changing by the end of the algorithm?
Is this with the latest master
branch? Or with hpcgap-default
? Or with what else?
Actually, nevermind, the issue is there, in either of the two branches. It is caused by GAP's method dispatch trying to (re)compute the type of lists (and lists-of-lists, such as "matrices"), and yeah, that's not really thread safe.
If you matrices are over small finite fields, you should be able to work around this by always storing your matrices as compressed matrices (e.g. by using ImmutableMatrix
or ConvertToMatrixRep
(be very careful with that one, though, don't use it on objects after they become visible in multiple threads).
Also, the MatrixObj
project will resolve this for good (for matrices).
Of course the issue as such still will be present for all other kinds of lists. Nasty. Will have to think about this...
Perhaps @rbehrends thought about this before, though, and has remarks or even a solution (on how to fix this in the kernel, I mean).
@fingolfin yes, @jnkbrachter and I wrote @rbehrends and he told us that it would be safe to just disable these asserts in HPCGAP.
I'd still like to know which branch this is on, resp. which state of it? IS_TESTING_PLIST
doesn't exist on master
anymore, so at best this is an older version of master
, at worst it's hpcgap-default. Not that there will be any differene w.r.t to this issue.
As to @rbehrends saying it is "safe" to just disable this: I have serious doubts about this. If two threads try to compute the type of a matrix as list-of-lists, over say the integers, at the same time, and one sets TESTING
for the first entry of the outer list; then the next thread will see that, and decide that that first list is not homogenous. So in the end, the threads will use different types for the same matrix, and that can lead to all kinds of problems.
The TESTING
flag, just like the COPYING
flag, is used to traverse a recursive data structure. In HPC-GAP, this is a problem, hence COPYING
there was replaced by generic object traversal. I fear that for TESTING
, the same will be necessary.
(I sometimes kind of wish this whole family of plist TNUMs never had been introduced, so many problems would simply vanish, and I believe the optimizations it results in could have been obtained in other ways... but it's much too late for that regret now shrug)
But perhaps I am missing something -- in that case, I hope @rbehrends will clarify with a comment here.
Ookay, in that case he maybe just meant that it is safe to disable that check for @jnkbrachter s version of the parallel gaussian algorithm.
The crash happened in some old version of HPCGAP. @jnkbrachter and I will have a look at whether something like this is reproducible in the current master branch.
@rbehrends apparently the assert(!IS_TESTING_PLIST(elm));
got replaced by assert(!TEST_OBJ_FLAG(elm, TESTING));
in lines 467 and 524
How does this impact the issue?
It's been a few months (the email was from February, so I'm trying to reconstruct my thoughts here) and the code has changed since then. So, I'm not entirely sure what I'm saying is still accurate.
But basically, KTNumPlist()
returns early in GAP if it doesn't have write access (and thus does not modify the underlying objects). Thus, testing modifications to an object can only be done by at most one thread at once. (There is actually another related problem here that testing for a matrix fails if the matrix is read-only.)
Now, at the offending assert, there's some code that tests sublists regardless of access rights. This is where the problem originally happened (as I recall) because one thread was operating on the matrix and one on the row vectors. But for those tests, the assert failing was harmless.
Observed behaviour
We sometimes get an error in HPCGAP from line 445 of src/plist.c when testing the type of a matrix given as a PLIST. The line is
assert(!IS_TESTING_PLIST(elm));
Analysis
The problem seems to come from two threads testing the type of a matrix as well as the type of one of its rows at the same time which can not happen when running in sequential mode. The problem does not always occur and was observed while running an algorithm that involves a lot of matrix multiplication. A similar problem should concern line 502 of the above mentioned file.