gap-system / gap

Main development repository for GAP - Groups, Algorithms, Programming, a System for Computational Discrete Algebra
https://www.gap-system.org
GNU General Public License v2.0
811 stars 161 forks source link

HPC-GAP sometimes crashes when testing the type of a matrix #2008

Open jnkbrachter opened 6 years ago

jnkbrachter commented 6 years ago

Observed behaviour

We sometimes get an error in HPCGAP from line 445 of src/plist.c when testing the type of a matrix given as a PLIST. The line is assert(!IS_TESTING_PLIST(elm));

Analysis

The problem seems to come from two threads testing the type of a matrix as well as the type of one of its rows at the same time which can not happen when running in sequential mode. The problem does not always occur and was observed while running an algorithm that involves a lot of matrix multiplication. A similar problem should concern line 502 of the above mentioned file.

ChrisJefferson commented 6 years ago

Can you produce a (preferably small :) ) code example which causes this to happen? It would be very helpful in trying to debug it.

ChrisJefferson commented 6 years ago

Also, could you check (you might know this), is GAP turning the matrices into a more compressed format at some point -- that is, do they start out with IsPlistRep on the matrix, and it's rows, being one value, and changing by the end of the algorithm?

fingolfin commented 6 years ago

Is this with the latest master branch? Or with hpcgap-default? Or with what else?

fingolfin commented 6 years ago

Actually, nevermind, the issue is there, in either of the two branches. It is caused by GAP's method dispatch trying to (re)compute the type of lists (and lists-of-lists, such as "matrices"), and yeah, that's not really thread safe.

If you matrices are over small finite fields, you should be able to work around this by always storing your matrices as compressed matrices (e.g. by using ImmutableMatrix or ConvertToMatrixRep (be very careful with that one, though, don't use it on objects after they become visible in multiple threads). Also, the MatrixObj project will resolve this for good (for matrices).

Of course the issue as such still will be present for all other kinds of lists. Nasty. Will have to think about this...

Perhaps @rbehrends thought about this before, though, and has remarks or even a solution (on how to fix this in the kernel, I mean).

ssiccha commented 6 years ago

@fingolfin yes, @jnkbrachter and I wrote @rbehrends and he told us that it would be safe to just disable these asserts in HPCGAP.

fingolfin commented 6 years ago

I'd still like to know which branch this is on, resp. which state of it? IS_TESTING_PLIST doesn't exist on master anymore, so at best this is an older version of master, at worst it's hpcgap-default. Not that there will be any differene w.r.t to this issue.

As to @rbehrends saying it is "safe" to just disable this: I have serious doubts about this. If two threads try to compute the type of a matrix as list-of-lists, over say the integers, at the same time, and one sets TESTING for the first entry of the outer list; then the next thread will see that, and decide that that first list is not homogenous. So in the end, the threads will use different types for the same matrix, and that can lead to all kinds of problems.

The TESTING flag, just like the COPYING flag, is used to traverse a recursive data structure. In HPC-GAP, this is a problem, hence COPYING there was replaced by generic object traversal. I fear that for TESTING, the same will be necessary.

(I sometimes kind of wish this whole family of plist TNUMs never had been introduced, so many problems would simply vanish, and I believe the optimizations it results in could have been obtained in other ways... but it's much too late for that regret now shrug)

fingolfin commented 6 years ago

But perhaps I am missing something -- in that case, I hope @rbehrends will clarify with a comment here.

ssiccha commented 6 years ago

Ookay, in that case he maybe just meant that it is safe to disable that check for @jnkbrachter s version of the parallel gaussian algorithm.

ssiccha commented 6 years ago

The crash happened in some old version of HPCGAP. @jnkbrachter and I will have a look at whether something like this is reproducible in the current master branch.

ssiccha commented 6 years ago

@rbehrends apparently the assert(!IS_TESTING_PLIST(elm)); got replaced by assert(!TEST_OBJ_FLAG(elm, TESTING)); in lines 467 and 524 How does this impact the issue?

rbehrends commented 6 years ago

It's been a few months (the email was from February, so I'm trying to reconstruct my thoughts here) and the code has changed since then. So, I'm not entirely sure what I'm saying is still accurate.

But basically, KTNumPlist() returns early in GAP if it doesn't have write access (and thus does not modify the underlying objects). Thus, testing modifications to an object can only be done by at most one thread at once. (There is actually another related problem here that testing for a matrix fails if the matrix is read-only.)

Now, at the offending assert, there's some code that tests sublists regardless of access rights. This is where the problem originally happened (as I recall) because one thread was operating on the matrix and one on the row vectors. But for those tests, the assert failing was harmless.