flintlib / flint

FLINT (Fast Library for Number Theory)
http://www.flintlib.org
GNU Lesser General Public License v3.0
436 stars 243 forks source link

flint 2.6.0 failing on architectures it didn't before #771

Closed SnarkBoojum closed 4 years ago

SnarkBoojum commented 4 years ago

Flint used to compile perfectly on all major Debian architectures ; it's now failing on four of them; see the build logs for details.

wbhart commented 4 years ago

It looks like the two arm machines abort in fmpz_factor in the t-pollard_brent test, though no failure message is printed, which is odd.

On i386 it simply hangs in one of the fq_nmod_mpoly tests.

The mipsel one seems to be a bus error in t-div in fmpq_mpoly.

Unfortunately I haven't a clue about any of them at present. We'll have to start by valgrinding them all and seeing if we can get some clues.

@tthsqe12 will want to see these too.

I am on annual leave for about another week. I will look at these again when I return to work.

SnarkBoojum commented 4 years ago

There's no real urgency ; enjoy your vacations!

fingolfin commented 4 years ago

It looks as if all failing architectures are 32bit, all passing are 64bit. Maybe that gives a clue.

In this context, it would probably make sense to add a 32bit build to the Travis CI setup. See here https://github.com/gap-system/gap/blob/9a513e02d646145a5393cba0fa74a4cde970d65b/.travis.yml#L43 for how we do that in GAP (besides installing the relevant debian 32bit packages, one has to instruct the build system to build 32bit binaries, e.g. by adding -m32 to CFLAGS.

tthsqe12 commented 4 years ago

It works on arm64! Never tested there...

wbhart commented 4 years ago

We have 32 bit tests on Appveyor. They all pass there. This doesn't give me any clues unfortunately. It's honestly not surprising that some of these weird and wonderful arches fail. The main failure which bothers me is the pollard-brent test, as I see no reason for it to fail. The other stuff probably fails because it is MIPS and ARM which we definitely don't support currently. On ARM we need some special library to make the mpoly stuff correct. @tthsqe12 probably can explain this better than me.

There's been a project to port to MIPS for a long time, but it never got completed. To be honest it is remarkable it passed on so many arches. We weren't expecting that. Given there are so few failures we might be able to push another one or two arches through, though the only one I see actually being worthwhile is x86 which should under no circumstances be failing.

fingolfin commented 4 years ago

We have 32 bit tests on Appveyor.

Hmm, I don't see them in your travis config. In https://github.com/wbhart/flint2/blob/trunk/.build_dependencies you use an env var ABI but that is never set anywhere I could find?

wbhart commented 4 years ago

Why would the travis config have the Appveyor information?

fingolfin commented 4 years ago

Sorry, I somehow had a braino and failed to process "AppVeyor". Argh.

(I still think it would be useful to also add 32bit tests on Travis, though, as that would be on a Ubuntu VM, and so much closer to the systems where the issues @SnarkBoojum reports appear.)

wbhart commented 4 years ago

I'll add them temporarily some time next week. But the tests already take too long to run, so I'm not duplicating 32 bit builds (which generally shouldn't cause any special problems) across two testing platforms. 32 bit is far less important than it used to be.

SnarkBoojum commented 4 years ago

Is there something I can do about this issue?

wbhart commented 4 years ago

Not really, unless there's a way to give us access to the machines in question. I'm back to work tomorrow and this will have priority.

tthsqe12 commented 4 years ago

Just for clarification, the code that will surely fail on ARM is supposed to be disabled in the config. Thus, any fails are unexpected. Unless flint is upgraded to approximately C11, it is not possible to express within the language the appropriate memory barriers. Thus, the disabled mpoly code will always be disabled on ARM for the foreseeable future (and the test code should still pass).

wbhart commented 4 years ago

It turns out that this is going to be a minor problem. We will fairly soon have to fix this, I believe. I have let @fieker know this is going to be an important issue for Oscar in the fairly near future, though I'm pretty sure I got the details wrong. It would be helpful to discuss this along with the other Flint issues that need to be resolved @tthsqe12 Can you tell me when you are available today for a meeting?

wbhart commented 4 years ago

I enable ABI=32 on linux temporarily on Travis (I think). Let's see what happens.

I can confirm on Appveyor it uses an x86_64 machine with gcc (presumably mingw), so that is quite different, though should technically be harder to get to pass!

wbhart commented 4 years ago

Unfortunately MPFR doesn't seem to build. It complains:

error: GMP_NUMB_BITS and sizeof(mp_limb_t) are not consistent.

This doesn't make any sense, as MPFR is supposed to get its configuration options from GMP, and I can see both -m32 and ABI=32 appear there. It certainly is building a 32 bit MPIR because I had to install gcc-multilib to make it work, otherwise the 32 bit headers are missing and MPIR doesn't compile.

This used to work, but something has apparently broken it. I have no idea what. But it's nothing to do with the Flint library itself (and apparently doesn't seem to be a problem on debian).

https://api.travis-ci.org/v3/job/704055457/log.txt

If anyone has any ideas, let me know. The file building MPIR and MPFR is here:

https://github.com/wbhart/flint2/blob/trunk/.build_dependencies

I also tried MPIR-3.0.0, but the problem is the same.

https://github.com/wbhart/flint2/blob/flint-2.6/.build_dependencies

https://api.travis-ci.org/v3/job/704070762/log.txt

This all used to work, but no longer does. I've no idea why not.

wbhart commented 4 years ago

I tried to do this manually on my machine, but x86 ELF format is not supported in the WSL. So that was two hours of wasted time.

wbhart commented 4 years ago

None of the other machines I have access to have Ubuntu/Debian and gcc-multilib installed. So it's simply not possible for me to test on x86 32 bit. I'm going to have to give up as I've spent nearly two days trying.

Until someone can give me access to the machines involved, there's nothing more I can do.

wbhart commented 4 years ago

I finally managed to coerce Travis into building a x86 Flint. Some issue with MPFR is preventing it from getting its configuration from either GMP or MPIR so I had to force it to use -m32 by setting CC.

Anyhow, Flint passes its test suite on x86 just fine (this is Ubuntu Xenial). So I don't have any idea what could be going on on Debian. The build and tests passing seems to be reproducible. I'm wondering if the x86 problem on Debian is.

What GMP and MPFR are being used? How is Flint configured? Is the build being done from a clean checkout, or is there some cache that could be dirty?

wbhart commented 4 years ago

Here is the x86 build on Travis.

https://travis-ci.org/github/wbhart/flint2/jobs/704533956

SnarkBoojum commented 4 years ago

Debian build systems start from scratch, with GMP 6.2.0, MPFR 4.0.2 and NTL 11.4.3.

The configure line is in the log:

./configure --prefix="/usr" --with-ntl CC='i686-linux-gnu-gcc' CXX='i686-linux-gnu-g++' CFLAGS='-Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security'
SnarkBoojum commented 4 years ago

I tried to build on armhf by hand:

pollard_brent_single....FAIL : Pollard Rho failed too many times (106 times)
wbhart commented 4 years ago

One possible difference is you are probably not setting the flint_test_multiplier to 1 like us. I'll try changing that back to 10 next and see if I can replicate that way.

The pollard_brent failure looks like it might be acceptable. I'll look into that. We might just need to change the number of allowed failures. It's probably currently set to 100. So that would be an easy fix.

I'll also switch to MPFR 4.0.2 but I'm sure that isn't relevant.

Thanks for trying that out.

Any chance of retrieving messages from any of the other arches? It's hard to debug without clues.

SnarkBoojum commented 4 years ago

I'm currently running "make -k check > log 2>&1" on an armhf box to see what the other failures on that platform are.

I'll try on others.

SnarkBoojum commented 4 years ago

I'm not setting flint_test_multiplier to anything different than what is in the source... though I'm wondering why it's a function returning an int and not a #define.

wbhart commented 4 years ago

I would expect the failure on armhf to be the same.

If flint_test_multiplier is 10 then pollard_rho is failing 106 times out of 400 instead of a maximum allowed of 20. That does seem like a problem, especially given it's not happening on other 32 bit arches. Also, the behaviour shouldn't be that different on 32 and 64 bits anyway. Technically it will call the n_pollard_rho function more often on 64 bits. But that wouldn't explain why we don't see the same problem on other 32 bit machines.

Is there anything unusual about sizeof int, long or pointers on 32 bit arm that I could look for, as say compared with x86 32 bits?

SnarkBoojum commented 4 years ago

On armhf, pollard_brent_single is the only failure ; I'm recompiling with flint_test_multiplier set to 1, then I'll re-run the tests. I have started compiling on armel and hppa (with flint_test_multiplier set to the default 10).

I'll report when I have more. I have no clue what 32bit ARM can have now that it didn't with last version...

wbhart commented 4 years ago

I'm sure ARM hasn't changed, but our pollard_rho has. I'm looking for clues as to what features of ARM might conflict with the changes I made (which were all valid bug fixes as far as I am aware).

wbhart commented 4 years ago

I've likewise started a run with flint_test_multiplier = 10 on x86. I don't know why they chose to make this a function rather than a #define, but I assume they had a reason.

fredrik-johansson commented 4 years ago

The reason is that you can change the multiplier without having to recompile every single test program.

wbhart commented 4 years ago

So the results of an flint_test_multiplier=10 on x86 were interesting. The pollard_rho passes. So that indicates an issue with ARM rather than 32 bits per se. It would be interesting to revert commit aad2863 (just two lines) and see how many failures are being caused by that.

However, on x86 the increased test multiplier causes a hang in fq_nmod_mpoly (I don't know which function as the tests are run in parallel). At least @tthsqe12 and I can look into that on travis.

wbhart commented 4 years ago

The hang is in fq_nmod_mpoly_divrem_monagan_pearce.

https://travis-ci.org/github/wbhart/flint2/jobs/704750064

I will put some traces in to narrow it down further.

wbhart commented 4 years ago

The hang is in the fifth section of test code: / Check aliasing of remainder with first argument /

I'll try to get the actual polynomials.

wbhart commented 4 years ago

It's not the actual divrem_monagan_pearce function that is hanging. It seems to be the test code itself that is hanging. So that should be easy to fix (unless it is due to some prior memory corruption in the actual divrem call, which returns without problems).

wbhart commented 4 years ago

The hang is in the do...while loop at approximately line 290 in t-divrem_monagan_pearce in fq_nmod_mpoly. That should be enough for @tthsqe12 to come up with a fix that he's happy with.

One bug "down" two more to go.

SnarkBoojum commented 4 years ago

I'm not finished running the armhf tests with flint_test_multiplier returning 1, and mipsel and hppa are still compiling... I'll only be able to report tomorrow.

wbhart commented 4 years ago

@SnarkBoojum What would be the easiest way to try reverting patch aad2863 ? If I tag a branch with that patch reverted, would that be sufficient to not create too much work for you? It would definitely be interesting to see if that is causing any of the failures we see.

tthsqe12 commented 4 years ago

Yes. That is an easy bug to fix in the test code. The funny thing that must be happening is: the length is even, the exponents are fixed at zero, and Fq is F2. So, g is always zero.

SnarkBoojum commented 4 years ago

If you have a dirty patch, I can test it by hand on a few platforms (search "porterbox" purpose here) ; if you have a clean patch, I can even upload a 2.6.0-3 in Debian to replace the failing 2.6.0-2, and the resulting package will get built everywhere

wbhart commented 4 years ago

The dirty patch is probably sufficient. Just change the two mp_limb_t's to unsigned int in flint.h as shown (in reverse) here:

https://github.com/wbhart/flint2/commit/aad28633f1bf88c47436fdc3ab9e79220829b13a

There's no need to test it on arches which passed or on the one where fq_nmod_mpoly was hanging (that's unrelated for certain).

We do know this patch is causing problems, though I don't completely understand why just yet.

tthsqe12 commented 4 years ago

I do not see a point in testing a change unsigned int <-> mp_limb_t on a system where these are the same type. These are the exact same type here, right?

wbhart commented 4 years ago

I have no idea. How big are int, long and void* on these two ARM ABIs?

SnarkBoojum commented 4 years ago

With flint_test_multiplier returning 1, armhf fails on the same test with Pollard Rho failed too many times (10 times)

SnarkBoojum commented 4 years ago

(sizeof(int), sizeof(long), sizeof(void*)) is (4,4,4) on mipsel, on hppa and on armhf.

SnarkBoojum commented 4 years ago

For mipsel, gdb has the following to say about the t-div issue:

Program received signal SIGBUS, Bus error.
0x77d70b28 in fmpz_mpoly_quasidivrem_heap (scale=<optimized out>, q=<optimized out>, r=<optimized out>, poly2=0x7fff4478, poly3=<optimized out>, ctx=<optimized out>) at quasidivrem_heap.c:868
868           exp2 = (ulong *) flint_malloc(N*poly2->length*sizeof(ulong));

and I could check that N=12 and poly2->length=19.

SnarkBoojum commented 4 years ago

I put flint_test_multiplier back to 10 and applied the dirty patch on armhf ; I'll report when it will have finished recompiling & running tests.

SnarkBoojum commented 4 years ago

On hppa, running make check seems to get stuck in the divrem_monagan_pearce item.

wbhart commented 4 years ago

That's a lot of great info, thanks. @tthsqe12 is right that if sizeof is 4, 4, 4 on these machines, that dirty patch won't do anything. It was only a hope for systems where they are not all the same.

I will simply have to look more carefully at the pollard_rho code and hope something jumps out at me. It definitely shouldn't fail so often. As these are 32 bit machines that means the n_pollard_rho code is probably being called less often and may be the source of the failures. But as I say, I don't have a clue why ARM should break this code at this stage.

SnarkBoojum commented 4 years ago

I can sponsor someone for a guest account on the armhf box (abel.debian.org), here is the procedure.

SnarkBoojum commented 4 years ago

I think I'll open different bug report : not all failing platforms have the same issue.

SnarkBoojum commented 4 years ago

I opened #784 for HPPA, #785 for MIPSEL and #786 for ARMHF.