PDLPorters / pdl

Scientific computing with Perl
http://pdl.perl.org
Other
89 stars 44 forks source link

Fails to build on various architectures #369

Closed sebastic closed 2 years ago

sebastic commented 2 years ago

The Debian package for PDL 2.066 fails to build on various architectures, see:

https://buildd.debian.org/status/package.php?p=pdl&suite=experimental

mohawk2 commented 2 years ago

Thank you! We'll take a look.

mohawk2 commented 2 years ago

Is there any way I could get a login to machines with these architectures, so I could run gdb or similar to get a stack trace? The i386, armel, armhf Slatec failures look like they might be SEGVs, for a start.

sebastic commented 2 years ago

Not easily, you'd need a sponsor for a guest account on the Debian porterboxes:

https://dsa.debian.org/doc/guest-account/

i386 you can easily test in a chroot or VM on an amd64 system.

arm64 and a few other architectures also have reasonably good emulation in QEMU, e.g;

https://wiki.debian.org/Arm64Qemu

Amazon may also still have the free tier for their arm64 VMs.

mohawk2 commented 2 years ago

Thanks, that's a good start!

A common theme, including on ppc64, is:

Badvalue is set to 0 or 1. This will cause data loss when using badvalues for comparison operators. at ops.pd line 137.
    PDL::Ops::__ANON__(PDL=SCALAR(0x1002b456840), 0, "") called at t/bool.t line 23

Despite the large number of failures and diverse architectures, my feeling (or at least hope) is this is a small number of errors.

mohawk2 commented 2 years ago

Also, you wouldn't happen to have any insight into how to get any of these environments (at all) going with GitHub actions? I'd really like to catch this stuff in our CI. I'd also like to routinely run lintian to catch those errors sooner too.

By the way, is there any value in PDL's current debian directory existing? I've excluded it now from our MANIFEST since I saw one of your patches does that, but I want it to be relevant / helpful, or I'll remove it entirely (or replace it if you'd like to have that as the canonical location for Debian stuff, which means we could CI it with every commit - do other projects do that?).

mohawk2 commented 2 years ago

On that topic, I saw ironically that you recently dropped proj-bin as a dep. With the new Proj v6+ support, it would be effective to add them back, have Alien::proj as a purely build-only dep, and get the Proj4 modules in Debian PDL again?

sebastic commented 2 years ago

Despite the large number of failures and diverse architectures, my feeling (or at least hope) is this is a small number of errors.

Agreed. If you start by fixing the issues on i386 you likely have fixed those issues on the other 32bit architectures too.

Also, you wouldn't happen to have any insight into how to get any of these environments (at all) going with GitHub actions? I'd really like to catch this stuff in our CI.

I have no experience with GitHub actions, a little searching the docs finds me this which might be a good start:

https://github.com/marketplace/actions/run-on-architecture

I'd also like to routinely run lintian to catch those errors sooner too.

lintian only makes sense for Debian package builds, you could implement some of its tests in CI though.

By the way, is there any value in PDL's current debian directory existing? I've excluded it now from our MANIFEST since I saw one of your patches does that, but I want it to be relevant / helpful, or I'll remove it entirely (or replace it if you'd like to have that as the canonical location for Debian stuff, which means we could CI it with every commit - do other projects do that?).

If you want to build Debian packages yourself from the PDL source tree, there is still value keeping the directory.

It is not used for the package in Debian, we maintain our own packaging.

Since we have control over the infrastructure like Salsa, I will always maintain packaging there. I'm not an upstream developer, or even a user, which makes maintaining the packaging in the upstream repo make little sense.

If you get more involved with packaging PDL and its dependencies in Debian, you are in a much better position to maintain the upstream packaging.

On that topic, I saw ironically that you recently dropped proj-bin as a dep. With the new Proj v6+ support, it would be effective to add them back, have Alien::proj as a purely build-only dep, and get the Proj4 modules in Debian PDL again?

Alien::proj is not packaged for Debian, and since I don't actually use PDL and only keep the packaging on life support, I'm not willing to package it. This is why we had the pkg-config discussion to get the build flags in the other issue, it would remove the need for a new dependency to get packaged.

If I can build PDL with PROJ support by just reverting the removal the libproj-dev & proj-bin (build) dependencies I will do so. Based on the discussion I expected the pkg-config support to be available in PDL 2.064 and planned to revert the removal, but because it still requires Alien::proj I left it as-is.

mohawk2 commented 2 years ago

At least arm64 has char be unsigned. I've now specified SByte to be explicitly unsigned char, and released that as 2.068 - please let me know if that makes a difference. I believe that will pass at least on arm64, and possibly others. I haven't yet properly investigated the Slatec fails, which are next.

sebastic commented 2 years ago

Regarding Slatec, be aware of these changes:

https://salsa.debian.org/perl-team/modules/packages/pdl/-/blob/master/debian/patches/slatec_default_integer_8.patch

mohawk2 commented 2 years ago

Thanks!

sebastic commented 2 years ago

I believe that will pass at least on arm64, and possibly others.

Only arm64 so far:

https://buildd.debian.org/status/package.php?p=pdl&suite=experimental

mohawk2 commented 2 years ago

I've updated Slatec and Minuit to be (hopefully) 64-bit aware, and will release that in due course.

Meanwhile, the FITS failures all show:

substr outside of string at /<<PKGBUILDDIR>>/IO/FITS/../../blib/lib/PDL/IO/FITS.pm line 895.

which is code that uses the get_dataref mechanism to read directly into ndarray's data using substr. This implies the sizes of ndarrays might be wrong, which might be due to them being of an unexpectedly small type.

mohawk2 commented 2 years ago

This was in fact a small number of problems:

I'm leaving this issue open because we haven't yet formally confirmed that this actually is all the problems yet, but any remaining ones will hopefully be small and easily solved.

Lessons learned along the way:

mohawk2 commented 2 years ago

Packaging notes for myself:

PDL::FAQ will want a review of its packaging / installation notes. Also, do we actually really need to dep on Term::ReadKey at all?

mohawk2 commented 2 years ago

I did some extra checks for Slatec, using these techniques; first I replaced all the individual files in Libtmp/Slatec/Makefile with:

FFLAGS=-fPIC -std=legacy -cpp

slatec/bigun$(OBJ_EXT): slatec/bigun.f
        gfortran -c  $(FFLAGS)  -o slatec/bigun$(OBJ_EXT)   slatec/bigun.f

Then iterated to find further problems and fix them using:

(cd Libtmp/Slatec/;
  (for x in slatec/[c-z]*.f; do echo "#line 1 \"$x\""; cat $x; done) >bigun.f; mv bigun.f slatec/;
  time make test
) |&less

which lets the compiler see all the Fortran at once and so detect problems, while with the -cpp and #line directives still show the right file and line numbers. Interestingly, because there are 112 files there, it takes something like 2s to compile them individually, but only about 1.1s to compile the concatenated file.

sebastic commented 2 years ago

I'm leaving this issue open because we haven't yet formally confirmed that this actually is all the problems yet, but any remaining ones will hopefully be small and easily solved.

With the changes in 2.068_02 the build failure is fixed on ppc64el & s390x.

It still fails on armel, armhf, and i386.

mohawk2 commented 2 years ago

The most recent build shows failures for 32-bit (next one should be fine), but also GSL::LINALG failures on at least sparc64, ppc64; specifically the LU_det returns 0 rather than the expected roughly 0.07+1e-9i. No idea why.

mohawk2 commented 2 years ago

The latest problem is that t/transform.t fails on 32-bit, on only one test. So far this is down to the way whichND uses index for slices to update columns of the resulting coordinates affecting more than just the indicated column, but only on 32-bit!

sebastic commented 2 years ago

While the package now builds successfully on all release architectures, it only does so because PDL::GSL::LINALG is disabled on mips64el/mipsel, powerpc/ppc64, and sparc64.

To truly resolve this issue that should be fixed too.

mohawk2 commented 2 years ago

That module is disabled on those platforms due to a probable compiler bug which (as you know) has been reported. Nothing at all can be done in this library to fix that. @sebastic Do you agree, and if so could you close this? As you can see in the linked comment above, this thing staying open is causing people to get an inaccurate picture of this library's stability.

sebastic commented 2 years ago

Nothing has changed since my last comment, so I don't think this issue can be closed.

If you're unable or unwilling to truly resolve the issue on those architectures, please state so explicitly to inform any possible users of those architectures that they can't expect a fully functional PDL like on more popular architectures.

It's quite possible that there are zero actual users of PDL on those architectures, so it's quite acceptable to not spend more time on them.

mohawk2 commented 2 years ago

When you say "explicitly stated", where do you have in mind? It certainly seems valid to move the patch from the Debian packaging into main PDL, since it's in no way specific to Debian.

sebastic commented 2 years ago

Bare minimum would be to state that in the comment with which this issue is closed.

Upstreaming the changes from the Debian package is an option, although I'm not aware of any distros other than the BSDs also supporting those architectures.

mohawk2 commented 2 years ago

I didn't know about other platforms supporting those, though it's great they do. I feel like upstreaming that patch would ironically be bad by hiding information (and I don't know for sure that clang breaks in this way) - hopefully they'd report an issue.

I'll thank you again for the huge effort you've made in helping us fix the real problems this library had on those various architectures!

As of this writing, PDL::GSL::LINALG does not work on Debian on mips64el/mipsel, powerpc/ppc64, and sparc64 - it is believed this is due to an actual compiler bug. Therefore this issue is being closed. Please report any difficulties on here (since I recently learned non-maintainers can't reopen issues) and I'll reopen it, else please open a new one.

eli-schwartz commented 5 months ago

@sebastic you may want to check out #468 -- it contains other 64bit fixes, this time exposed by LTO passes (which are very good at tickling the optimizer, and also allows the diagnostic analysis to get a full-program view of the compiled library -- very useful for detecting latent bugs that in practice only produce incorrect results on non-amd64 hardware!!)