Closed sebastic closed 2 years ago
Thank you! We'll take a look.
Is there any way I could get a login to machines with these architectures, so I could run gdb
or similar to get a stack trace? The i386, armel, armhf Slatec failures look like they might be SEGVs, for a start.
Not easily, you'd need a sponsor for a guest account on the Debian porterboxes:
https://dsa.debian.org/doc/guest-account/
i386 you can easily test in a chroot or VM on an amd64 system.
arm64 and a few other architectures also have reasonably good emulation in QEMU, e.g;
https://wiki.debian.org/Arm64Qemu
Amazon may also still have the free tier for their arm64 VMs.
Thanks, that's a good start!
A common theme, including on ppc64, is:
Badvalue is set to 0 or 1. This will cause data loss when using badvalues for comparison operators. at ops.pd line 137.
PDL::Ops::__ANON__(PDL=SCALAR(0x1002b456840), 0, "") called at t/bool.t line 23
Despite the large number of failures and diverse architectures, my feeling (or at least hope) is this is a small number of errors.
Also, you wouldn't happen to have any insight into how to get any of these environments (at all) going with GitHub actions? I'd really like to catch this stuff in our CI. I'd also like to routinely run lintian
to catch those errors sooner too.
By the way, is there any value in PDL's current debian
directory existing? I've excluded it now from our MANIFEST
since I saw one of your patches does that, but I want it to be relevant / helpful, or I'll remove it entirely (or replace it if you'd like to have that as the canonical location for Debian stuff, which means we could CI it with every commit - do other projects do that?).
On that topic, I saw ironically that you recently dropped proj-bin
as a dep. With the new Proj v6+ support, it would be effective to add them back, have Alien::proj as a purely build-only dep, and get the Proj4 modules in Debian PDL again?
Despite the large number of failures and diverse architectures, my feeling (or at least hope) is this is a small number of errors.
Agreed. If you start by fixing the issues on i386 you likely have fixed those issues on the other 32bit architectures too.
Also, you wouldn't happen to have any insight into how to get any of these environments (at all) going with GitHub actions? I'd really like to catch this stuff in our CI.
I have no experience with GitHub actions, a little searching the docs finds me this which might be a good start:
https://github.com/marketplace/actions/run-on-architecture
I'd also like to routinely run lintian to catch those errors sooner too.
lintian only makes sense for Debian package builds, you could implement some of its tests in CI though.
By the way, is there any value in PDL's current debian directory existing? I've excluded it now from our MANIFEST since I saw one of your patches does that, but I want it to be relevant / helpful, or I'll remove it entirely (or replace it if you'd like to have that as the canonical location for Debian stuff, which means we could CI it with every commit - do other projects do that?).
If you want to build Debian packages yourself from the PDL source tree, there is still value keeping the directory.
It is not used for the package in Debian, we maintain our own packaging.
Since we have control over the infrastructure like Salsa, I will always maintain packaging there. I'm not an upstream developer, or even a user, which makes maintaining the packaging in the upstream repo make little sense.
If you get more involved with packaging PDL and its dependencies in Debian, you are in a much better position to maintain the upstream packaging.
On that topic, I saw ironically that you recently dropped proj-bin as a dep. With the new Proj v6+ support, it would be effective to add them back, have Alien::proj as a purely build-only dep, and get the Proj4 modules in Debian PDL again?
Alien::proj is not packaged for Debian, and since I don't actually use PDL and only keep the packaging on life support, I'm not willing to package it. This is why we had the pkg-config discussion to get the build flags in the other issue, it would remove the need for a new dependency to get packaged.
If I can build PDL with PROJ support by just reverting the removal the libproj-dev & proj-bin (build) dependencies I will do so. Based on the discussion I expected the pkg-config support to be available in PDL 2.064 and planned to revert the removal, but because it still requires Alien::proj I left it as-is.
At least arm64 has char
be unsigned. I've now specified SByte
to be explicitly unsigned char
, and released that as 2.068 - please let me know if that makes a difference. I believe that will pass at least on arm64, and possibly others. I haven't yet properly investigated the Slatec fails, which are next.
Regarding Slatec, be aware of these changes:
Thanks!
I believe that will pass at least on arm64, and possibly others.
Only arm64 so far:
https://buildd.debian.org/status/package.php?p=pdl&suite=experimental
I've updated Slatec and Minuit to be (hopefully) 64-bit aware, and will release that in due course.
Meanwhile, the FITS failures all show:
substr outside of string at /<<PKGBUILDDIR>>/IO/FITS/../../blib/lib/PDL/IO/FITS.pm line 895.
which is code that uses the get_dataref
mechanism to read directly into ndarray's data using substr
. This implies the sizes of ndarrays might be wrong, which might be due to them being of an unexpectedly small type.
This was in fact a small number of problems:
rfits
with compression exposed an equally-long-existing bug on big-endian systems: unconditional byte-swapping rather than checking for endian-nessSByte
type was a char
which on some systems is still unsigned, which with our policy of default badvalues on supposedly signed types being the MIN value led to the badvalue on SByte
being 0, which caused test failures due to warning in that situationt/autoload.t
was inappropriately precise, doing an exact-equality test on floating-point numbersI'm leaving this issue open because we haven't yet formally confirmed that this actually is all the problems yet, but any remaining ones will hopefully be small and easily solved.
Lessons learned along the way:
qemu
is a great tool for locally trying various architectures, though it's very slow, and actual CI is obviously a good idea; we might usefully set up jobs that only get run for tags, like Cygwin currentlypdlapi
(for the ABI), which will be very beneficial to mimic/learn from in packaging for Homebrew, and conceivably ChocolateyPackaging notes for myself:
pkg-config
call for nowlibpdl-graphics-plplot-perl
and libmodule-compile-perl
libx*
and x11proto-core-dev
build deps want a reviewr-pdl
conflicts stuff may be obsoletePDL::FAQ will want a review of its packaging / installation notes. Also, do we actually really need to dep on Term::ReadKey at all?
I did some extra checks for Slatec, using these techniques; first I replaced all the individual files in Libtmp/Slatec/Makefile
with:
FFLAGS=-fPIC -std=legacy -cpp
slatec/bigun$(OBJ_EXT): slatec/bigun.f
gfortran -c $(FFLAGS) -o slatec/bigun$(OBJ_EXT) slatec/bigun.f
Then iterated to find further problems and fix them using:
(cd Libtmp/Slatec/;
(for x in slatec/[c-z]*.f; do echo "#line 1 \"$x\""; cat $x; done) >bigun.f; mv bigun.f slatec/;
time make test
) |&less
which lets the compiler see all the Fortran at once and so detect problems, while with the -cpp
and #line
directives still show the right file and line numbers. Interestingly, because there are 112 files there, it takes something like 2s to compile them individually, but only about 1.1s to compile the concatenated file.
The most recent build shows failures for 32-bit (next one should be fine), but also GSL::LINALG failures on at least sparc64, ppc64; specifically the LU_det
returns 0 rather than the expected roughly 0.07+1e-9i. No idea why.
The latest problem is that t/transform.t
fails on 32-bit, on only one test. So far this is down to the way whichND
uses index
for slices to update columns of the resulting coordinates affecting more than just the indicated column, but only on 32-bit!
While the package now builds successfully on all release architectures, it only does so because PDL::GSL::LINALG is disabled on mips64el/mipsel, powerpc/ppc64, and sparc64.
To truly resolve this issue that should be fixed too.
That module is disabled on those platforms due to a probable compiler bug which (as you know) has been reported. Nothing at all can be done in this library to fix that. @sebastic Do you agree, and if so could you close this? As you can see in the linked comment above, this thing staying open is causing people to get an inaccurate picture of this library's stability.
Nothing has changed since my last comment, so I don't think this issue can be closed.
If you're unable or unwilling to truly resolve the issue on those architectures, please state so explicitly to inform any possible users of those architectures that they can't expect a fully functional PDL like on more popular architectures.
It's quite possible that there are zero actual users of PDL on those architectures, so it's quite acceptable to not spend more time on them.
When you say "explicitly stated", where do you have in mind? It certainly seems valid to move the patch from the Debian packaging into main PDL, since it's in no way specific to Debian.
Bare minimum would be to state that in the comment with which this issue is closed.
Upstreaming the changes from the Debian package is an option, although I'm not aware of any distros other than the BSDs also supporting those architectures.
I didn't know about other platforms supporting those, though it's great they do. I feel like upstreaming that patch would ironically be bad by hiding information (and I don't know for sure that clang breaks in this way) - hopefully they'd report an issue.
I'll thank you again for the huge effort you've made in helping us fix the real problems this library had on those various architectures!
As of this writing, PDL::GSL::LINALG does not work on Debian on mips64el/mipsel, powerpc/ppc64, and sparc64 - it is believed this is due to an actual compiler bug. Therefore this issue is being closed. Please report any difficulties on here (since I recently learned non-maintainers can't reopen issues) and I'll reopen it, else please open a new one.
@sebastic you may want to check out #468 -- it contains other 64bit fixes, this time exposed by LTO passes (which are very good at tickling the optimizer, and also allows the diagnostic analysis to get a full-program view of the compiled library -- very useful for detecting latent bugs that in practice only produce incorrect results on non-amd64 hardware!!)
The Debian package for PDL 2.066 fails to build on various architectures, see:
https://buildd.debian.org/status/package.php?p=pdl&suite=experimental