AcademySoftwareFoundation / openexr

The OpenEXR project provides the specification and reference implementation of the EXR file format, the professional-grade image storage format of the motion picture industry.
http://www.openexr.com/
BSD 3-Clause "New" or "Revised" License

Fedora Rawhide aarch64 - Test 4 fails #876

Open hobbes1069 opened 3 years ago

hobbes1069 commented 3 years ago

I'm working on upgrading the OpenEXR stack on Fedora but ran into a strange arch specific issue. The x86_64 build passes all tests.

Test project /builddir/build/BUILD/openexr-2.5.3/aarch64-redhat-linux-gnu
    Start 1: IlmBase.Half
    Start 2: IlmBase.Iex
    Start 3: IlmBase.Imath
    Start 4: OpenEXR.IlmImf
1/8 Test #2: IlmBase.Iex ..........................   Passed    0.01 sec
    Start 5: OpenEXR.IlmImfUtil
2/8 Test #1: IlmBase.Half .........................   Passed    2.43 sec
    Start 6: PyIlmBase.PyIexTest_Python3
3/8 Test #6: PyIlmBase.PyIexTest_Python3 ..........   Passed    0.09 sec
    Start 7: PyIlmBase.PyImathTest_Python3
4/8 Test #7: PyIlmBase.PyImathTest_Python3 ........   Passed    0.44 sec
    Start 8: PyIlmBase.PyImathNumpyTest_Python3
5/8 Test #8: PyIlmBase.PyImathNumpyTest_Python3 ...   Passed    0.44 sec
6/8 Test #3: IlmBase.Imath ........................   Passed    7.55 sec
7/8 Test #5: OpenEXR.IlmImfUtil ...................   Passed   50.38 sec
8/8 Test #4: OpenEXR.IlmImf .......................Child aborted***Exception: 766.99 sec
tempDir = /var/tmp/IlmImfTest_QRIFCKRB

https://download.copr.fedorainfracloud.org/results/hobbes1069/openexr/fedora-rawhide-aarch64/01823533-openexr/builder-live.log.gz

peterhillman commented 3 years ago

I had a glance at the log, but it has now disappeared. I believe it may have been testBackwardCompatibility that failed. You can run the test manually with bin/IlmImfTest testBackwardCompatibility (or just bin/IlmImfTest to run them all if that passes).

That particular test requires reading an existing file and comparing to a temporary file. The assert message should give both paths. You might confirm that the files are present and readable to the device running the test, and if so, whether they are identical. If they are different, perhaps you could attach them.

Guessing possible causes: the reference images exist in the source tree, in the IlmImfTest folder, so when cross-compiling the files may not be readable to the machine running the test. ILM_IMF_TEST_IMAGEDIR can be set at compile time to specify the path that the executing machine should use to read the test images. Also, these images have moved between 2.5.3 and the master branch (to src/test/OpenEXRTest), and the test has been renamed OpenEXRTest.

hobbes1069 commented 3 years ago

It's also failing for s390x. I'm trying to get access to the aarch64 test instance. Is it possible to add any flags/options that provide more verbose output from a package build to get the same information?

hobbes1069 commented 3 years ago

In this case we're not cross-compiling, it's all built on native hardware (or in a VM worst case).

I've kicked off a scratch build if you want to look at the logs once they complete or fail.

https://koji.fedoraproject.org/koji/taskinfo?taskID=57650850

peterhillman commented 3 years ago

Thanks for that - this is the error message from the build.log

Running testBackwardCompatibility
Testing backward compatibility
ERROR -- caught exception: v1.7 and current differences between '/var/tmp/IlmImfTest_CBTOGVNI/v1.7.test.planar.exr' & '/builddir/build/BUILD/openexr-2.5.3/OpenEXR/IlmImfTest/v1.7.test.planar.exr'
IlmImfTest: /builddir/build/BUILD/openexr-2.5.3/OpenEXR/IlmImfTest/testBackwardCompatibility.cpp:395: void testBackwardCompatibility(const string&): Assertion `false' failed.

It appears that specific message would only appear if both files are readable and differ. If possible, it would be helpful to get a copy of the contents of /var/tmp/IlmImfTest_*/ (i.e. the files written by the test).

I will expand the test's error message to indicate what the difference between the two files is, which may help to debug such issues. Also, I notice the test doesn't check that the files are the same size.
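
As a rough illustration of what a more informative comparison could report (file sizes and the first differing byte offset), here is a minimal Python sketch. It is not the code in testBackwardCompatibility.cpp; the function name `describe_difference` and its message format are assumptions made for this example.

```python
import os


def describe_difference(path_a, path_b, chunk_size=65536):
    """Return None if the two files are byte-identical, else a short description.

    Hypothetical helper for illustration; not part of the OpenEXR test suite.
    """
    size_a = os.path.getsize(path_a)
    size_b = os.path.getsize(path_b)
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                common = min(len(chunk_a), len(chunk_b))
                # Look for the first byte that differs inside the shared part.
                for i in range(common):
                    if chunk_a[i] != chunk_b[i]:
                        return (f"sizes {size_a} vs {size_b} bytes; "
                                f"first difference at byte offset {offset + i}")
                # No differing byte: one file is a prefix of the other.
                return (f"sizes {size_a} vs {size_b} bytes; contents match up to "
                        f"byte offset {offset + common}, where the shorter file ends")
            if not chunk_a:
                # Both files reached end-of-file without any difference.
                return None
            offset += len(chunk_a)
```

Reporting the sizes alongside the offset separates a truncated or incomplete file from one that is complete but encoded differently.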

hobbes1069 commented 3 years ago

Ok, I don't have an immediate answer then. I have someone shipping me a Jetson Nano to install Fedora aarch64 on but it will likely be at least next week before I get it or have time to play with it.

hobbes1069 commented 3 years ago

I found a way to build the package using qemu and aarch64 virtualization, so after over 5 hours here's the result:

https://www.dropbox.com/s/0nhf027gzcpp063/imf_test_copy.exr

peterhillman commented 3 years ago

@hobbes1069 thanks for that file. This appears to be a file from a different test than the one that reported failure in the build logs. testTiledLineOrder writes a file called imf_test_copy.exr, which is consistent with this file. This file is also reported incomplete. I was expecting a file called v1.7.test.planar.exr.

I wonder if there was a completely different failure case this time. If so, we'd need the test log messages to help understand the issue. Is it possible that 5 hours was just too long and the test simply aborted before it finished?

hobbes1069 commented 3 years ago

Ok, that was the only file in the directory... So if it doesn't write the file it will be different from the original, no? :)

I just got an Nvidia Jetson Nano gifted by the Fedora Project and just recently got it up and going. Building openexr now. Hopefully faster than emulating.

peterhillman commented 3 years ago

Successful tests clean up files, so any files in the temporary directory are related to failed tests. My reading of the issue is that imf_test_copy.exr was left because testTiledLineOrder failed. This runs before testBackwardCompatibility, so that would not have had a chance to run. I'm guessing that testTiledLineOrder failed due to something other than an issue with the library or the test suite (maybe a timeout/out of memory/etc). With direct access to hardware, running bin/IlmImfTest testTiledLineOrder and bin/IlmImfTest testBackwardCompatibility should be a way of isolating each of the troublesome tests.
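
One way to isolate the tests as described is to launch the test binary once per test name and record each exit status separately, so a timeout or abort in one test cannot mask another. The sketch below is only an illustration: the wrapper `run_isolated` is hypothetical, and the only thing taken from this thread is that bin/IlmImfTest accepts a test name as its argument.

```python
import subprocess


def run_isolated(binary, test_names, timeout_seconds=None):
    """Run `binary <test_name>` once per name and collect the outcomes.

    Hypothetical wrapper for illustration. On POSIX a negative return code
    means the process was killed by a signal (for example -6 for SIGABRT,
    which is what a failed assert() produces).
    """
    results = {}
    for name in test_names:
        try:
            completed = subprocess.run([binary, name], timeout=timeout_seconds)
            results[name] = completed.returncode
        except subprocess.TimeoutExpired:
            # The child is killed by subprocess.run before this is raised.
            results[name] = "timeout"
    return results


# Example call, assuming the binary path used in this thread:
# run_isolated("bin/IlmImfTest", ["testTiledLineOrder", "testBackwardCompatibility"])
```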

hobbes1069 commented 3 years ago

Well, so far the 4GB memory in the Jetson Nano has not been enough to build OpenEXR. I'm trying to either reduce the number of parallel jobs in make or extend the swap space available.

hobbes1069 commented 3 years ago

testTiledLineOrder completes without error.

Ahh... Looks like we're hitting an assertion.

# ./IlmImfTest testBackwardCompatibility
tempDir = /var/tmp/IlmImfTest_HILEPYMX

=======
Running testBackwardCompatibility
Testing backward compatibility
ERROR -- caught exception: v1.7 and current differences between '/var/tmp/IlmImfTest_HILEPYMX/v1.7.test.planar.exr' & '/builddir/build/BUILD/openexr-2.5.3/OpenEXR/IlmImfTest/v1.7.test.planar.exr'
IlmImfTest: /builddir/build/BUILD/openexr-2.5.3/OpenEXR/IlmImfTest/testBackwardCompatibility.cpp:395: void testBackwardCompatibility(const string&): Assertion `false' failed.
Aborted (core dumped)

peterhillman commented 3 years ago

Can you provide your /var/tmp/IlmImfTest_HILEPYMX/v1.7.test.planar.exr?

I tried a Raspberry Pi 4 Model B running 64-bit Ubuntu 20.10 and all tests passed. (Running through ctest did trigger a timeout, but running IlmImfTest directly was fine.)

I will try with Fedora 33

hobbes1069 commented 3 years ago

https://www.dropbox.com/s/j6xn5tb1n71xu06/v1.7.test.planar.exr

peterhillman commented 3 years ago

Thanks! It seems like the zlib compressed data differs between aarch64 and x86_64 on Fedora (though apparently not Ubuntu). I wonder if the compression level of Z_DEFAULT_COMPRESSION differs for some reason. If I can get Fedora aarch64 to boot I can run more detailed tests. It would be nice to have binary equivalence between files regardless of the processor architecture, though that's not such a critical requirement. Maybe testBackwardCompatibility should use uncompressed files for comparison to circumvent these issues.
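
A quick way to probe the compression-level hypothesis on a given machine is to compress one payload at the default level and at each explicit level, then see which explicit levels give a byte-identical stream. The sketch below uses Python's standard `zlib` module, which wraps whichever zlib the interpreter is linked against. The helper name `levels_matching_default` is made up for this example, and the probe says nothing about OpenEXR's own call path.

```python
import zlib


def levels_matching_default(payload):
    """Return the explicit levels (0-9) whose output equals the default-level output."""
    default_stream = zlib.compress(payload, zlib.Z_DEFAULT_COMPRESSION)
    return [level for level in range(10)
            if zlib.compress(payload, level) == default_stream]


if __name__ == "__main__":
    sample = bytes(range(256)) * 64  # arbitrary, mildly compressible test data
    print("zlib runtime version:", zlib.ZLIB_RUNTIME_VERSION)
    print("levels matching default:", levels_matching_default(sample))
```

zlib.h documents Z_DEFAULT_COMPRESSION as currently equivalent to level 6. If two machines report the same matching level but still write different bytes, the difference lies in the compressor implementation rather than in the level.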

hobbes1069 commented 3 years ago

PM me if you'd like me to set up a login on my Jetson Nano. I can have all the build deps installed ahead of time.

hobbes1069 commented 3 years ago

I've been holding up building official packages for Fedora until this got straightened out, but it doesn't sound like it's a big deal. Do you concur?

peterhillman commented 3 years ago

I'm not sure whether this is an issue. The file produced by testBackwardCompatibility does seem to be backward compatible, but the test inadvertently revealed binary differences between files produced by fedora aarch64 and x86_64. It would be good at least to understand and document those differences. The files produced are readable by other architectures, so it doesn't seem to be a serious issue.

I won't have time to look into this personally for a couple of weeks. One useful experiment would be to see if earlier versions of OpenEXR pass testBackwardCompatibility. Looking at the Fedora zlib-devel rpm file, it does have arm-specific patches, and also options to override the compression level when the zlib library is compiled. They could be responsible for the difference, rather than anything in the OpenEXR library itself.

hobbes1069 commented 3 years ago

It looks like earlier versions did not have this issue, but the last version built for Fedora is 2.3.0, due to the inclusion of ilmbase et al.

https://kojipkgs.fedoraproject.org//packages/OpenEXR/2.3.0/7.fc34/data/logs/aarch64/build.log

The maintainer didn't have time to deal with the change so I stepped in. I will be obsoleting both the OpenEXR and ilmbase packages in Fedora and providing "openexr" in its place.

hobbes1069 commented 3 years ago

How hard would it be to come up with a patch to test an uncompressed comparison?

peterhillman commented 3 years ago

I've made some progress, getting Fedora Minimal working on a Raspberry Pi 4B, and reproduced the issue. The problem seems to be the way that the zlib library is built for Fedora on aarch64.

I downloaded zlib-1.2.11, compiled it from source, installed it in a different location, then updated LD_LIBRARY_PATH to pick up my new zlib.so. testBackwardCompatibility now succeeds. I don't know much more about this, but it does seem like the arm-specific patches to zlib are making binary differences in the compressed data.

It would be good to hear from someone who understands those patches: is data compressed by Fedora aarch64 guaranteed to be readable by any other architecture, and vice versa? Is there a way (optionally at least) to guarantee identical binary-compressed data with these patches in place?

If the differences are unavoidable then making testBackwardCompatibility use uncompressed data would circumvent the error message, but potentially hide more serious issues like this in the future.

A temporary fix might be for fedora aarch64 to patch IlmImfTest/main.cpp to comment out the call to testBackwardCompatibility()

hobbes1069 commented 3 years ago

Thanks for digging in. It looks like this has been a known issue for some time, unfortunately...

https://bugzilla.redhat.com/show_bug.cgi?id=1665221

For now I've disabled testing on aarch64 and s390x (which I have no ability to test on).

jlinton commented 3 years ago

Yes, I'm the guy partially responsible for the zlib acceleration patches. To answer the general question: yes, the output data will vary slightly on aarch64, due to the use of neon/vector registers, but it still conforms to the zlib format. This means, as you have discovered, the files can be decompressed on other architectures, etc.

This is the expected behavior of most compression libraries (the stream is defined, but the actual match vs literal tokens in the compressed stream may vary from release to release or based on size vs compression speed options).

And as noted, unit tests wishing to verify the compressor should decompress the data and verify it matches the original rather than comparing the compressed data with another source. It's only because zlib has remained unchanged for most of a couple of decades that this works at all. This is both zlib's strong point and, when compared with more recent compression libraries, its weak point, given that it simultaneously gets beaten on speed and compression-ratio tests by more modern implementations that use 32-bit+ comparison functions, etc.
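
A minimal sketch of that testing principle, using Python's standard `zlib` module: two valid streams for the same input can differ byte-for-byte and still decompress to identical data. Here two compression levels stand in for two compressor implementations. That substitution is an assumption made for illustration and does not reproduce the aarch64 patches themselves.

```python
import zlib


def compare_streams(payload, level_a=1, level_b=9):
    """Compress payload twice and report (streams_identical, round_trips_agree)."""
    stream_a = zlib.compress(payload, level_a)
    stream_b = zlib.compress(payload, level_b)
    streams_identical = stream_a == stream_b
    # The robust check: both streams must decode back to the original input.
    round_trips_agree = (zlib.decompress(stream_a) == payload
                         and zlib.decompress(stream_b) == payload)
    return streams_identical, round_trips_agree


if __name__ == "__main__":
    data = bytes(range(256)) * 64  # arbitrary test data
    identical, agree = compare_streams(data)
    print("compressed bytes identical:", identical)
    print("decompressed data matches original:", agree)
```

A test built on the second check keeps passing when the compressor changes, whereas one built on the first only passes while every platform emits exactly the same bytes.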

Also, with respect to testing, qemu can run arm32/64 containers or random binaries with the Linux binfmt_misc options, which can emulate a target arch for a single process. This allows one to, say, write a file with the x86 version of a program and then CI/test it with a program compiled for another arch.

https://ownyourbits.com/2018/06/13/transparently-running-binaries-from-any-architecture-in-linux-with-qemu-and-binfmt_misc/
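
Before relying on this in CI, it can help to check whether any qemu handlers are actually registered with binfmt_misc on the build host. The sketch below lists entries under /proc/sys/fs/binfmt_misc. The assumption that handler names start with `qemu-` reflects how qemu user-mode packages commonly register them, and it may not hold on every distribution.

```python
import os

BINFMT_DIR = "/proc/sys/fs/binfmt_misc"


def registered_qemu_handlers(binfmt_dir=BINFMT_DIR):
    """List binfmt_misc entries whose names start with "qemu-".

    The "qemu-" prefix is an assumed naming convention, not a kernel rule.
    Returns an empty list when binfmt_misc is unavailable or not mounted.
    """
    try:
        entries = os.listdir(binfmt_dir)
    except OSError:
        return []
    return sorted(entry for entry in entries if entry.startswith("qemu-"))


if __name__ == "__main__":
    handlers = registered_qemu_handlers()
    print("qemu binfmt_misc handlers:", handlers or "none found")
```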

limburgher commented 2 years ago

Failing tests on i686 and ppc64le.

meshula commented 2 years ago

@limburgher ~ thanks for raising it for attention. To get proper visibility outside of this specific triple of Fedora/Rawhide/aarch64, it might be a good idea to open a new issue about the OS/release/architectures you're encountering a failure on, and also post some logs to help with diagnosis.

hobbes1069 commented 1 year ago

I think this can be closed, but still have issues with ppc64le. s390x builders are down for maintenance but I'm trying a qemu emulation build to see if #1175 is still an issue.