edoapra / fedpkg

My fork of pkgs.fedoraproject.org/rpms

ga/nwchem arch tracking #10

Closed marcindulak closed 1 year ago

marcindulak commented 4 years ago

ga https://github.com/edoapra/fedpkg/commit/83cac7356fe3f10563c9fda4b4afa9dfcb822349

https://dl.fedoraproject.org/pub/epel/6/

default archs https://koji.fedoraproject.org/koji/taskinfo?taskID=42415017: i386, x86_64

| arch | status | link |
| --- | --- | --- |
| i386 | + | https://koji.fedoraproject.org/koji/taskinfo?taskID=42414498 |
| ppc64 | BuildError: No matching arches were found | https://koji.fedoraproject.org/koji/taskinfo?taskID=42414496 |
| x86_64 | + | rpm |

https://dl.fedoraproject.org/pub/epel/7/

default archs https://koji.fedoraproject.org/koji/taskinfo?taskID=42415019: ppc64le, x86_64

| arch | status | link |
| --- | --- | --- |
| aarch64 | BuildError: No matching arches were found | https://koji.fedoraproject.org/koji/taskinfo?taskID=42414490 |
| ppc64 | BuildError: No matching arches were found | https://koji.fedoraproject.org/koji/taskinfo?taskID=42414492 |
| ppc64le | + | https://koji.fedoraproject.org/koji/taskinfo?taskID=42414494 |
| x86_64 | + | rpm |

https://dl.fedoraproject.org/pub/epel/8/Everything/

default archs https://koji.fedoraproject.org/koji/taskinfo?taskID=42415025: aarch64, ppc64le, s390x, x86_64

| arch | status | link |
| --- | --- | --- |
| aarch64 | + | https://koji.fedoraproject.org/koji/taskinfo?taskID=42414398 |
| ppc64le | + | https://koji.fedoraproject.org/koji/taskinfo?taskID=42414429 |
| s390x | 2 of 6 tests failed. | https://koji.fedoraproject.org/koji/taskinfo?taskID=42414454 |
| x86_64 | + | rpm |

https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/Everything/ https://dl.fedoraproject.org/pub/fedora-secondary/development/rawhide/Everything/

default archs https://koji.fedoraproject.org/koji/taskinfo?taskID=42415031: armv7hl, i686, x86_64, ppc64le, s390x

| arch | status | link |
| --- | --- | --- |
| aarch64 | + | https://koji.fedoraproject.org/koji/taskinfo?taskID=42414329 |
| armhfp | No matching package to install: 'libibverbs-devel' | https://koji.fedoraproject.org/koji/taskinfo?taskID=42414331 |
| x86_64 | + | rpm |
| ppc64le | + | https://koji.fedoraproject.org/koji/taskinfo?taskID=42414337 |
| s390x | 2 of 6 tests failed. | https://koji.fedoraproject.org/koji/taskinfo?taskID=42414339 |

edoapra commented 4 years ago

I have never worked with any s390x hardware. I would rather stick to arm and ppc64le. The latest commit da469c1e01a50ffb74590d36be85edbc6d16c3c7 to ga.spec builds successfully on armv7hl, aarch64 and ppc64le for epel7, epel8, f32 and rawhide:

* https://koji.fedoraproject.org/koji/taskinfo?taskID=42416254
* https://koji.fedoraproject.org/koji/taskinfo?taskID=42416260
* https://koji.fedoraproject.org/koji/taskinfo?taskID=42416281
* https://koji.fedoraproject.org/koji/taskinfo?taskID=42416281

marcindulak commented 4 years ago

The https://src.fedoraproject.org/rpms/ga/commits/master is now at https://github.com/edoapra/fedpkg/commit/da469c1e01a50ffb74590d36be85edbc6d16c3c7 (note that you have more patch files in the fedpkg repo compared to src.fedoraproject.org). I think it's time to merge src.fedoraproject.org ga into your fedpkg master, then rebase (or merge) develop on master.

Can you provide feedback?

* el6: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2020-dcba2e0d00
* epel7: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2020-019e867e64
* epel8: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2020-5372bcb47f
* f32: https://bodhi.fedoraproject.org/updates/FEDORA-2020-40f6a1e36a

Then we can move to nwchem builds.

edoapra commented 4 years ago

Could you push the latest patch to the fedora tree so that this simplifies my task? Thanks


edoapra commented 4 years ago

@marcindulak Forget my last comment, I did not read your last posting correctly.

marcindulak commented 4 years ago

nwchem https://github.com/marcindulak/fedpkg/commit/f0de3a240f900d8310e3c4f289e6a8a9a0fc9784

epel6 OK on default archs https://koji.fedoraproject.org/koji/taskinfo?taskID=42610676: i686, x86_64

epel7 OK on default archs https://koji.fedoraproject.org/koji/taskinfo?taskID=42610687: x86_64, ppc64le

epel8 OK on aarch64: https://koji.fedoraproject.org/koji/taskinfo?taskID=42610712

epel8 fails on ppc64le for tests/h2o2-prop-notrans/h2o2-prop-notrans

epel8 fails on x86_64 for openmpi - the tests hang

rawhide OK on i686: https://koji.fedoraproject.org/koji/taskinfo?taskID=42610715

rawhide fails on armv7hl with: Compiling basis.F... gfortran: error: unrecognized command-line option '-m64'

rawhide fails on x86_64 for several tests

rawhide fails on aarch64 for several tests

rawhide fails on ppc64le for several tests

Note that a smaller number of tests fail on f32 (actually maybe just one on ppc64le). Failing the spec test stage does not stop the rpm from finishing the build (this is set on purpose this way).

The only blocker is rawhide/f32 on armv7hl, but the failing tests probably need to be investigated.
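
The "failing tests do not fail the build" behaviour can be illustrated with a small shell sketch. This is not the actual test section of nwchem.spec: the `run_nonfatal` helper and the `true`/`false` stand-in commands are hypothetical, and only show how failures can be counted while the overall exit status stays 0.

```shell
# Hypothetical helper: run a test command, record a failure, never abort.
failed=0
run_nonfatal() {
    if "$@"; then
        echo "PASS: $*"
    else
        echo "FAIL: $*"
        failed=$((failed + 1))
    fi
    return 0
}

# Stand-ins for real QA tests: `true` passes, `false` fails.
run_nonfatal true
run_nonfatal false
run_nonfatal true

# The script still ends with status 0, so a build step wrapping it continues.
echo "failed tests: $failed"
```

The trade-off is the one described above: the build always finishes, but somebody has to read the log to notice the failure count.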

edoapra commented 4 years ago

@marcindulak my latest commit on top of your pull request might contain all the fixes needed to address the problems your tests have shown. My builds are running.

edoapra commented 4 years ago

@marcindulak I need to correct the previous posting. I was too optimistic about the ppc64le architecture. I think I had failures on ppc64le, too. I am trying to investigate the causes of this breakage.

edoapra commented 4 years ago

The latest commit should address the ppc64le failures: https://github.com/edoapra/fedpkg/commit/5b94ba65485258782a1f85f386a3eebdb8363197. The results of my koji builds look good now.

edoapra commented 4 years ago

One more minor change https://github.com/edoapra/fedpkg/commit/fc842066088321000a481e837bd313899df77753 to avoid false negatives on rawhide/ppc64le

edoapra commented 4 years ago

@marcindulak, did you run these tests with the latest commit https://github.com/edoapra/fedpkg/commit/fc842066088321000a481e837bd313899df77753?

It should address most of the failures (false negatives) except for the epel6 i686.

I have just realized my koji builds do not seem to run the full set of tests of epel6/i686. Is there a time limit?

On 3/21/20 10:34 AM, marcindulak wrote:

With the above, there are some individual tests failing:

epel6 i686 openmpi https://koji.fedoraproject.org/koji/taskinfo?taskID=42660327

epel8 x86_64 mpich https://koji.fedoraproject.org/koji/taskinfo?taskID=42660326:

f33 armv7hl mpich https://koji.fedoraproject.org/koji/taskinfo?taskID=42660226

marcindulak commented 4 years ago

You noticed it, I didn't :). This is why I removed the above post and added another post (also removed) explaining that I missed this change. However, github mailing apparently didn't pick up my last removed comment properly, so it was actually better to leave the wrong comment in place.

The results with https://github.com/edoapra/fedpkg/commit/fc842066088321000a481e837bd313899df77753 I'm getting are:

epel6 i686 openmpi https://koji.fedoraproject.org/koji/taskinfo?taskID=42670660:

epel7 x86_64 mpich https://koji.fedoraproject.org/koji/taskinfo?taskID=42670658

f32 x86_64 mpich https://koji.fedoraproject.org/koji/taskinfo?taskID=42670655

f33 x86_64 mpich https://koji.fedoraproject.org/koji/taskinfo?taskID=42670621

We kill the tests after 30 minutes https://github.com/edoapra/fedpkg/blob/fc842066088321000a481e837bd313899df77753/nwchem/nwchem.spec#L433, the TIMEOUT_OPTS env variable is defined a bit above in the spec. I see that the runs on slow platforms like arm take 25 minutes, so we may need to increase this timeout. On epel6 only tests/h2o2-prop-notrans/h2o2-prop-notrans is missing, so we may get it to run after increasing the timeout (maybe to 45 minutes?), or it hangs.
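
As a rough illustration of the timeout mechanism, here is a minimal sketch built on coreutils `timeout`. It does not reproduce the spec's TIMEOUT_OPTS (its contents are not shown here); the `run_with_limit` wrapper and the 45-minute value are assumptions taken from the suggestion above.

```shell
# Assumed value: 45 minutes, the figure floated above; the spec currently uses 30.
TEST_TIMEOUT_SECONDS=$((45 * 60))

# Hypothetical wrapper around coreutils `timeout`.
# Exit status 124 means the command was stopped for exceeding the limit.
run_with_limit() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "TIMEOUT after ${limit}s: $*"
    fi
    return "$rc"
}

# Demonstration with short limits so it finishes quickly.
run_with_limit 5 sleep 1 && echo "fast job finished in time"
run_with_limit 1 sleep 5 || echo "slow job exit status: $?"
```

In a real test run the limit would be `$TEST_TIMEOUT_SECONDS` and the command would be the QA test driver; a hang and a merely slow test both end with status 124, so the log alone cannot tell them apart.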

We should also remove this part, this was used to disable some tests in the past https://github.com/edoapra/fedpkg/blob/fc842066088321000a481e837bd313899df77753/nwchem/nwchem.spec#L407-L414

edoapra commented 4 years ago

I am at a complete loss to understand why the last commit fixes these QA issues for me and not for you. Adding HYDRA_DEBUG=0 removes the verbose output that confuses the parser for the QA tests for me, but not for you. I am trying to investigate a bit more before trying to fix the parser instead.

I am working on a fix for rhel6/i386, but that's a separate problem.

edoapra commented 4 years ago

Checked in a fix to the parser: https://github.com/edoapra/fedpkg/commit/762093f8653625758860a23928f2e011aee2043a. Please try it. I have no way to test it, since in my koji builds the problem vanished after commit https://github.com/edoapra/fedpkg/commit/fc842066088321000a481e837bd313899df77753

edoapra commented 4 years ago

Now the koji builds for el6-candidate and epel7 don't even start

marcindulak commented 4 years ago

https://lists.fedoraproject.org/archives/list/packaging@lists.fedoraproject.org/thread/6Y6QG5HMTYQ5ZULWJSEZCLI2DUFITZRP/

edoapra commented 4 years ago

With koji back to processing el6-candidate/epel7, I was able to successfully test the latest changes (dc7f602b5a511fa00a5134c2131bc0c2f9a1aac6). I set NPROCS=1 for the i386 openmpi/rhel6 tests. It's not the most satisfactory solution, but with such an old version of openmpi in use, I am not very interested in trying to fix the bug that causes the QA test failure.

marcindulak commented 4 years ago

I think we should still build mpich on rhel6, just without tests.

edoapra commented 4 years ago

> I think we should still build mpich on rhel6, just without tests.

We don't run tests on the ga part, and then the same on the nwchem side ... I am not sure we want to distribute RPMs that are not tested at all.

marcindulak commented 4 years ago

We could build rhel6/mpich and test manually taking the rpms. I started a build before your last change https://koji.fedoraproject.org/koji/taskinfo?taskID=42727298 - we could test the mpich rpms produced by it.

I recall doing this and I still distribute the older version of nwchem despite the "gethostbyname failed" tests failure https://koji.fedoraproject.org/koji/buildinfo?buildID=1102187

edoapra commented 4 years ago

My intention is to build RPMs that require minimal maintenance for the foreseeable future, and this does not seem a step in the right direction. Having a set of working tests is a requirement for this to happen. Please, let's exclude mpich from rhel6 to avoid things getting out of control.

marcindulak commented 4 years ago

I see these are additional things to consider.

Currently, failing tests won't fail the rpm build, so there is still a significant amount of work involved in looking at the tests. On the other hand, if failed tests were to fail the rpm build, that would add a maintenance burden, since mass rebuilds in rawhide may break the tests.

epel6 has ga-5.7.2-3 https://dl.fedoraproject.org/pub/epel/6/x86_64/Packages/g/ - will it work with nwchem-6.8.1? If not, then I guess we need to build for mpich, but if ga-5.7.2-3 works with nwchem-6.8.1 in epel6 we may just leave it as is, and drop rhel6 support for new nwchem already now.

On the other hand, it seems like we should maintain epel6 only until November 30, 2020 https://access.redhat.com/support/policy/updates/errata, so this may be the last build.

edoapra commented 4 years ago

> On the other hand, it seems like we should maintain epel6 only until November 30, 2020 https://access.redhat.com/support/policy/updates/errata, so this may be the last build.

That's exactly my point: why all this effort for an OS that will soon be declared obsolete? Let's try to make something that works now and on November 30.

marcindulak commented 4 years ago

I've just run tests/h2o2-response on epel6 i686 and x86_64, with both openmpi and mpich with the existing ga-5.7.2-3 and nwchem-6.8.1 and verified the "Total DFT energy" agrees to 10^-8 with the reference. In order to get rid of "gethostbyname failed" I've appended the hostname to /etc/hosts localhost entry https://stackoverflow.com/questions/23112515/mpich2-gethostbyname-failed. This means the epel6 binaries provide some functionality.
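
The /etc/hosts workaround can be sketched as follows. This is an illustration only: the `add_host_to_localhost` helper is hypothetical, it assumes an IPv4 `127.0.0.1` line exists, and the demo edits a scratch copy rather than the real /etc/hosts (which would need root).

```shell
# Hypothetical helper: append a hostname to the 127.0.0.1 line of a
# hosts-style file, unless that line already mentions it.
add_host_to_localhost() {
    hosts_file=$1
    host=$2
    if grep -E '^127\.0\.0\.1[[:space:]]' "$hosts_file" | grep -qw -- "$host"; then
        return 0
    fi
    tmp=$(mktemp)
    awk -v h="$host" '
        $1 == "127.0.0.1" && !seen { print $0 " " h; seen = 1; next }
        { print }
    ' "$hosts_file" > "$tmp"
    cat "$tmp" > "$hosts_file"
    rm -f "$tmp"
}

# Demonstration on a scratch copy; "buildhost" is a made-up hostname.
demo=$(mktemp)
printf '127.0.0.1 localhost localhost.localdomain\n::1 localhost\n' > "$demo"
add_host_to_localhost "$demo" "buildhost"
add_host_to_localhost "$demo" "buildhost"   # second call is a no-op
cat "$demo"
```

On a real machine the second argument would typically be the output of `hostname`.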

Maybe the best solution now is to remove all rhel6 related logic from the spec and leave epel6 in the current state, without packaging nwchem-7.0.0 on epel6 at all?

edoapra commented 4 years ago

It's your call

edoapra commented 4 years ago

What I meant is that I am tempted to drop rhel6 support in its entirety to remove this unneeded headache. Can't we just leave rhel6 with the nwchem 6.8.1 rpm and be happy with it? I did not realize this is what you suggested in your last sentence. Yes, let's leave epel6 with 6.8.1. End of the rhel6 saga.

marcindulak commented 4 years ago

This commit works for me, except on epel7 ppc64le tests seem to hang with mpich https://koji.fedoraproject.org/koji/taskinfo?taskID=42770726. Anyway I've submitted the fedpkg update for all distributions (epel7/epel8/f32), rawhide commit https://src.fedoraproject.org/rpms/nwchem/c/2bd8633c8c2ca9236f487f5d44b64cdafc93a244?branch=master

Can you provide karma?

* epel7: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2020-47ed17c548
* epel8: https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2020-c9e07d1ce2
* f32: https://bodhi.fedoraproject.org/updates/FEDORA-2020-9f989fcee3

edoapra commented 4 years ago

Same timeout for me on epel7/ppc64le.

edoapra commented 4 years ago

Small update: nproc =1 for mpich/ppc64le avoids the QA tests timeouts https://github.com/edoapra/fedpkg/commit/e5173515647ccec33d1f0c194614c121ca8c7a35
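
The idea of lowering the process count only for the problematic combination can be sketched like this. The `pick_nprocs` function is hypothetical and the default of 2 ranks is an assumption; the linked commit is the authoritative version.

```shell
# Hypothetical selector; the default of 2 ranks is an assumption,
# not necessarily the value used in nwchem.spec.
pick_nprocs() {
    arch=$1
    mpi=$2
    case "$arch/$mpi" in
        ppc64le/mpich) echo 1 ;;   # avoids the QA test timeouts reported above
        *)             echo 2 ;;
    esac
}

echo "x86_64 openmpi  -> NPROCS=$(pick_nprocs x86_64 openmpi)"
echo "ppc64le mpich   -> NPROCS=$(pick_nprocs ppc64le mpich)"
echo "ppc64le openmpi -> NPROCS=$(pick_nprocs ppc64le openmpi)"
```

Keeping the exception in one `case` statement makes it easy to remove once the underlying hang is understood.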

marcindulak commented 4 years ago

I've included this change in the https://src.fedoraproject.org/rpms/nwchem master branch https://koji.fedoraproject.org/koji/taskinfo?taskID=42837226. After this build is successful, I think we can merge the develop back to master, and close both our issues in edoapra/fedpkg as "fixed".

edoapra commented 4 years ago

@marcindulak recent commits to https://src.fedoraproject.org/rpms/nwchem seem to indicate that openblas was replaced by flexiblas. The default for FlexiBLAS seems to be openblas-openmp, which is therefore likely to result in threaded BLAS execution with OMP_NUM_THREADS equal to the number of physical cores. This is likely to result in a large performance degradation in most NWChem modules, given the moderate size of the matrices used in BLAS calls. Do you know if performance tests were executed before and after this change?

The following posting to fedora-devel seems to be consistent with my take on using FlexiBLAS in NWChem: https://www.spinics.net/lists/fedora-devel/msg275547.html

marcindulak commented 4 years ago

The switch to flexiblas was made globally for fedora 33 by committing the change to the relevant repositories, e.g. https://src.fedoraproject.org/rpms/nwchem/c/982d2e4ddf70d299f0a8164c72f94259f715eec8?branch=master by the "FlexiBLAS as BLAS/LAPACK manager" project owner https://bugzilla.redhat.com/show_bug.cgi?id=1860504. I'm not expecting any additional tests were made apart from checking whether the affected packages were built.

I see two relatively similar builds of nwchem on fedora 33, one against openblas, another flexiblas. They are separated by 5 months (beginning of April vs end of August), and differ also on gfortran 10.0.1 vs 10.2.1.

https://koji.fedoraproject.org/koji/buildinfo?buildID=1488031 https://koji.fedoraproject.org/koji/buildinfo?buildID=1602750

Do you have a particular test that will show the performance problem? I'm inlining two dockerfiles, that can be used to test it (if you have a larger machine to test this, please do).

openblas:

FROM fedora:33@sha256:5acde95c3653f9412d64616c02cc1cd176a3e09049a6b22f8bf5395c52307d99

RUN set -x \
    && dnf install -y https://kojipkgs.fedoraproject.org//packages/nwchem/7.0.0/8.fc33/x86_64/nwchem-7.0.0-8.fc33.x86_64.rpm \
                      https://kojipkgs.fedoraproject.org//packages/nwchem/7.0.0/8.fc33/x86_64/nwchem-openmpi-7.0.0-8.fc33.x86_64.rpm \
                      https://kojipkgs.fedoraproject.org//packages/nwchem/7.0.0/8.fc33/noarch/nwchem-common-7.0.0-8.fc33.noarch.rpm \
    && dnf clean all

CMD ["/bin/bash"]

flexiblas:

FROM fedora:33@sha256:5acde95c3653f9412d64616c02cc1cd176a3e09049a6b22f8bf5395c52307d99

RUN set -x \
    && dnf install -y https://kojipkgs.fedoraproject.org//packages/nwchem/7.0.0/11.fc33/x86_64/nwchem-7.0.0-11.fc33.x86_64.rpm \
                      https://kojipkgs.fedoraproject.org//packages/nwchem/7.0.0/11.fc33/x86_64/nwchem-openmpi-7.0.0-11.fc33.x86_64.rpm \
                      https://kojipkgs.fedoraproject.org//packages/nwchem/7.0.0/11.fc33/noarch/nwchem-common-7.0.0-11.fc33.noarch.rpm \
    && dnf clean all

CMD ["/bin/bash"]

After building the docker image (e.g. docker build -t flexiblas .) a test can be executed like this (also by omitting OMP_NUM_THREADS)

time docker run --name flexiblas --rm -it -v "$(pwd):/mnt" flexiblas bash -c '. /etc/profile.d/modules.sh&& module use /usr/share/modulefiles&& module load mpi/openmpi-x86_64&& cd /mnt&& OMP_NUM_THREADS=1 mpiexec --allow-run-as-root -np 1 nwchem_openmpi h2.nw > h2.out'

After we confirm the performance degradation, should we export OMP_NUM_THREADS=1 in nwchem's /etc/profile.d/nwchem.[c]sh files, https://src.fedoraproject.org/rpms/nwchem/blob/982d2e4ddf70d299f0a8164c72f94259f715eec8/f/nwchem.spec#_283?
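
A minimal sketch of what such a profile script could contain, for the sh flavour only. The snippet content is an assumption, not the packaged file: in particular, keeping a value the user has already exported (instead of always forcing 1) is a design choice made here for illustration. The demo writes the snippet to a temporary file rather than /etc/profile.d.

```shell
# Hypothetical content for an sh-style profile script such as /etc/profile.d/nwchem.sh.
# Assumption: keep a value the user already set, otherwise default to 1.
profile_snippet=$(mktemp)
cat > "$profile_snippet" <<'EOF'
OMP_NUM_THREADS="${OMP_NUM_THREADS:-1}"
export OMP_NUM_THREADS
EOF

# Case 1: nothing set beforehand, so the default applies.
( unset OMP_NUM_THREADS; . "$profile_snippet"; echo "unset before: $OMP_NUM_THREADS" )

# Case 2: the user asked for 4 threads, so their choice is kept.
( OMP_NUM_THREADS=4; . "$profile_snippet"; echo "preset to 4: $OMP_NUM_THREADS" )
```

A csh variant would need `setenv` and is not shown.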

edoapra commented 4 years ago

@marcindulak thank you very much for providing these dockerfiles. The input file that can be used is this one: https://raw.githubusercontent.com/nwchemgit/nwchem/master/web/benchmarks/dft/siosi3.nw

This is the slightly modified command that I used on a quad-core computer, running three processes via mpirun:

/usr/bin/time -p docker run  --rm -it -v "$(pwd):/mnt" fedora33.flexiblas bash -c '. /etc/profile.d/modules.sh&& module use /usr/share/modulefiles&& module load mpi/openmpi-x86_64&& cd /mnt&& OMP_NUM_THREADS=1 mpiexec --allow-run-as-root -np 3 nwchem_openmpi siosi3.nw'

Wall timings in seconds:

| BLAS | OMP_NUM_THREADS=1 | OMP_NUM_THREADS not specified |
| --- | --- | --- |
| OpenBLAS | 43.4 | 43.4 |
| FlexiBLAS | 43.5 | 466.9 |

As you can see, you get a 10x slow-down when FlexiBLAS is used and OMP_NUM_THREADS is not set.
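
For reference, the slow-down factor follows directly from the two FlexiBLAS timings above. The variable names in this small sketch are illustrative only.

```shell
# Wall times (seconds) copied from the FlexiBLAS row of the timings above.
flexiblas_pinned=43.5     # OMP_NUM_THREADS=1
flexiblas_default=466.9   # OMP_NUM_THREADS not specified

slowdown=$(awk -v a="$flexiblas_default" -v b="$flexiblas_pinned" \
    'BEGIN { printf "%.1f", a / b }')
echo "FlexiBLAS slow-down without OMP_NUM_THREADS: ${slowdown}x"
```

This prints a factor of 10.7, consistent with the "10x" figure quoted above.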

My recommendation is in line with what is expressed in https://www.spinics.net/lists/fedora-devel/msg275547.html : as far as maintaining an effective NWChem package is concerned, I don't really see any advantage in using FlexiBLAS. Instead, I am afraid the performance degradation might be only the first of the maintenance issues related to the use of FlexiBLAS in NWChem. Can't we stick to OpenBLAS? Is it going to be dropped by Fedora?

edoapra commented 4 years ago

NWChem version 7.0.2 is out. Updated nwchem.spec in the develop branch: https://github.com/edoapra/fedpkg/commit/e423cafea27f2925ef7e1b9771370057c8731a72

marcindulak commented 4 years ago

nwchem https://github.com/nwchemgit/nwchem/commit/55e74b08d431e19b762c088ae19ec3d3be5a4efe

It looks like there are some problems on fedora's 32-bit archs.

epel6 OK on the default archs i686, x86_64 with openmpi https://koji.fedoraproject.org/koji/taskinfo?taskID=53738634 . mpich runs crash with "gethostbyname failed". el6 looks good despite dropped support.

epel7 OK on the default archs x86_64 ppc64le https://koji.fedoraproject.org/koji/taskinfo?taskID=53711308

epel8 OK on the default archs aarch64 ppc64le x86_64 https://koji.fedoraproject.org/koji/taskinfo?taskID=53710623

f32 tests fail on the 32-bit platforms armv7hl i686 https://koji.fedoraproject.org/koji/taskinfo?taskID=53711619. I've tried to run with NPROC=1 on the 32-bit platforms, but the tests still failed.

f32 OK on x86_64 aarch64 ppc64le, same link as above

In addition to the test failures there is also this error on (probably) all platforms, but it's harmless I guess

grep -B 2 Error build.log | tail -3
rm -f -f *.o *.a *.mod *__genmod.f90 *core *stamp *trace mputil.mp* *events* *ipo *optrpt
rm: cannot remove 'paw_core': Is a directory
make[2]: [../../config/makelib.h:299: clean] Error 1 (ignored)
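
The error can be reproduced outside the build with a small sketch: the `*core` glob in the clean rule also matches a directory named `paw_core`, and `rm -f` without `-r` refuses to delete directories. The file names other than `paw_core` are made up for the demo. The "(ignored)" in the make output means make was told to ignore the failure of that recipe line, which supports the guess that it is harmless.

```shell
# Scratch directory so the demo does not touch a real source tree.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir paw_core            # a directory whose name happens to end in "core"
touch basis.o core        # ordinary files the clean rule is meant to delete

# Same shape as the rm call in the log; error text is discarded here.
rm_status=0
rm -f -f *.o *core 2>/dev/null || rm_status=$?

echo "rm exit status: $rm_status"
ls
```

The regular files are removed, the directory survives, and `rm` reports a non-zero status, which is what make then ignores.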

rawhide (f34, which uses flexiblas) behaves as f32 https://koji.fedoraproject.org/koji/taskinfo?taskID=53710481

I'm using flexiblas on fedora >= 33, and added OMP_NUM_THREADS=1 to the /etc/profile.d/nwchem.*sh scripts. I think it's better to follow what fedora does.

edoapra commented 4 years ago

Please move to the next commit on hotfix/release-7-0-0 https://github.com/nwchemgit/nwchem/commit/5d4a0e84c8f8d9656a0ac37e796a9a4eff8c5ad9

The current develop branch of https://github.com/edoapra/fedpkg builds correctly. I am not going to commit the Flexiblas change to my repository. Here are the logs of the develop builds:

* https://koji.fedoraproject.org/koji/taskinfo?taskID=53601373
* https://koji.fedoraproject.org/koji/taskinfo?taskID=53601873
* https://koji.fedoraproject.org/koji/taskinfo?taskID=53601357
* https://koji.fedoraproject.org/koji/taskinfo?taskID=53601332

edoapra commented 3 years ago

@marcindulak could you remove any mention of nwchem-sw.org and replace it with https://nwchemgit.github.io since nwchem-sw.org has been bought by cyber-squatters? https://github.com/edoapra/fedpkg/commit/05f75cc5d1b1393e9f53508201a574bebbe51bc5

marcindulak commented 3 years ago

I removed the mention of the old nwchem domain from the spec. Switched also from setting OMP_NUM_THREADS=1 in /etc/profile.d/nwchem.[c]sh to setting FLEXIBLAS=openblas-serial. This setting should result in ga also using openblas-serial. https://src.fedoraproject.org/rpms/nwchem/c/7a8cd7c1b2fe5f2cf3d0c9629025dc3b4816aafc?branch=master

The state of the rpm is now that on the next login after you install nwchem, the /etc/profile.d/nwchem.[c]sh scripts will set/export FLEXIBLAS=openblas-serial. Flexiblas has a cli, called flexiblas. This cli does not report the environment variable setting (using flexiblas print), but it can be indirectly verified that it takes the environment variable setting. This is because flexiblas will fallback to the fedora global default openblas-openmp if an inexistent FLEXIBLAS is set, printing a warning.

While testing the build, I experienced some very non-deterministic runs (scf converging vs not-converging) using a modified siosi3.nw example (replaced your explicit basis with STO-3G and removed "memory 450 mb noverify") with mpiexec -np 3 nwchem_openmpi siosi3.nw on a very resource limited f33 virtual machine (512MB RAM, virtual disk, nwchem swapping during the run), but I recall it happened in all 3 cases below:

* `export FLEXIBLAS=openblas-serial`

* `export FLEXIBLAS=openblas-openmp`
  `export OMP_NUM_THREADS=1`

* `export LD_PRELOAD=/lib64/libopenblas.so.0` - I understand this should use openblas-serial bypassing flexiblas

I'm not sure this is something you've ever experienced, but it was happening every few runs. I was removing the created siosi3.* files before every new run.

edoapra commented 3 years ago

Could you upload the siosi3 output files? I have seen some problems with the 7.0.2 rpms myself. The epel8 openmpi rpms cause an early segv on a CentOS 8 VM. I have not looked into the details of the problem yet. The CentOS 8 problems were fixed by updating the VM RPMs.

marcindulak commented 3 years ago

A non-deterministic run example https://github.com/nwchemgit/nwchem/issues/272

edoapra commented 3 years ago

I have just realized that the EPEL OpenBLAS RPMs are built with DYNAMIC_ARCH=1. I was running on a VM on a Skylake-enabled CPU, but VirtualBox does not emulate AVX512 instructions; therefore, any BLAS call was failing. Setting

OPENBLAS_CORETYPE=HASWELL

fixed the problem. Even though DYNAMIC_ARCH=1 seems to have been in place for more than two years according to https://src.fedoraproject.org/rpms/openblas/c/7b9322f323d505dc62d44247a21b0d9905bbcbfd, I don't remember hitting this issue earlier ... One more point to add: Skylake instructions were enabled in OpenBLAS not so long ago and a few bugs have been found: https://github.com/xianyi/OpenBLAS/issues/2168. Do you have any idea of the hardware configuration you are using for https://github.com/nwchemgit/nwchem/issues/272 ?
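
A rough way to apply this workaround only where it seems needed is sketched below. The `suggest_coretype` helper is hypothetical and the rule it encodes (AVX2 visible but AVX512F not visible, so cap at Haswell kernels) is a heuristic assumed for illustration; it is not something OpenBLAS itself prescribes, and on non-Intel CPUs a different core type may be more appropriate.

```shell
# Hypothetical helper: given a CPU flags string, suggest an OPENBLAS_CORETYPE.
# Prints HASWELL when AVX2 is present but AVX512F is not; prints nothing otherwise.
suggest_coretype() {
    case " $1 " in
        *" avx512f "*) ;;                 # AVX512 visible: leave auto-detection alone
        *" avx2 "*)    echo HASWELL ;;    # AVX2 only: cap at the Haswell kernels
        *)             ;;                 # older CPU: leave auto-detection alone
    esac
}

# On Linux the flags come from /proc/cpuinfo; elsewhere this yields an empty string.
flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
coretype=$(suggest_coretype "$flags")
if [ -n "$coretype" ]; then
    export OPENBLAS_CORETYPE="$coretype"
    echo "OPENBLAS_CORETYPE=$OPENBLAS_CORETYPE"
else
    echo "leaving OPENBLAS_CORETYPE unset"
fi
```

Inside a VirtualBox guest on a Skylake host, the guest's flags would lack `avx512f`, so this sketch would select HASWELL, matching the manual setting above.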

edoapra commented 3 years ago

Python 3.10 updates in https://github.com/edoapra/fedpkg/commit/608566800dd6c1aaac0a1d143b2d646b6e030883

marcindulak commented 1 year ago

I think we can also close this old issue.

edoapra commented 1 year ago

> I removed the mention of the old nwchem domain from the spec. Switched also from setting OMP_NUM_THREADS=1 in /etc/profile.d/nwchem.[c]sh to setting FLEXIBLAS=openblas-serial. This setting should result in ga also using openblas-serial. https://src.fedoraproject.org/rpms/nwchem/c/7a8cd7c1b2fe5f2cf3d0c9629025dc3b4816aafc?branch=master
>
> The state of the rpm is now that on the next login after you install nwchem, the /etc/profile.d/nwchem.[c]sh scripts will set/export FLEXIBLAS=openblas-serial. Flexiblas has a cli, called flexiblas. This cli does not report the environment variable setting (using flexiblas print), but it can be indirectly verified that it takes the environment variable setting. This is because flexiblas will fallback to the fedora global default openblas-openmp if an inexistent FLEXIBLAS is set, printing a warning.
>
> While testing the build, I experienced some very non-deterministic runs (scf converging vs not-converging) using a modified siosi3.nw example (replaced your explicit basis with STO-3G and removed "memory 450 mb noverify") with mpiexec -np 3 nwchem_openmpi siosi3.nw on a very resource limited f33 virtual machine (512MB RAM, virtual disk, nwchem swapping during the run), but I recall it happened in all 3 cases below:
>
> * `export FLEXIBLAS=openblas-serial`
>
> * `export FLEXIBLAS=openblas-openmp`
>   `export OMP_NUM_THREADS=1`
>
> * `export LD_PRELOAD=/lib64/libopenblas.so.0` - I understand this should use openblas-serial bypassing flexiblas
>
> I'm not sure this is something you've ever experienced, but it was happening every few runs. I was removing the created siosi3.* files before every new run.

Setting OMP_NUM_THREADS=1 should no longer be needed, since from 7.0.2 onwards we use util_blas_set_num_threads() to set the number of threads in threaded BLAS/LAPACK calls (and most of the time we call util_blas_set_num_threads(1)). This should work with flexiblas, too. https://github.com/nwchemgit/nwchem/blob/master/src/util/util_blasthreads.F

marcindulak commented 1 year ago

OK, will unset it at the next build. I actually added OMP_NUM_THREADS=1 recently https://src.fedoraproject.org/rpms/nwchem/c/39fc216ec8857b560436608db4c3524cfdc6353d?branch=rawhide when switching to flexiblas-openblas-openmp, since flexiblas-openblas-serial is not available in CentOS Stream 9 https://bugzilla.redhat.com/show_bug.cgi?id=2182460