OpenMathLib / OpenBLAS

OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.
http://www.openblas.net
BSD 3-Clause "New" or "Revised" License

Parallel Build Failing on Distributed Filesystem #2973

Closed j-w-jones closed 2 years ago

j-w-jones commented 4 years ago

I'm trying to compile OpenBLAS 0.3.12 using GCC 10.2.0.

When it gets to the linktest part it fails with unresolved symbols. And when I look in the .so file using 'nm' I can see a load of symbols are undefined.

perl ./gensymbol linktest x86_64 _ 0 0 0 0 0 0 "" "" 1 0 1 1 1 1 > linktest.c
gcc -O2 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=80 -DMAX_PARALLEL_NUMBER=1 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.12\" -msse3 -mssse3 -msse4.1 -march=skylake-avx512 -mavx2 -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I.. -shared -o ../libopenblas_skylakexp-r0.3.12.so \
-Wl,--whole-archive ../libopenblas_skylakexp-r0.3.12.a -Wl,--no-whole-archive \
-Wl,-soname,libopenblas.so.0 -lm -lpthread -lgfortran -lm -lpthread -lgfortran
gcc -O2 -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DNO_WARMUP -DMAX_CPU_NUMBER=80 -DMAX_PARALLEL_NUMBER=1 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.12\" -msse3 -mssse3 -msse4.1 -march=skylake-avx512 -mavx2 -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I.. -w -o linktest linktest.c ../libopenblas_skylakexp-r0.3.12.so -L/opt/software/base/gcc/10.2.0/lib/gcc/x86_64-pc-linux-gnu/10.2.0 -L/opt/software/base/gcc/10.2.0/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../lib64 -L/lib/../lib64 -L/usr/lib/../lib64 -L/opt/software/base/gcc/10.2.0/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../.. -lgfortran -lm -lquadmath -lm -lc && echo OK.
/tmp/ccaHbXUr.o: In function `main':
linktest.c:(.text.startup+0xf88): undefined reference to `slagge_'
linktest.c:(.text.startup+0xf8f): undefined reference to `slagsy_'
linktest.c:(.text.startup+0xf96): undefined reference to `slahilb_'
linktest.c:(.text.startup+0xf9d): undefined reference to `slakf2_'
linktest.c:(.text.startup+0xfa4): undefined reference to `slaran_'
linktest.c:(.text.startup+0xfab): undefined reference to `slarge_'
linktest.c:(.text.startup+0xfb2): undefined reference to `slarnd_'

However, if I turn off parallel make then it builds fine.

I have 40 cores in my server, so could there be an issue with the make dependencies causing the dynamic library to be built from the static library before the static library has finished being built?

martin-frbg commented 4 years ago

This is a bit weird, but starting a make in exports while the static build is still running looks like the only explanation. (Wonder why that has not come up before, unless everybody is silently limiting their make to just a handful of parallel jobs). Could you check if adding shared to the .NOTPARALLEL in line 40 of the toplevel Makefile fixes this ?

j-w-jones commented 4 years ago

Hi Martin

I have tried that, but it didn’t make a difference I’m afraid.

I have run ‘nm’ on libopenblas_skylakexp-r0.3.12.so from a sequential build and a parallel build. They’re attached.

You can see that some symbols are undefined in the parallel one, but defined in the sequential one, for example: “slagge_”.

Regards,

Jason

-- Dr Jason W Jones Associate Professor College of Engineering Swansea University Singleton Park Swansea UK SA2 8PP Tel: +44-1792-295869
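
For readers who want to repeat the comparison described above, here is a minimal sketch. It assumes the default nm output of each build was saved to a text file; the file names and the helper name undefined_only_in are hypothetical and not part of the OpenBLAS build.

```shell
#!/bin/sh
# Hypothetical helper: list symbols that are undefined ("U") in the first
# nm listing but defined in the second one.
# Assumes default nm output: "ADDRESS TYPE NAME" for defined symbols and
# "U NAME" (no address column) for undefined ones.
undefined_only_in() (
    tmp=$(mktemp -d) || exit 1
    # Undefined symbols in the first listing.
    awk 'NF == 2 && $1 == "U" { print $2 }' "$1" | LC_ALL=C sort -u > "$tmp/undef"
    # Symbols that carry an address (i.e. are defined) in the second listing.
    awk 'NF == 3 { print $3 }' "$2" | LC_ALL=C sort -u > "$tmp/def"
    # Names present in both files: undefined in one build, defined in the other.
    LC_ALL=C comm -12 "$tmp/undef" "$tmp/def"
    rm -rf "$tmp"
)

# Possible usage (listing file names are placeholders):
#   nm libopenblas_skylakexp-r0.3.12.so > parallel.nm    # in the parallel build tree
#   nm libopenblas_skylakexp-r0.3.12.so > sequential.nm  # in the sequential build tree
#   undefined_only_in parallel.nm sequential.nm
```

If the parallel build really dropped object files from the static archive, the names printed should be the same ones the linktest step reports as undefined references.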

martin-frbg commented 4 years ago

Hi Jason, unfortunately the nm output was not attached. The symbols you mentioned above, slagge, slagsy et al. all belong to TESTING/MATGEN, I'm wondering if we need .NOTPARALLEL for the entire lapack-netlib hierarchy (or at least in the Makefile in lapack-netlib/TESTING/MATGEN). Still strange that this has not come up before - some of the CI jobs run on Epyc or ThunderX hardware, need to check if these use make -j though...

martin-frbg commented 4 years ago

Not reproduced on a 96core ARM server, nor on a 48core AMD Epyc

brada4 commented 4 years ago

A longer build log would be necessary, especially the errors around the slag* functions. Intel makes no 40-core CPU, so I assume it is a 20-core some-lake Platinum?

j-w-jones commented 4 years ago

Hi Andrew

I redid the build in parallel, but in /tmp, which is a normal SAS disk, and it compiled fine.

The original builds I was doing were on a large Lustre filesystem. I know Lustre tends to cache more and flush to the disk far less but that shouldn’t affect any files that have been closed, i.e. the process using them has finished.

It does mean that two processes writing to the same file, even if the actual writes are separated in time, are far less likely to succeed than on a normal disk.

I did try to see if I could get ‘make’ to print out which build steps were in which threads, or maybe print a timestamp for each build, but I couldn’t find anything.

The server has 2 x 20-core Intel Xeon Gold 6230 CPUs.

Cheers,

Jason
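
One possible way to get per-line timestamps on a build log, as asked about above, is to pipe the output of make through a small filter. This is only a sketch: the helper name stamp_lines and the log file name are hypothetical, and the --output-sync option mentioned in the usage comment assumes GNU make 4.0 or newer.

```shell
#!/bin/sh
# Hypothetical helper: prefix every line read from stdin with the wall-clock
# time at which it was read (one-second resolution, portable date format).
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$line"
    done
}

# Possible usage (assumes GNU make >= 4.0 for --output-sync, which groups
# the output of each target's recipe instead of interleaving parallel jobs):
#   make -j40 --output-sync=target 2>&1 | stamp_lines > build.log
```

With --output-sync=target the grouped output is printed when a recipe finishes, so the timestamps approximate completion times rather than start times. That should still be enough to see whether the shared-library link ran before the last archive update.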

martin-frbg commented 4 years ago

That would appear to be a consistency issue with your distributed filesystem then (or perhaps gmake's inability to handle such a case). I still have not seen this with the drone.io CI and whatever fs backend they use, where I believe any trivial cases of missing make dependencies should show up. Is it always the same (small?) set of missing functions you (don't) see?

brada4 commented 4 years ago

In general, make relies on very accurate timestamps; in the case of NFS, and seemingly Lustre, those are set by the backend, whose clock may be off by more than the timestamp resolution. You probably got some warnings about timestamps when building. One solution is to ensure very accurate time synchronisation. /dev/shm is a kind of free filesystem with local timestamps, though mounting /tmp the same way looks more legitimate.
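
The local-disk workaround discussed in this thread (building in /tmp or /dev/shm rather than on the distributed filesystem) could be scripted roughly as follows. This is a sketch under assumptions: the helper name build_in_scratch and the scratch location are hypothetical, and copying the whole tree back afterwards is just one possible policy.

```shell
#!/bin/sh
# Hypothetical helper: copy a source tree to local scratch space, run the
# given build command there, then copy the resulting tree back.
# Usage: build_in_scratch SRC_DIR COMMAND [ARGS...]
build_in_scratch() (
    src=$1
    shift
    # Scratch directory on a local filesystem (TMPDIR, else /tmp).
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/build.XXXXXX") || exit 1
    # "SRC/." copies the directory contents, including hidden files.
    cp -R "$src/." "$scratch/" || exit 1
    if ! ( cd "$scratch" && "$@" ); then
        echo "build failed; tree kept in $scratch" >&2
        exit 1
    fi
    # Copy the built tree back and clean up only on success.
    cp -R "$scratch/." "$src/" && rm -rf "$scratch"
)

# Possible usage (the path is a placeholder):
#   TMPDIR=/dev/shm build_in_scratch /path/to/openblas-0.3.12 make -j40
```

Keeping the scratch tree on failure is a deliberate choice here, so that the partial build can still be inspected.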

j-w-jones commented 4 years ago

Hi Andrew,

The timestamps are fine – make never complains about this. I have had this in the past with NFS mounted filesystems where the client and server clocks were out of sync.

I am wondering if two threads are updating the archive library at similar times (for example within a second of each other) and a normal, local disk deals with it better.

Anyway, I guess you can close this issue. If I do have time to investigate further and find anything I will let you know.

Cheers,

Jason

martin-frbg commented 4 years ago

Unfortunately I do not have access to a lustre-based DFS, so the best option seems to be to mention this topic in the wiki and/or README. (Need to look into Makefile debugging with something like https://github.com/rocky/remake)

brada4 commented 4 years ago

make complains only if it sees source files timestamped in the future, e.g. when you do something CI-style on the cluster: expand a tarball on a node whose clock runs ahead, then compile from the same place on another node, which sees files from an unlikely future. It is not only an NFS problem; sometimes reboots and misguided timezone setups trigger it.

Otherwise you will end up diagnosing this, unwillingly, when building the next software package. Lustre says to use NTP, please ask your admin to check. https://wiki.lustre.org/Operating_System_Configuration_Guidelines_For_Lustre#Date_and_Time_Synchronization_with_NTP

j-w-jones commented 4 years ago

Hi Andrew

I am the system admin and the cluster does use NTP. Lustre would be throwing errors regularly if clocks were not synced.

Regards,

Jason

brada4 commented 4 years ago

You can always provide a native OS package built on a better filesystem. OpenBLAS does not do anything specific with make to push it out of normal operation. NFS with bad times is known to fail; Lustre may or may not fail the same way. A full log (e.g. captured with script) should tell more about the fate of the missing files. Please delete the original message if replying to this via e-mail.

martin-frbg commented 4 years ago

Err, @brada4, can you give this a rest please ?

jamtrott commented 2 years ago

I have been experiencing this issue for parallel builds on a dual-socket AMD Epyc system with 64 cores and a BeeGFS distributed, shared filesystem. It was easily reproducible in my setup. But I tried, as suggested in @martin-frbg's comment, to add a .NOTPARALLEL: line to the file lapack-netlib/LAPACKE/src/Makefile. This appears to have worked, and I no longer see the issue.

Maybe it's not a suitable fix, but at least it can serve as a workaround for now.

martin-frbg commented 2 years ago

Thanks for that report. So you were seeing the problem with symbols from LAPACKE only, while the original poster was missing functions from MATGEN ?

brada4 commented 2 years ago

If you set the NFS server date some 20 seconds into the future, you get that with anything involving any kind of make. Anyone can afford a 1 GB ramdisk for all their build needs.

jamtrott commented 2 years ago

Here is the relevant part of the build log, which shows the list of symbols that are missing. Let me know, and I can include the entire log, if needed.

make[2]: Entering directory '/global/D1/homes/james/ex3modules/defq/1.0.0/src/openblas-0.3.12/exports'
perl ./gensymbol linktest  x86_64 _ 0 0 0 0 0 0 "" "" 1 0 1 1 1 1 > linktest.c
cc -O2 -DMAX_STACK_ALLOC=2048 -DUSE_LOCKING -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DDYNAMIC_ARCH -DNO_WARMUP -DMAX_CPU_NUMBER=256 -DMAX_PARALLEL_NUMBER=1 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.12\" -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I..  -shared -o ../libopenblas-r0.3.12.so \
-Wl,--whole-archive ../libopenblas-r0.3.12.a -Wl,--no-whole-archive \
-Wl,-soname,libopenblas.so.0 -lm -lgfortran -lm -lgfortran
cc -O2 -DMAX_STACK_ALLOC=2048 -DUSE_LOCKING -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DDYNAMIC_ARCH -DNO_WARMUP -DMAX_CPU_NUMBER=256 -DMAX_PARALLEL_NUMBER=1 -DBUILD_SINGLE=1 -DBUILD_DOUBLE=1 -DBUILD_COMPLEX=1 -DBUILD_COMPLEX16=1 -DVERSION=\"0.3.12\" -UASMNAME -UASMFNAME -UNAME -UCNAME -UCHAR_NAME -UCHAR_CNAME -DASMNAME= -DASMFNAME=_ -DNAME=_ -DCNAME= -DCHAR_NAME=\"_\" -DCHAR_CNAME=\"\" -DNO_AFFINITY -I..  -w -o linktest linktest.c ../libopenblas-r0.3.12.so -L/cm/shared/apps/slurm/20.02.7/lib/x86_64-linux-gnu -L/cm/shared/apps/slurm/20.02.7/lib/../lib -L/cm/shared/apps/slurm/20.02.7/lib64/../lib -L/cm/shared/apps/slurm/20.02.7/lib64/../lib -L/usr/lib/gcc/x86_64-linux-gnu/7 -L/usr/lib/gcc/x86_64-linux-gnu/7/../../../../x86_64-linux-gnu/lib/../lib -L/usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu -L/usr/lib/gcc/x86_64-linux-gnu/7/../../../../lib -L/lib/x86_64-linux-gnu -L/lib/../lib -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib -L/cm/shared/apps/slurm/20.02.7/lib -L/cm/shared/apps/slurm/20.02.7/lib64 -L/cm/shared/apps/slurm/20.02.7/lib64/slurm -L/cm/shared/apps/slurm/20.02.7/lib64 -L/usr/lib/gcc/x86_64-linux-gnu/7/../../../../x86_64-linux-gnu/lib -L/usr/lib/gcc/x86_64-linux-gnu/7/../../..  -lgfortran -lm -lquadmath -lm -lc   && echo OK.
/tmp/cc3YEbzE.o: In function `main':
linktest.c:(.text.startup+0xf88): undefined reference to `slagge_'
linktest.c:(.text.startup+0xf8f): undefined reference to `slagsy_'
linktest.c:(.text.startup+0xf96): undefined reference to `slahilb_'
linktest.c:(.text.startup+0xf9d): undefined reference to `slakf2_'
linktest.c:(.text.startup+0xfa4): undefined reference to `slaran_'
linktest.c:(.text.startup+0xfab): undefined reference to `slarge_'
linktest.c:(.text.startup+0xfb2): undefined reference to `slarnd_'
linktest.c:(.text.startup+0xfb9): undefined reference to `slaror_'
linktest.c:(.text.startup+0xfc0): undefined reference to `slarot_'
linktest.c:(.text.startup+0xfc7): undefined reference to `slatm1_'
linktest.c:(.text.startup+0xfce): undefined reference to `slatm2_'
linktest.c:(.text.startup+0xfd5): undefined reference to `slatm3_'
linktest.c:(.text.startup+0xfdc): undefined reference to `slatm5_'
linktest.c:(.text.startup+0xfe3): undefined reference to `slatm6_'
linktest.c:(.text.startup+0xfea): undefined reference to `slatm7_'
linktest.c:(.text.startup+0xff1): undefined reference to `slatme_'
linktest.c:(.text.startup+0xff8): undefined reference to `slatmr_'
linktest.c:(.text.startup+0xfff): undefined reference to `slatms_'
linktest.c:(.text.startup+0x1006): undefined reference to `slatmt_'
linktest.c:(.text.startup+0x1b82): undefined reference to `dlagge_'
linktest.c:(.text.startup+0x1b89): undefined reference to `dlagsy_'
linktest.c:(.text.startup+0x1b90): undefined reference to `dlahilb_'
linktest.c:(.text.startup+0x1b97): undefined reference to `dlakf2_'
linktest.c:(.text.startup+0x1b9e): undefined reference to `dlaran_'
linktest.c:(.text.startup+0x1ba5): undefined reference to `dlarge_'
linktest.c:(.text.startup+0x1bac): undefined reference to `dlarnd_'
linktest.c:(.text.startup+0x1bb3): undefined reference to `dlaror_'
linktest.c:(.text.startup+0x1bba): undefined reference to `dlarot_'
linktest.c:(.text.startup+0x1bc1): undefined reference to `dlatm1_'
linktest.c:(.text.startup+0x1bc8): undefined reference to `dlatm2_'
linktest.c:(.text.startup+0x1bcf): undefined reference to `dlatm3_'
linktest.c:(.text.startup+0x1bd6): undefined reference to `dlatm5_'
linktest.c:(.text.startup+0x1bdd): undefined reference to `dlatm6_'
linktest.c:(.text.startup+0x1be4): undefined reference to `dlatm7_'
linktest.c:(.text.startup+0x1beb): undefined reference to `dlatme_'
linktest.c:(.text.startup+0x1bf2): undefined reference to `dlatmr_'
linktest.c:(.text.startup+0x1bf9): undefined reference to `dlatms_'
linktest.c:(.text.startup+0x1c00): undefined reference to `dlatmt_'
linktest.c:(.text.startup+0x289b): undefined reference to `clagge_'
linktest.c:(.text.startup+0x28a2): undefined reference to `claghe_'
linktest.c:(.text.startup+0x28a9): undefined reference to `clagsy_'
linktest.c:(.text.startup+0x28b0): undefined reference to `clahilb_'
linktest.c:(.text.startup+0x28b7): undefined reference to `clakf2_'
linktest.c:(.text.startup+0x28be): undefined reference to `clarge_'
linktest.c:(.text.startup+0x28c5): undefined reference to `clarnd_'
linktest.c:(.text.startup+0x28cc): undefined reference to `claror_'
linktest.c:(.text.startup+0x28d3): undefined reference to `clarot_'
linktest.c:(.text.startup+0x28da): undefined reference to `clatm1_'
linktest.c:(.text.startup+0x28e1): undefined reference to `clatm2_'
linktest.c:(.text.startup+0x28e8): undefined reference to `clatm3_'
linktest.c:(.text.startup+0x28ef): undefined reference to `clatm5_'
linktest.c:(.text.startup+0x28f6): undefined reference to `clatm6_'
linktest.c:(.text.startup+0x28fd): undefined reference to `clatme_'
linktest.c:(.text.startup+0x2904): undefined reference to `clatmr_'
linktest.c:(.text.startup+0x290b): undefined reference to `clatms_'
linktest.c:(.text.startup+0x2912): undefined reference to `clatmt_'
linktest.c:(.text.startup+0x33d1): undefined reference to `zlagge_'
linktest.c:(.text.startup+0x33d8): undefined reference to `zlaghe_'
linktest.c:(.text.startup+0x33df): undefined reference to `zlagsy_'
linktest.c:(.text.startup+0x33e6): undefined reference to `zlahilb_'
linktest.c:(.text.startup+0x33ed): undefined reference to `zlakf2_'
linktest.c:(.text.startup+0x33f4): undefined reference to `zlarge_'
linktest.c:(.text.startup+0x33fb): undefined reference to `zlarnd_'
linktest.c:(.text.startup+0x3402): undefined reference to `zlaror_'
linktest.c:(.text.startup+0x3409): undefined reference to `zlarot_'
linktest.c:(.text.startup+0x3410): undefined reference to `zlatm1_'
linktest.c:(.text.startup+0x3417): undefined reference to `zlatm2_'
linktest.c:(.text.startup+0x341e): undefined reference to `zlatm3_'
linktest.c:(.text.startup+0x3425): undefined reference to `zlatm5_'
linktest.c:(.text.startup+0x342c): undefined reference to `zlatm6_'
linktest.c:(.text.startup+0x3433): undefined reference to `zlatme_'
linktest.c:(.text.startup+0x343a): undefined reference to `zlatmr_'
linktest.c:(.text.startup+0x3441): undefined reference to `zlatms_'
linktest.c:(.text.startup+0x3448): undefined reference to `zlatmt_'
collect2: error: ld returned 1 exit status
Makefile:181: recipe for target '../libopenblas-r0.3.12.so' failed
make[2]: *** [../libopenblas-r0.3.12.so] Error 1
make[2]: Leaving directory '/global/D1/homes/james/ex3modules/defq/1.0.0/src/openblas-0.3.12/exports'
Makefile:116: recipe for target 'shared' failed
make[1]: *** [shared] Error 2
make[1]: Leaving directory '/global/D1/homes/james/ex3modules/defq/1.0.0/src/openblas-0.3.12'
makefiles/openblas-0.3.12.mk:53: recipe for target '/global/D1/homes/james/ex3modules/defq/1.0.0/pkgs/openblas-0.3.12/.pkgbuild' failed
make: *** [/global/D1/homes/james/ex3modules/defq/1.0.0/pkgs/openblas-0.3.12/.pkgbuild] Error 2

martin-frbg commented 2 years ago

Hm, that looks a lot like MATGEN (same as original post) so any addition to the LAPACKE Makefile may have been coincidental (or just enough to reduce pressure on the filesystem as a side effect)...

martin-frbg commented 2 years ago

... (or was that just a Freudian slip, and you actually edited TESTING/MATGEN/Makefile but were thinking about LAPACKE when you wrote your comment ? MATGEN/Makefile was what I suggested back then, and "fixing" just that Makefile would have much less impact on build times on unaffected systems)

brada4 commented 2 years ago

Please attach the entire log; there should be complaints from make about file dates being in the future. Just grep -i future in that output.
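
The check suggested here can be sketched as a small filter over a saved build log. The helper name find_clock_warnings is hypothetical; the two patterns correspond to the wording GNU make uses for its clock-skew and future-modification-time warnings.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) the lines of a saved build
# log that contain GNU make's clock-related warnings. Returns non-zero when
# no such line is found.
find_clock_warnings() {
    grep -inE 'clock skew detected|modification time .* in the future' "$1"
}

# Possible usage (log file name is a placeholder):
#   make -j40 > build.log 2>&1
#   find_clock_warnings build.log || echo "no timestamp warnings in log"
```

An empty result does not prove that the clocks are in sync. It only shows that make itself did not notice any files dated in the future during that build.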

jamtrott commented 2 years ago

> Hm, that looks a lot like MATGEN (same as original post) so any addition to the LAPACKE Makefile may have been coincidental (or just enough to reduce pressure on the filesystem as a side effect)...

I see. I did in fact edit lapack-netlib/LAPACKE/src/Makefile. Maybe you are right that it was only a coincidental fix.

I also tried again with your suggestion of adding .NOTPARALLEL: to lapack-netlib/TESTING/MATGEN/Makefile instead, and so far it appears to have done the trick. That is, I have built a couple of times without observing any issues. (I made sure there was a fairly heavy load on the distributed file system during the builds.)

I have attached the standard output and standard error streams from the failed attempt mentioned in my previous comment. The build command is:

$ make FC=gfortran DYNAMIC_ARCH=1 TARGET=HASWELL USE_THREAD=0 USE_LOCKING=1 USE_OPENMP=0 NUM_THREADS=256 NO_AFFINITY=1

There are no messages about future dates or timestamps.

openblas-0.3.12-stdout.txt openblas-0.3.12-stderr.txt
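
The workaround reported in this comment amounts to adding a .NOTPARALLEL: line to one makefile. A minimal sketch of applying it in a repeatable way is shown below; the helper name add_notparallel is hypothetical, and this is a local patch to the unpacked sources, not something the OpenBLAS build provides.

```shell
#!/bin/sh
# Hypothetical helper: append a ".NOTPARALLEL:" line to a makefile unless one
# is already present. GNU make then runs that makefile's targets serially,
# even when make was started with -j.
add_notparallel() {
    if grep -q '^\.NOTPARALLEL' "$1"; then
        return 0
    fi
    printf '\n.NOTPARALLEL:\n' >> "$1"
}

# Possible usage, from the top of the unpacked OpenBLAS source tree:
#   add_notparallel lapack-netlib/TESTING/MATGEN/Makefile
```

Only the patched makefile is serialised, so the rest of the tree should still build in parallel. That matches the point made earlier in the thread that changing just the MATGEN makefile would have much less impact on build times.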

brada4 commented 2 years ago

What is in pkgs/Makefile ? Like patches? Parameter filters? Any env variables set?