Closed outpaddling closed 6 years ago
Okay, I'll take a look at the functions which execute at or before line 2692 (literally nothing but variable initializations).
I can make a wild guess as to what might be happening in the FreeBSD crash: the vecaligned_malloc() call on line 615 of plink2_common.cc is not functioning as intended due to some sort of incorrect assumption. What kind of system are you running on? Is there e.g. an equivalent EC2 image I can use to replicate the issue?
The location of the CentOS crash makes less sense to me.
If you recompile plink2 as a pure C program, do the crashes remain?
After a bit more thought, I do have a guess regarding the CentOS crash: the compiler generated some sort of aligned-move instruction, and max_corr's address was not adequately aligned.
The current codebase has a bunch of "#ifdef arm #error [comment] #endif" blocks marking representative code that I need to revisit to support platforms with stricter alignment requirements than x86. I expect that to take several weeks of dedicated work. But I'm guessing that you're still running on x86 and we just need to figure out what additional things your compilers expect?
I found that if I build on FreeBSD without optimizations, it does not crash. That to me further suggests data corruption, as moving variables around can alter the impact of a stray subscript, etc. Doing the same on CentOS doesn't help, though. If you'd like to replicate the FreeBSD build, probably the easiest way would be using my VM image: http://www.peregrine.hpc.uwm.edu/vm.php After importing the OVA, su to root and run the following:
pkg install gcc6 openblas # Just to be sure you don't build them from source as dependencies
wip-update
cd /usr/ports/wip/plink-ng
make install
plink2
We're not in a hurry on this, I'm actually getting a little ahead for once by working on 2.0-alpha packages. BTW, I'm using clang 4.0.0 on FreeBSD and gcc 5.5 on CentOS. I'll try to look into your other suggestions tomorrow. Thanks...
Thanks! There's probably something I'm doing that my usual compilers interpret in the way I expect, but is formally undefined behavior that I should purge from my code. Will look at both VMs.
On the CentOS side, we use pkgsrc for bleeding-edge open source deployment. Instructions for bootstrapping pkgsrc are here: http://uwm.edu/hpc/software-management/ My CentOS VMs have two pkgsrc trees preinstalled. /usr/pkg is for system tools, /sharedapps/pkg-2018Q1 is for scientific software. To replicate the issue I'm seeing on CentOS 7: As root:
module load /sharedapps/pkg-2018Q1/etc/modulefiles/pkgsrc/2018Q1
cd /sharedapps/pkgsrc-2018Q1/wip
git pull
cd plink2
bmake install
plink2
Thanks again for looking into this.
Hrm, the FreeBSD build is actually working fine in the VM on my Mac?! Will retry this on an old Windows 8 box.
My revised guess is that you're running on a machine without AVX2 support, and should be using (and editing) the Makefile under build_dynamic/ . The Makefile in the main source directory is really there for development convenience, and it's currently configured to produce builds that require AVX2 instructions.
You are correct. In hindsight, "illegal instruction" should have prompted me to look closer at the compiler flags. I need a loooong vacation... Changing -mavx2 to -mavx was enough for our dev server. Have you actually measured how much difference these specific optimizations make to total run time?
What I could do here is add port/package options so the default build is portable, but when someone builds and installs from source, they could select additional optimization flags.
Thanks!
Suggestion for posterity:
--- Makefile.orig 2018-05-03 22:24:28 UTC
+++ Makefile
@@ -1,3 +1,10 @@
+
+###########################################################################
+# Note: This Makefile is intended for development. For production
+# builds, use build_dynamic/Makefile or edit out assumptions such as
+# -mavx2.
+###########################################################################
+
BASEFLAGS=-mavx2 -mbmi -mbmi2 -mlzcnt -DNDEBUG -DZSTD_MULTITHREAD
# BASEFLAGS=-mavx2 -mbmi -mbmi2 -mlzcnt -DZSTD_MULTITHREAD
# BASEFLAGS=-msse4.2 -DNDEBUG -DZSTD_MULTITHREAD
Yes, the improvement is obviously workload-dependent but there's one significant step up at SSE4.2 and another at AVX2 (this is why the build_dynamic Makefile starts with NO_AVX2 and NO_SSE42 configuration variables).
Have you tried using -march=native to let the compiler detect the underlying CPU and choose optimizations for it? http://gcc.gnu.org/onlinedocs/gcc/x86-Options.html I would think if a compiler has optimizations for a feature like AVX2, it would also have the ability to detect it in the CPU.
Yes, that may make sense in the build_dynamic Makefile. I don't normally use it since I need to produce prebuilt binaries that are reliable across a wide variety of machines, so I want to test the portable AVX2 and basic-64-bit configurations as heavily as possible.
OK, cool. I modified the FreeBSD port so it builds portable binaries by default and has a single configure option that adds -march=native if selected. I'll do the same with the pkgsrc package.
Run "wip-update" in your FreeBSD VM again to see for yourself. I changed the name of the port from plink-ng to plink2, BTW.
So the FreeBSD binary package that will be available once this is committed will be portable, but not optimal.
Assuming the compiler correctly detects the CPU, someone can easily get an build optimized for their CPU following the example below. ( The wip category will eventually be replaced with biology )
<<<ROOT@imacbsd.acadix>>> /home/bacon 1002 # cd /usr/ports/wip/plink2/
<<<ROOT@imacbsd.acadix>>> /usr/ports/wip/plink2 1003 # make config
┌───────────────────────── plink-ng-g20180503 ───────────────────────────┐
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │+[X] NATIVE_CPU Optimize for this CPU (build non-portable binaries)│ │
│ └────────────────────────────────────────────────────────────────────┘ │
├────────────────────────────────────────────────────────────────────────┤
│ < OK > <Cancel> │
└────────────────────────────────────────────────────────────────────────┘
<<<ROOT@imacbsd.acadix>>> /usr/ports/wip/plink2 1007 # make install
[snip]
cc -c -O2 -pipe -DDYNAMIC_ZLIB -I/usr/local/include -march=native -fstack-protector -fno-strict-aliasing -std=gnu99 -DNDEBUG -DZSTD_MULTITHREAD -Wall -Wextra -Wshadow -Wformat-security -Ihtslib -Ilibdeflate -Ilibdeflate/common -Izstd/lib -Izstd/lib/common -Izstd/zlibWrapper -o zstd/lib/decompress/huf_decompress.o zstd/lib/decompress/huf_decompress.c
[snip]
c++ plink2_base.o pgenlib_internal.o pgen_compress.o \
-o bin/pgen_compress
mkdir -p /usr/ports/wip/plink2/work/stage/usr/local/bin
install -c bin/* /usr/ports/wip/plink2/work/stage/usr/local/bin
strip /usr/ports/wip/plink2/work/stage/usr/local/bin/*
gmake[2]: Leaving directory '/usr/ports/wip/plink2/work/plink-ng-e81056a32d67cd73a69579e2dbb374d16196f305/2.0'
====> Compressing man pages (compress-man)
====> Running Q/A tests (stage-qa)
===> Installing for plink-ng-g20180503
===> Checking if plink-ng already installed
===> Registering installation for plink-ng-g20180503
Installing plink-ng-g20180503...
Do you have a test case that would be good for comparative benchmarking? I.e. something that will take around 10 minutes to run and show significant different speedup from SSE and AVX? Thanks.
Setup: plink2 --dummy 9999 999999 --out testdata
Benchmark #1: --keep cat testdata.psam | cut -f 1 | tail -n +2 | sort -R | head -n 6666 > id_subset.txt plink2 --pfile testdata --keep id_subset.txt --make-pgen SSE2 (times on 2017 Macbook Pro): real 0m6.892s user 0m14.533s sys 0m2.049s AVX2: real 0m2.756s user 0m1.948s sys 0m1.845s
Benchmark #2: KING-robust plink2 --pfile testdata --make-king SSE2: real 9m57.518s user 66m38.673s sys 0m5.006s AVX2: real 6m47.591s user 44m10.753s sys 0m5.599s
fyi, I've updated the SSE4.2 and AVX2 builds to print a clear error message when the necessary processor support is unavailable (CPU_CHECK in the build_dynamic Makefile). It'll take some work to get this to play well with -march=native, though.
Not a bad idea. I have some other suggestions for the build system, but will come back to that later.
Thanks for the benchmark instructions, just what I needed. Below are some data from our new development server, a PowerEdge R440 with 24 cores (48 hyperthreads), 128GiB RAM, and the script used to generate them.
Interesting highlights:
Generic build with -O2 performed the same as the binary from your website (as expected) -march=native performed the same as -mavx2 ..., about 30% faster than the generic build -O3 build was significantly SLOWER than -O2 (not that surprising, I've seen unstable results from -O3 before) 24 threads ran slightly faster than 48 (hyperthreading is a hair beyond useless here) plink2 scales well from 12 to 24 threads BLAS implementation appears to be irrelevant (apparently not a bottleneck for plink2)
------------------------------------------------------------------------------
Linux unixdev2.hpc bacon ~/Plink2-bench 1016: uname -a
Linux unixdev2.hpc.uwm.edu 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Linux unixdev2.hpc bacon ~/Plink2-bench 1017: /sharedapps/pkg-2018Q1/gcc5/bin/gcc --version
gcc (GCC) 5.5.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Linux unixdev2.hpc bacon ~/Plink2-bench 1037: more /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
stepping : 4
microcode : 0x200002c
cpu MHz : 2300.000
cache size : 16896 KB
physical id : 0
siblings : 24
core id : 0
cpu cores : 12
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdt
scp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_ts
c aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer
aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_s
ingle intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hl
e avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap
clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_
occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
bogomips : 4600.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
------------------------------------------------------------------------------
# Binary downloaded from https://www.cog-genomics.org/plink/2.0/
plink2-bench-binary-atlas:real 0m6.128s
plink2-bench-binary-atlas-user 0m13.684s
--
plink2-bench-binary-atlas:real 2m6.290s
plink2-bench-binary-atlas-user 88m19.458s
--
plink2-bench-binary-atlas:real 0m6.691s
plink2-bench-binary-atlas-user 0m14.029s
--
plink2-bench-binary-atlas:real 2m5.016s
plink2-bench-binary-atlas-user 88m8.531s
--
plink2-bench-binary-atlas:real 0m6.043s
plink2-bench-binary-atlas-user 0m13.620s
--
plink2-bench-binary-atlas:real 2m5.998s
plink2-bench-binary-atlas-user 87m56.492s
------------------------------------------------------------------------------
# Built from source with -O2, OpenBLAS
plink2-bench-O2-openblas:real 0m5.874s
plink2-bench-O2-openblas-user 0m13.285s
--
plink2-bench-O2-openblas:real 2m4.042s
plink2-bench-O2-openblas-user 87m7.210s
--
plink2-bench-O2-openblas:real 0m8.107s
plink2-bench-O2-openblas-user 0m13.931s
--
plink2-bench-O2-openblas:real 2m4.240s
plink2-bench-O2-openblas-user 87m4.180s
--
plink2-bench-O2-openblas:real 0m7.976s
plink2-bench-O2-openblas-user 0m13.198s
--
plink2-bench-O2-openblas:real 2m7.515s
plink2-bench-O2-openblas-user 86m56.461s
------------------------------------------------------------------------------
# Built from source with -O2 -march=native, OpenBLAS
plink2-bench-O2-native-openblas:real 0m2.509s
plink2-bench-O2-native-openblas-user 0m2.470s
--
plink2-bench-O2-native-openblas:real 1m29.620s
plink2-bench-O2-native-openblas-user 60m54.906s
--
plink2-bench-O2-native-openblas:real 0m2.719s
plink2-bench-O2-native-openblas-user 0m2.453s
--
plink2-bench-O2-native-openblas:real 1m30.367s
plink2-bench-O2-native-openblas-user 60m51.870s
--
plink2-bench-O2-native-openblas:real 0m4.336s
plink2-bench-O2-native-openblas-user 0m2.661s
--
plink2-bench-O2-native-openblas:real 1m29.677s
plink2-bench-O2-native-openblas-user 60m55.924s
------------------------------------------------------------------------------
# Built from source with -O3 -march=native, OpenBLAS
plink2-bench-O3-native-openblas:real 0m2.488s
plink2-bench-O3-native-openblas-user 0m2.438s
--
plink2-bench-O3-native-openblas:real 1m43.165s
plink2-bench-O3-native-openblas-user 71m56.944s
--
plink2-bench-O3-native-openblas:real 0m2.982s
plink2-bench-O3-native-openblas-user 0m2.448s
--
plink2-bench-O3-native-openblas:real 1m45.404s
plink2-bench-O3-native-openblas-user 71m38.413s
--
plink2-bench-O3-native-openblas:real 0m3.292s
plink2-bench-O3-native-openblas-user 0m2.527s
--
plink2-bench-O3-native-openblas:real 1m44.893s
plink2-bench-O3-native-openblas-user 71m49.356s
------------------------------------------------------------------------------
# Built from source with -O2 -mavx2 -mbmi -mbmi2 -mlzcnt
plink2-bench-O2-avx2-openblas:real 0m2.468s
plink2-bench-O2-avx2-openblas-user 0m2.341s
--
plink2-bench-O2-avx2-openblas:real 1m31.216s
plink2-bench-O2-avx2-openblas-user 60m28.717s
--
plink2-bench-O2-avx2-openblas:real 0m3.268s
plink2-bench-O2-avx2-openblas-user 0m2.416s
--
plink2-bench-O2-avx2-openblas:real 1m29.781s
plink2-bench-O2-avx2-openblas-user 60m35.905s
--
plink2-bench-O2-avx2-openblas:real 0m4.161s
plink2-bench-O2-avx2-openblas-user 0m2.325s
--
plink2-bench-O2-avx2-openblas:real 1m29.367s
plink2-bench-O2-avx2-openblas-user 60m41.462s
------------------------------------------------------------------------------
# Built from source with -O2 -march=native, netlib BLAS
plink2-bench-O2-native-netlib:real 0m2.620s
plink2-bench-O2-native-netlib-user 0m2.458s
--
plink2-bench-O2-native-netlib:real 1m29.347s
plink2-bench-O2-native-netlib-user 61m1.021s
--
plink2-bench-O2-native-netlib:real 0m2.804s
plink2-bench-O2-native-netlib-user 0m2.480s
--
plink2-bench-O2-native-netlib:real 1m30.026s
plink2-bench-O2-native-netlib-user 61m3.396s
--
plink2-bench-O2-native-netlib:real 0m3.813s
plink2-bench-O2-native-netlib-user 0m2.478s
--
plink2-bench-O2-native-netlib:real 1m30.649s
plink2-bench-O2-native-netlib-user 60m46.804s
------------------------------------------------------------------------------
# Built from source with -O2 -march=native, OpenBLAS, limit to 24 threads
plink2-bench-O2-native-openblas-24t:real 0m2.657s
plink2-bench-O2-native-openblas-24t-user 0m2.446s
--
plink2-bench-O2-native-openblas-24t:real 1m28.316s
plink2-bench-O2-native-openblas-24t-user 31m22.039s
--
plink2-bench-O2-native-openblas-24t:real 0m3.151s
plink2-bench-O2-native-openblas-24t-user 0m2.455s
--
plink2-bench-O2-native-openblas-24t:real 1m25.833s
plink2-bench-O2-native-openblas-24t-user 31m20.287s
--
plink2-bench-O2-native-openblas-24t:real 0m3.832s
plink2-bench-O2-native-openblas-24t-user 0m2.479s
--
plink2-bench-O2-native-openblas-24t:real 1m26.969s
plink2-bench-O2-native-openblas-24t-user 31m22.864s
------------------------------------------------------------------------------
# Built from source with -O2 -march=native, OpenBLAS, limit to 12 threads
plink2-bench-O2-native-openblas-12t:real 0m3.015s
plink2-bench-O2-native-openblas-12t-user 0m2.510s
--
plink2-bench-O2-native-openblas-12t:real 2m42.700s
plink2-bench-O2-native-openblas-12t-user 29m8.215s
--
plink2-bench-O2-native-openblas-12t:real 0m3.087s
plink2-bench-O2-native-openblas-12t-user 0m2.478s
--
plink2-bench-O2-native-openblas-12t:real 2m42.694s
plink2-bench-O2-native-openblas-12t-user 29m7.588s
--
plink2-bench-O2-native-openblas-12t:real 0m3.102s
plink2-bench-O2-native-openblas-12t-user 0m2.404s
--
plink2-bench-O2-native-openblas-12t:real 2m41.911s
plink2-bench-O2-native-openblas-12t-user 29m5.689s
------------------------------------------------------------------------------
#!/bin/sh -e
usage()
{
printf "Usage: $0 threads\n"
exit 1
}
##########################################################################
# Main
##########################################################################
if [ $# != 1 ]; then
usage
fi
threads=$1
rm -f plink2.* testdata*
printf "Generating test data...\n"
plink2 --dummy 9999 999999 --out testdata
cat testdata.psam | cut -f 1 | tail -n +2 | sort -R | head -n 6666 > id_subset.txt
which plink2
for trial in 1 2 3; do
printf "\n===========================================================\n"
printf "Running --keep benchmark..."
printf "\n===========================================================\n"
time plink2 --pfile testdata --keep id_subset.txt --make-pgen --threads $threads
printf "\n===========================================================\n"
printf "Running KING-robust benchmark...\n"
printf "\n===========================================================\n"
time plink2 --pfile testdata --make-king --threads $threads
done
BLAS is nearly irrelevant ~90% of the time, but it's really, really important the other ~10%. One critical use case is --pca (it's necessary to use realistic data from e.g. 1000 Genomes here).
Thanks for the info. My FreeBSD port and pkgsrc package are working well now. Looking forward to the 2.0 release...
I see development is moving very fast so if this sort of issue is no surprise right now, just say so. But in case it's helpful to know, I'm seeing illegal instruction traps immediately on startup on FreeBSD 11 and CentOS 7. The crash is happening in different places on the two platforms. My guess would be some sort of prior variable corruption. Commit 3dbb2ba332efe585056ed9b1d3d8c03e870e379e also crashed, but in a different place, which was the same on both platforms. I can retrieve that info as well if it would help.
FreeBSD:
Program terminated with signal SIGILL, Illegal instruction.
0 0x00000000004a9f0a in plink2::InitChrInfo (cip=0x7fffffffa240)
618 uintptr_t alloc_iter = &(cip->chr_mask[BitCtToVecCt(kChrRawEnd) kWordsPerVec]); (gdb) run Starting program: /usr/local/bin/plink2
Program received signal SIGILL, Illegal instruction. 0x00000000004a9f0a in plink2::InitChrInfo (cip=0x7fffffffa1e0) at plink2_common.cc:618 618 uintptr_t alloc_iter = &(cip->chr_mask[BitCtToVecCt(kChrRawEnd) kWordsPerVec]); (gdb) where
0 0x00000000004a9f0a in plink2::InitChrInfo (cip=0x7fffffffa1e0)
1 0x0000000000487b24 in main (argc=1, argv=0x7fffffffe8c0) at plink2.cc:2692
CentOS:
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell: info "(gdb)Auto-loading safe path" [New Thread 0x7ffff4512700 (LWP 22047)] [New Thread 0x7ffff3d11700 (LWP 22048)] [New Thread 0x7ffff1510700 (LWP 22049)]
Program received signal SIGILL, Illegal instruction. plink2::InitGlm (glm_info_ptr=glm_info_ptr@entry=0x7fffffff9e50) at plink2_glm.cc:31 31 glm_info_ptr->max_corr = 0.999; Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7_4.2.x86_64 (gdb) where
0 plink2::InitGlm (glm_info_ptr=glm_info_ptr@entry=0x7fffffff9e50)
1 0x0000000000403ffb in main (argc=1, argv=0x7fffffffe2e8) at plink2.cc:2682