chrchang / plink-ng

A comprehensive update to the PLINK association analysis toolset. Beta testing of the first new version (1.90), focused on speed and memory efficiency improvements, is finishing up. Development is now focused on building out support for multiallelic, phased, and dosage data in PLINK 2.0.
https://www.cog-genomics.org/plink/2.0/
414 stars 127 forks source link

Crashes in latest #71

Closed outpaddling closed 6 years ago

outpaddling commented 6 years ago

I see development is moving very fast so if this sort of issue is no surprise right now, just say so. But in case it's helpful to know, I'm seeing illegal instruction traps immediately on startup on FreeBSD 11 and CentOS 7. The crash is happening in different places on the two platforms. My guess would be some sort of prior variable corruption. Commit 3dbb2ba332efe585056ed9b1d3d8c03e870e379e also crashed, but in a different place, which was the same on both platforms. I can retrieve that info as well if it would help.

FreeBSD:

Program terminated with signal SIGILL, Illegal instruction.

0 0x00000000004a9f0a in plink2::InitChrInfo (cip=0x7fffffffa240)

at plink2_common.cc:618

618 uintptr_t alloc_iter = &(cip->chr_mask[BitCtToVecCt(kChrRawEnd) kWordsPerVec]); (gdb) run Starting program: /usr/local/bin/plink2

Program received signal SIGILL, Illegal instruction. 0x00000000004a9f0a in plink2::InitChrInfo (cip=0x7fffffffa1e0) at plink2_common.cc:618 618 uintptr_t alloc_iter = &(cip->chr_mask[BitCtToVecCt(kChrRawEnd) kWordsPerVec]); (gdb) where

0 0x00000000004a9f0a in plink2::InitChrInfo (cip=0x7fffffffa1e0)

at plink2_common.cc:618

1 0x0000000000487b24 in main (argc=1, argv=0x7fffffffe8c0) at plink2.cc:2692

CentOS:

"Auto-loading safe path" section in the GDB manual. E.g., run from the shell: info "(gdb)Auto-loading safe path" [New Thread 0x7ffff4512700 (LWP 22047)] [New Thread 0x7ffff3d11700 (LWP 22048)] [New Thread 0x7ffff1510700 (LWP 22049)]

Program received signal SIGILL, Illegal instruction. plink2::InitGlm (glm_info_ptr=glm_info_ptr@entry=0x7fffffff9e50) at plink2_glm.cc:31 31 glm_info_ptr->max_corr = 0.999; Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7_4.2.x86_64 (gdb) where

0 plink2::InitGlm (glm_info_ptr=glm_info_ptr@entry=0x7fffffff9e50)

at plink2_glm.cc:31

1 0x0000000000403ffb in main (argc=1, argv=0x7fffffffe2e8) at plink2.cc:2682

chrchang commented 6 years ago

Okay, I'll take a look at the functions which execute at or before line 2692 (literally nothing but variable initializations).

chrchang commented 6 years ago

I can make a wild guess as to what might be happening in the FreeBSD crash: the vecaligned_malloc() call on line 615 of plink2_common.cc is not functioning as intended due to some sort of incorrect assumption. What kind of system are you running on? Is there e.g. an equivalent EC2 image I can use to replicate the issue?

The location of the CentOS crash makes less sense to me.

If you recompile plink2 as a pure C program, do the crashes remain?

chrchang commented 6 years ago

After a bit more thought, I do have a guess regarding the CentOS crash: the compiler generated some sort of aligned-move instruction, and max_corr's address was not adequately aligned.

The current codebase has a bunch of "#ifdef arm #error [comment] #endif" blocks marking representative code that I need to revisit to support platforms with stricter alignment requirements than x86. I expect that to take several weeks of dedicated work. But I'm guessing that you're still running on x86 and we just need to figure out what additional things your compilers expect?

outpaddling commented 6 years ago

I found that if I build on FreeBSD without optimizations, it does not crash. That to me further suggests data corruption, as moving variables around can alter the impact of a stray subscript, etc. Doing the same on CentOS doesn't help, though. If you'd like to replicate the FreeBSD build, probably the easiest way would be using my VM image: http://www.peregrine.hpc.uwm.edu/vm.php After importing the OVA, su to root and run the following:

pkg install gcc6 openblas  # Just to be sure you don't build them from source as dependencies
wip-update
cd /usr/ports/wip/plink-ng
make install
plink2

We're not in a hurry on this, I'm actually getting a little ahead for once by working on 2.0-alpha packages. BTW, I'm using clang 4.0.0 on FreeBSD and gcc 5.5 on CentOS. I'll try to look into your other suggestions tomorrow. Thanks...

chrchang commented 6 years ago

Thanks! There's probably something I'm doing that my usual compilers interpret in the way I expect, but is formally undefined behavior that I should purge from my code. Will look at both VMs.

outpaddling commented 6 years ago

On the CentOS side, we use pkgsrc for bleeding-edge open source deployment. Instructions for bootstrapping pkgsrc are here: http://uwm.edu/hpc/software-management/ My CentOS VMs have two pkgsrc trees preinstalled. /usr/pkg is for system tools, /sharedapps/pkg-2018Q1 is for scientific software. To replicate the issue I'm seeing on CentOS 7: As root:

module load /sharedapps/pkg-2018Q1/etc/modulefiles/pkgsrc/2018Q1
cd /sharedapps/pkgsrc-2018Q1/wip
git pull
cd plink2
bmake install
plink2

Thanks again for looking into this.

chrchang commented 6 years ago

Hrm, the FreeBSD build is actually working fine in the VM on my Mac?! Will retry this on an old Windows 8 box.

chrchang commented 6 years ago

My revised guess is that you're running on a machine without AVX2 support, and should be using (and editing) the Makefile under build_dynamic/ . The Makefile in the main source directory is really there for development convenience, and it's currently configured to produce builds that require AVX2 instructions.

outpaddling commented 6 years ago

You are correct. In hindsight, "illegal instruction" should have prompted me to look closer at the compiler flags. I need a loooong vacation... Changing -mavx2 to -mavx was enough for our dev server. Have you actually measured how much difference these specific optimizations make to total run time?

What I could do here is add port/package options so the default build is portable, but when someone builds and installs from source, they could select additional optimization flags.

Thanks!

Suggestion for posterity:

--- Makefile.orig   2018-05-03 22:24:28 UTC
+++ Makefile
@@ -1,3 +1,10 @@
+
+###########################################################################
+# Note: This Makefile is intended for development.  For production
+# builds, use build_dynamic/Makefile or edit out assumptions such as
+# -mavx2.
+###########################################################################
+
 BASEFLAGS=-mavx2 -mbmi -mbmi2 -mlzcnt -DNDEBUG -DZSTD_MULTITHREAD
 # BASEFLAGS=-mavx2 -mbmi -mbmi2 -mlzcnt -DZSTD_MULTITHREAD
 # BASEFLAGS=-msse4.2 -DNDEBUG -DZSTD_MULTITHREAD
chrchang commented 6 years ago

Yes, the improvement is obviously workload-dependent but there's one significant step up at SSE4.2 and another at AVX2 (this is why the build_dynamic Makefile starts with NO_AVX2 and NO_SSE42 configuration variables).

outpaddling commented 6 years ago

Have you tried using -march=native to let the compiler detect the underlying CPU and choose optimizations for it? http://gcc.gnu.org/onlinedocs/gcc/x86-Options.html I would think if a compiler has optimizations for a feature like AVX2, it would also have the ability to detect it in the CPU.

chrchang commented 6 years ago

Yes, that may make sense in the build_dynamic Makefile. I don't normally use it since I need to produce prebuilt binaries that are reliable across a wide variety of machines, so I want to test the portable AVX2 and basic-64-bit configurations as heavily as possible.

outpaddling commented 6 years ago

OK, cool. I modified the FreeBSD port so it builds portable binaries by default and has a single configure option that adds -march=native if selected. I'll do the same with the pkgsrc package.

Run "wip-update" in your FreeBSD VM again to see for yourself. I changed the name of the port from plink-ng to plink2, BTW.

So the FreeBSD binary package that will be available once this is committed will be portable, but not optimal.

Assuming the compiler correctly detects the CPU, someone can easily get an build optimized for their CPU following the example below. ( The wip category will eventually be replaced with biology )

<<<ROOT@imacbsd.acadix>>> /home/bacon 1002 # cd /usr/ports/wip/plink2/
<<<ROOT@imacbsd.acadix>>> /usr/ports/wip/plink2 1003 # make config

  ┌───────────────────────── plink-ng-g20180503 ───────────────────────────┐
  │ ┌────────────────────────────────────────────────────────────────────┐ │  
  │ │+[X] NATIVE_CPU  Optimize for this CPU (build non-portable binaries)│ │  
  │ └────────────────────────────────────────────────────────────────────┘ │  
  ├────────────────────────────────────────────────────────────────────────┤  
  │                     <  OK  >           <Cancel>                        │  
  └────────────────────────────────────────────────────────────────────────┘ 
<<<ROOT@imacbsd.acadix>>> /usr/ports/wip/plink2 1007 # make install
[snip]
cc -c -O2 -pipe  -DDYNAMIC_ZLIB -I/usr/local/include -march=native -fstack-protector -fno-strict-aliasing -std=gnu99 -DNDEBUG -DZSTD_MULTITHREAD -Wall -Wextra -Wshadow -Wformat-security -Ihtslib -Ilibdeflate -Ilibdeflate/common -Izstd/lib -Izstd/lib/common -Izstd/zlibWrapper  -o zstd/lib/decompress/huf_decompress.o zstd/lib/decompress/huf_decompress.c
[snip]
c++ plink2_base.o pgenlib_internal.o pgen_compress.o \
    -o bin/pgen_compress
mkdir -p /usr/ports/wip/plink2/work/stage/usr/local/bin
install -c bin/* /usr/ports/wip/plink2/work/stage/usr/local/bin
strip /usr/ports/wip/plink2/work/stage/usr/local/bin/*
gmake[2]: Leaving directory '/usr/ports/wip/plink2/work/plink-ng-e81056a32d67cd73a69579e2dbb374d16196f305/2.0'
====> Compressing man pages (compress-man)
====> Running Q/A tests (stage-qa)
===>  Installing for plink-ng-g20180503
===>  Checking if plink-ng already installed
===>   Registering installation for plink-ng-g20180503
Installing plink-ng-g20180503...
outpaddling commented 6 years ago

Do you have a test case that would be good for comparative benchmarking? I.e. something that will take around 10 minutes to run and show significant different speedup from SSE and AVX? Thanks.

chrchang commented 6 years ago

Setup: plink2 --dummy 9999 999999 --out testdata

Benchmark #1: --keep cat testdata.psam | cut -f 1 | tail -n +2 | sort -R | head -n 6666 > id_subset.txt plink2 --pfile testdata --keep id_subset.txt --make-pgen SSE2 (times on 2017 Macbook Pro): real 0m6.892s user 0m14.533s sys 0m2.049s AVX2: real 0m2.756s user 0m1.948s sys 0m1.845s

Benchmark #2: KING-robust plink2 --pfile testdata --make-king SSE2: real 9m57.518s user 66m38.673s sys 0m5.006s AVX2: real 6m47.591s user 44m10.753s sys 0m5.599s

chrchang commented 6 years ago

fyi, I've updated the SSE4.2 and AVX2 builds to print a clear error message when the necessary processor support is unavailable (CPU_CHECK in the build_dynamic Makefile). It'll take some work to get this to play well with -march=native, though.

outpaddling commented 6 years ago

Not a bad idea. I have some other suggestions for the build system, but will come back to that later.

Thanks for the benchmark instructions, just what I needed. Below are some data from our new development server, a PowerEdge R440 with 24 cores (48 hyperthreads), 128GiB RAM, and the script used to generate them.

Interesting highlights:

Generic build with -O2 performed the same as the binary from your website (as expected) -march=native performed the same as -mavx2 ..., about 30% faster than the generic build -O3 build was significantly SLOWER than -O2 (not that surprising, I've seen unstable results from -O3 before) 24 threads ran slightly faster than 48 (hyperthreading is a hair beyond useless here) plink2 scales well from 12 to 24 threads BLAS implementation appears to be irrelevant (apparently not a bottleneck for plink2)

------------------------------------------------------------------------------

Linux unixdev2.hpc  bacon ~/Plink2-bench 1016: uname -a
Linux unixdev2.hpc.uwm.edu 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Linux unixdev2.hpc  bacon ~/Plink2-bench 1017: /sharedapps/pkg-2018Q1/gcc5/bin/gcc --version
gcc (GCC) 5.5.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Linux unixdev2.hpc  bacon ~/Plink2-bench 1037: more /proc/cpuinfo 
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 85
model name  : Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
stepping    : 4
microcode   : 0x200002c
cpu MHz     : 2300.000
cache size  : 16896 KB
physical id : 0
siblings    : 24
core id     : 0
cpu cores   : 12
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdt
scp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_ts
c aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 
fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer
 aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_s
ingle intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hl
e avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap
 clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_
occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
bogomips    : 4600.00
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

------------------------------------------------------------------------------

# Binary downloaded from https://www.cog-genomics.org/plink/2.0/

plink2-bench-binary-atlas:real  0m6.128s
plink2-bench-binary-atlas-user  0m13.684s
--
plink2-bench-binary-atlas:real  2m6.290s
plink2-bench-binary-atlas-user  88m19.458s
--
plink2-bench-binary-atlas:real  0m6.691s
plink2-bench-binary-atlas-user  0m14.029s
--
plink2-bench-binary-atlas:real  2m5.016s
plink2-bench-binary-atlas-user  88m8.531s
--
plink2-bench-binary-atlas:real  0m6.043s
plink2-bench-binary-atlas-user  0m13.620s
--
plink2-bench-binary-atlas:real  2m5.998s
plink2-bench-binary-atlas-user  87m56.492s

------------------------------------------------------------------------------

# Built from source with -O2, OpenBLAS

plink2-bench-O2-openblas:real   0m5.874s
plink2-bench-O2-openblas-user   0m13.285s
--
plink2-bench-O2-openblas:real   2m4.042s
plink2-bench-O2-openblas-user   87m7.210s
--
plink2-bench-O2-openblas:real   0m8.107s
plink2-bench-O2-openblas-user   0m13.931s
--
plink2-bench-O2-openblas:real   2m4.240s
plink2-bench-O2-openblas-user   87m4.180s
--
plink2-bench-O2-openblas:real   0m7.976s
plink2-bench-O2-openblas-user   0m13.198s
--
plink2-bench-O2-openblas:real   2m7.515s
plink2-bench-O2-openblas-user   86m56.461s

------------------------------------------------------------------------------

# Built from source with -O2 -march=native, OpenBLAS

plink2-bench-O2-native-openblas:real    0m2.509s
plink2-bench-O2-native-openblas-user    0m2.470s
--
plink2-bench-O2-native-openblas:real    1m29.620s
plink2-bench-O2-native-openblas-user    60m54.906s
--
plink2-bench-O2-native-openblas:real    0m2.719s
plink2-bench-O2-native-openblas-user    0m2.453s
--
plink2-bench-O2-native-openblas:real    1m30.367s
plink2-bench-O2-native-openblas-user    60m51.870s
--
plink2-bench-O2-native-openblas:real    0m4.336s
plink2-bench-O2-native-openblas-user    0m2.661s
--
plink2-bench-O2-native-openblas:real    1m29.677s
plink2-bench-O2-native-openblas-user    60m55.924s

------------------------------------------------------------------------------

# Built from source with -O3 -march=native, OpenBLAS

plink2-bench-O3-native-openblas:real    0m2.488s
plink2-bench-O3-native-openblas-user    0m2.438s
--
plink2-bench-O3-native-openblas:real    1m43.165s
plink2-bench-O3-native-openblas-user    71m56.944s
--
plink2-bench-O3-native-openblas:real    0m2.982s
plink2-bench-O3-native-openblas-user    0m2.448s
--
plink2-bench-O3-native-openblas:real    1m45.404s
plink2-bench-O3-native-openblas-user    71m38.413s
--
plink2-bench-O3-native-openblas:real    0m3.292s
plink2-bench-O3-native-openblas-user    0m2.527s
--
plink2-bench-O3-native-openblas:real    1m44.893s
plink2-bench-O3-native-openblas-user    71m49.356s

------------------------------------------------------------------------------

# Built from source with -O2 -mavx2 -mbmi -mbmi2 -mlzcnt

plink2-bench-O2-avx2-openblas:real  0m2.468s
plink2-bench-O2-avx2-openblas-user  0m2.341s
--
plink2-bench-O2-avx2-openblas:real  1m31.216s
plink2-bench-O2-avx2-openblas-user  60m28.717s
--
plink2-bench-O2-avx2-openblas:real  0m3.268s
plink2-bench-O2-avx2-openblas-user  0m2.416s
--
plink2-bench-O2-avx2-openblas:real  1m29.781s
plink2-bench-O2-avx2-openblas-user  60m35.905s
--
plink2-bench-O2-avx2-openblas:real  0m4.161s
plink2-bench-O2-avx2-openblas-user  0m2.325s
--
plink2-bench-O2-avx2-openblas:real  1m29.367s
plink2-bench-O2-avx2-openblas-user  60m41.462s

------------------------------------------------------------------------------

# Built from source with -O2 -march=native, netlib BLAS

plink2-bench-O2-native-netlib:real  0m2.620s
plink2-bench-O2-native-netlib-user  0m2.458s
--
plink2-bench-O2-native-netlib:real  1m29.347s
plink2-bench-O2-native-netlib-user  61m1.021s
--
plink2-bench-O2-native-netlib:real  0m2.804s
plink2-bench-O2-native-netlib-user  0m2.480s
--
plink2-bench-O2-native-netlib:real  1m30.026s
plink2-bench-O2-native-netlib-user  61m3.396s
--
plink2-bench-O2-native-netlib:real  0m3.813s
plink2-bench-O2-native-netlib-user  0m2.478s
--
plink2-bench-O2-native-netlib:real  1m30.649s
plink2-bench-O2-native-netlib-user  60m46.804s

------------------------------------------------------------------------------

# Built from source with -O2 -march=native, OpenBLAS, limit to 24 threads

plink2-bench-O2-native-openblas-24t:real    0m2.657s
plink2-bench-O2-native-openblas-24t-user    0m2.446s
--
plink2-bench-O2-native-openblas-24t:real    1m28.316s
plink2-bench-O2-native-openblas-24t-user    31m22.039s
--
plink2-bench-O2-native-openblas-24t:real    0m3.151s
plink2-bench-O2-native-openblas-24t-user    0m2.455s
--
plink2-bench-O2-native-openblas-24t:real    1m25.833s
plink2-bench-O2-native-openblas-24t-user    31m20.287s
--
plink2-bench-O2-native-openblas-24t:real    0m3.832s
plink2-bench-O2-native-openblas-24t-user    0m2.479s
--
plink2-bench-O2-native-openblas-24t:real    1m26.969s
plink2-bench-O2-native-openblas-24t-user    31m22.864s

------------------------------------------------------------------------------

# Built from source with -O2 -march=native, OpenBLAS, limit to 12 threads

plink2-bench-O2-native-openblas-12t:real    0m3.015s
plink2-bench-O2-native-openblas-12t-user    0m2.510s
--
plink2-bench-O2-native-openblas-12t:real    2m42.700s
plink2-bench-O2-native-openblas-12t-user    29m8.215s
--
plink2-bench-O2-native-openblas-12t:real    0m3.087s
plink2-bench-O2-native-openblas-12t-user    0m2.478s
--
plink2-bench-O2-native-openblas-12t:real    2m42.694s
plink2-bench-O2-native-openblas-12t-user    29m7.588s
--
plink2-bench-O2-native-openblas-12t:real    0m3.102s
plink2-bench-O2-native-openblas-12t-user    0m2.404s
--
plink2-bench-O2-native-openblas-12t:real    2m41.911s
plink2-bench-O2-native-openblas-12t-user    29m5.689s

------------------------------------------------------------------------------
#!/bin/sh -e

usage()
{
    printf "Usage: $0 threads\n"
    exit 1
}

##########################################################################
#   Main
##########################################################################

if [ $# != 1 ]; then
    usage
fi

threads=$1

rm -f plink2.* testdata*

printf "Generating test data...\n"
plink2 --dummy 9999 999999 --out testdata
cat testdata.psam | cut -f 1 | tail -n +2 | sort -R | head -n 6666 > id_subset.txt
which plink2

for trial in 1 2 3; do
    printf "\n===========================================================\n"
    printf "Running --keep benchmark..."
    printf "\n===========================================================\n"
    time plink2 --pfile testdata --keep id_subset.txt --make-pgen --threads $threads

    printf "\n===========================================================\n"
    printf "Running KING-robust benchmark...\n"
    printf "\n===========================================================\n"
    time plink2 --pfile testdata --make-king --threads $threads
done
chrchang commented 6 years ago

BLAS is nearly irrelevant ~90% of the time, but it's really, really important the other ~10%. One critical use case is --pca (it's necessary to use realistic data from e.g. 1000 Genomes here).

outpaddling commented 6 years ago

Thanks for the info. My FreeBSD port and pkgsrc package are working well now. Looking forward to the 2.0 release...