chrchang / plink-ng

A comprehensive update to the PLINK association analysis toolset. Beta testing of the first new version (1.90), focused on speed and memory efficiency improvements, is finishing up. Development is now focused on building out support for multiallelic, phased, and dosage data in PLINK 2.0.
https://www.cog-genomics.org/plink/2.0/

Seg fault on --make-bed #14

ryanlayer closed this issue 9 years ago

ryanlayer commented 9 years ago

I am trying to make a .bed file from a 1000 Genomes Phase 3 BCF file, and it seg faults after about 42 minutes.

The version is:

$ plink
PLINK v1.90b2c 64-bit (29 Jul 2014)

The command was:

$ plink \
--make-bed \
--bcf ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf \
--out ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink \
--allow-extra-chr

The output was:

PLINK v1.90b2c 64-bit (29 Jul 2014)         https://www.cog-genomics.org/plink2
(C) 2005-2014 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.log.
32054 MB RAM detected; reserving 16027 MB for main workspace.
--bcf: 84739k variants complete.
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink-temporary.bed
+
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink-temporary.bim
+
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink-temporary.fam
written.
84739846 variants loaded from .bim file.
2504 people (0 males, 0 females, 2504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.nosex
.
Using 1 thread (no multithreaded calculations invoked).
Calculating allele frequencies... done.
Total genotyping rate is 0.980324.
84739846 variants and 2504 people pass filters and QC.
Note: No phenotypes present.
--make-bed to
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.bed
+
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.bim
+
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.fam
Segmentation fault (core dumped)

A number of temp files were created; here are their sizes, along with the size of the source BCF:

 129G ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf
  62K ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink-temporary.fam
    0 ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.log
 2.0G ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink-temporary.bim
  50G ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink-temporary.bed
  40K ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.nosex
    0 ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.bed
 2.0G ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.bim

I am happy to provide any other debug info, just let me know what you need.

Thanks, Ryan
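
As a sanity check on the listing above, the size of the temporary .bed file can be compared with what the PLINK 1 binary format predicts. This is a minimal sketch: it assumes the usual variant-major layout (3 magic bytes, then 2 bits per genotype with each variant padded to a whole byte), and the helper name `plink1_bed_size` is made up for illustration. The variant and sample counts are taken from the log.

```python
def plink1_bed_size(n_variants: int, n_samples: int) -> int:
    """Expected size in bytes of a variant-major PLINK 1 .bed file."""
    # 2 bits per genotype, so 4 samples per byte; each variant row is
    # padded up to a whole number of bytes.
    bytes_per_variant = (n_samples + 3) // 4
    # 3 leading magic bytes precede the genotype matrix.
    return 3 + n_variants * bytes_per_variant


if __name__ == "__main__":
    # Counts reported in the log: 84739846 variants, 2504 people.
    size = plink1_bed_size(84_739_846, 2_504)
    print(f"{size} bytes = {size / 2**30:.1f} GiB")
```

That comes to roughly 49.4 GiB, which matches the 50G shown for the -temporary.bed file, so the --bcf import step appears to have written a complete file before the crash.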

chrchang commented 9 years ago

  1. Check whether the latest stable build also segfaults here.
  2. If it does, check whether the --make-bed component fails in isolation, with --bfile [plink-temporary file] --allow-extra-chr --make-bed after the first crash; this should indicate whether the bug is in the --bcf or the --make-bed code.

ryanlayer commented 9 years ago

Latest version has the same result:

$ time ~/src/plink1.90b2t/plink     --make-bed     --bcf ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf     --out ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink     --allow-extra-chr
PLINK v1.90b2t 64-bit (20 Dec 2014)        https://www.cog-genomics.org/plink2
(C) 2005-2014 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.log.
32054 MB RAM detected; reserving 16027 MB for main workspace.
--bcf: 84739k variants complete.
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink-temporary.bed
+
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink-temporary.bim
+
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink-temporary.fam
written.
84739846 variants loaded from .bim file.
2504 people (0 males, 0 females, 2504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.nosex
.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2504 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.980324.
84739846 variants and 2504 people pass filters and QC.
Note: No phenotypes present.
--make-bed to
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.bed
+
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.bim
+
ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink.fam
Segmentation fault (core dumped)

real    43m27.661s
user    37m54.756s
sys     1m6.058s

When I try your suggestion to run --make-bed using --bfile on the tmp file I get another seg fault:

$ time ~/src/plink1.90b2t/plink     --make-bed     --bfile ALL.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.bcf.plink-temporary     --out plink.test     --allow-extra-chr
PLINK v1.90b2t 64-bit (20 Dec 2014)        https://www.cog-genomics.org/plink2
(C) 2005-2014 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink.test.log.
32054 MB RAM detected; reserving 16027 MB for main workspace.
84739846 variants loaded from .bim file.
2504 people (0 males, 0 females, 2504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to plink.test.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 2504 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.980324.
84739846 variants and 2504 people pass filters and QC.
Note: No phenotypes present.
Segmentation fault (core dumped)

real    3m3.203s
user    0m40.631s
sys     0m13.533s

chrchang commented 9 years ago

Thanks, I will try to reproduce and fix this tonight, and let you know if I need any more information to do so.

chrchang commented 9 years ago

The January 8 development build should fix this; let me know if you still have any problems. (This was actually supposed to crash, but with a "sorting of files too large to fit in RAM has not been implemented yet" error message, rather than a segfault. The multipass sorting code was already 90% written, though, so I went ahead and added the last 10%.)
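
To illustrate why this dataset hits the "too large to fit in RAM" case, here is a rough sketch of a multipass reorder. It is not PLINK's actual code: `passes_needed` and `multipass_reorder` are hypothetical names, the "MB" in the log is assumed to mean MiB, and the sort is assumed to need every genotype row of a pass in the workspace at once. The idea is that each pass scans the input sequentially and keeps only the rows that belong in the current output window.

```python
def passes_needed(total_bytes: int, workspace_bytes: int) -> int:
    """Number of workspace-sized windows needed to cover the data."""
    return -(-total_bytes // workspace_bytes)  # ceiling division


def multipass_reorder(records, new_order, window):
    """Reorder `records` while holding at most `window` of them at a time.

    new_order[i] is the input index of the record that belongs at
    output position i.
    """
    out = []
    for start in range(0, len(new_order), window):
        # Map input index -> slot within this pass's output window.
        wanted = {src: slot
                  for slot, src in enumerate(new_order[start:start + window])}
        buf = [None] * len(wanted)
        # One sequential scan of the input per pass.
        for src, rec in enumerate(records):
            slot = wanted.get(src)
            if slot is not None:
                buf[slot] = rec
        out.extend(buf)
    return out


if __name__ == "__main__":
    # Figures from the log: 84739846 variants, 2504 people, 16027 MB workspace.
    bed_bytes = 3 + 84_739_846 * ((2_504 + 3) // 4)
    workspace = 16_027 * 2**20  # assumption: MB here means MiB
    print(passes_needed(bed_bytes, workspace))  # 4 under these assumptions

    print(multipass_reorder(["a", "b", "c", "d", "e"], [3, 0, 4, 1, 2], 2))
```

With a roughly 49 GiB genotype matrix and a 16 GiB workspace, a single in-memory sort cannot work, but a handful of sequential passes can. The real implementation may choose its windows quite differently.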