Hi, this looks like a plink2 bug, perhaps in --pmerge-list. If necessary, I will send you a sequence of debug builds to help us get to the bottom of this, though if you can post a group of --pmerge-list input filesets that exhibits the same error (could be a lot smaller than 200643 x 34977), I'll try to reproduce and fix the bug directly from that.
What's the output of "plink2 --pfile [filename prefix] --pgen-info" on each of the --pmerge-list input filesets?
@chrchang Thanks for the quick answer. Unfortunately, I do not know the cause, because the first 44 parts could be merged without any issues. Only the 45th part seems to cause problems.
Also, I cannot share the files with you (privacy-related data...). If you do have access to UK Biobank data yourself, you can easily reproduce the issue. Otherwise, I'm happy to run any debug builds for you :)
The requested pgen-info:
Okay. I brain-farted and meant --validate rather than --pgen-info (though the --pgen-info output isn't totally useless to me); sorry about that.
If both files validate properly, I'll post a debug build for you.
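For anyone following along, the per-fileset validation requested above can be scripted. The following is a minimal sketch, not part of plink2: the helper name `validation_commands` is hypothetical, and it assumes the merge list holds one .pgen/.pvar/.psam prefix per line (only the first token of each line is used).

```python
from pathlib import Path


def validation_commands(mergelist_path, plink2="plink2"):
    """Build one `plink2 --pfile <prefix> --validate` command per merge-list line.

    Assumption: each non-blank line starts with a fileset prefix.
    """
    commands = []
    for line in Path(mergelist_path).read_text().splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        commands.append([plink2, "--pfile", fields[0], "--validate"])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=False)`; checking the return code and the log of each run shows which fileset, if any, fails validation.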
@chrchang Here is the validate output:
First debug build is posted at https://s3.amazonaws.com/plink2-assets/plink2_linux_avx2_20220808a.zip ; or you can build 3f30579 from source.
Try running the failing --pmerge-list command with this build, after adding the --debug flag.
@chrchang Here is the output:
PLINK v2.00a3.5LM AVX2 Intel (8 Aug 2022) www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.log.
Options in effect:
--debug
--memory 8000
--out /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k
--pmerge-list mergelist.txt
--threads 6
Start time: Tue Aug 9 10:43:39 2022
515572 MiB RAM detected; reserving 8000 MiB for main workspace.
Using up to 6 compute threads.
--pmerge-list: 2 filesets specified.
--pmerge-list: 200643 samples present.
--pmerge-list: Merged .psam written to
/s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.psam .
--pmerge-list: 2 .pvar files scanned, headers merged.
Concatenation job detected.
Concatenating... 0/34977 variants complete.DEBUG (MergePgenVariantNoTmpLocked: simple_first_allele_remap branch: failed to read variant 6966
merge_rec_ct=1 write_allele_ct=2 allele_remap_stride=2
Error: .pgen file read failure: File appears to be corrupted.
DEBUG (ConcatPvariantPos): cur_bp=97920982 variant_ct=1 rec_idx_start=0
DEBUG (PmergeConcat): cur_bp > prev_bp branch
End time: Tue Aug 9 10:43:40 2022
Thanks. Next debug build is at https://s3.amazonaws.com/plink2-assets/plink2_linux_avx2_20220809a.zip (or c399fda ). This should be run with the same command.
@chrchang debug nr. 2:
PLINK v2.00a3.5LM AVX2 Intel (9 Aug 2022) www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.log.
Options in effect:
--debug
--memory 8000
--out /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k
--pmerge-list mergelist.txt
--threads 6
Start time: Tue Aug 9 17:26:00 2022
515572 MiB RAM detected; reserving 8000 MiB for main workspace.
Using up to 6 compute threads.
--pmerge-list: 2 filesets specified.
--pmerge-list: 200643 samples present.
--pmerge-list: Merged .psam written to
/s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.psam .
--pmerge-list: 2 .pvar files scanned, headers merged.
Concatenation job detected.
Concatenating... 0/34977 variants complete.DEBUG (PmergeConcat): fileset_idx=0 mr.sample_ct=200643 sample_idx_increasing=0
DEBUG (PmergeConcat): pgfi_alloc addr=7fab2ef7c380 next=7fab2efa4600
DEBUG (PgfiInitPhase2): vrtypes_iter addr=7fab2ef80ada
DEBUG (PmergeConcat): pgr_alloc addr=7fab2efa4600 next=7fab2effeb80
DEBUG (PgrInit): pgr_alloc_iter addr=7fab2effeb80
DEBUG (ReadGenovecSubsetUnsafe): vrtype=4
DEBUG (ReadGenovecSubsetUnsafe): Non-LD InitReadPtrs fail; fread_ptr=7fff70f4f2d7 fread_end=0
DEBUG (MergePgenVariantNoTmpLocked: simple_first_allele_remap branch: failed to read variant 6966
merge_rec_ct=1 write_allele_ct=2 allele_remap_stride=2
Error: .pgen file read failure: File appears to be corrupted.
DEBUG (ConcatPvariantPos): cur_bp=97920982 variant_ct=1 rec_idx_start=0
DEBUG (PmergeConcat): cur_bp > prev_bp branch
End time: Tue Aug 9 17:26:00 2022
Ok. 3rd debug build: https://s3.amazonaws.com/plink2-assets/plink2_linux_avx2_20220809b.zip (or 0ea4a0f ).
@chrchang Nr. 3:
PLINK v2.00a3.5LM AVX2 Intel (9 Aug 2022) www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.log.
Options in effect:
--debug
--memory 8000
--out /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k
--pmerge-list mergelist.txt
--threads 6
Start time: Tue Aug 9 18:40:12 2022
515572 MiB RAM detected; reserving 8000 MiB for main workspace.
Using up to 6 compute threads.
--pmerge-list: 2 filesets specified.
--pmerge-list: 200643 samples present.
--pmerge-list: Merged .psam written to
/s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.psam .
--pmerge-list: 2 .pvar files scanned, headers merged.
Concatenation job detected.
Concatenating... 0/34977 variants complete.DEBUG (PmergeConcat): fileset_idx=0 mr.sample_ct=200643 sample_idx_increasing=0
DEBUG (PmergeConcat): pgfi_alloc addr=7fbabd879380 next=7fbabd8a1600
DEBUG (PgfiInitPhase2): vrtypes_iter addr=7fbabd87dada
var_fpos[6966]=5449872 var_fpos[6967]=5449959
DEBUG (PmergeConcat): pgr_alloc addr=7fbabd8a1600 next=7fbabd8fbb80
DEBUG (PgrInit): pgr_alloc_iter addr=7fbabd8fbb80
DEBUG (InitReadPtrs): fread failed, cur_vrec_width=4294966615, errno=2
var_fpos[6966]=5450640 var_fpos[6967]=5449959 address=7fbabd88b4b8
DEBUG (MergePgenVariantNoTmpLocked: simple_first_allele_remap branch: failed to read variant 6966
merge_rec_ct=1 write_allele_ct=2 allele_remap_stride=2
Error: .pgen file read failure: File appears to be corrupted.
DEBUG (ConcatPvariantPos): cur_bp=97920982 variant_ct=1 rec_idx_start=0
DEBUG (PmergeConcat): cur_bp > prev_bp branch
End time: Tue Aug 9 18:40:13 2022
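A note on the numbers in the log above: `cur_vrec_width=4294966615` is what an unsigned 32-bit subtraction yields once `var_fpos[6966]` has become larger than `var_fpos[6967]`. The sketch below only checks that arithmetic with values copied from the log. That the record width is computed as the difference of adjacent `var_fpos` entries and held in a 32-bit unsigned integer is an assumption, not something confirmed from the plink2 source.

```python
# Values copied from the debug log above.
fpos_initial = 5449872   # var_fpos[6966] right after PgfiInitPhase2
fpos_at_fail = 5450640   # var_fpos[6966] when InitReadPtrs failed
fpos_next = 5449959      # var_fpos[6967], identical in both dumps


def vrec_width_u32(start, end):
    """Difference of two file offsets, wrapped to 32 bits (assumed storage width)."""
    return (end - start) & 0xFFFFFFFF


print(vrec_width_u32(fpos_initial, fpos_next))  # 87
print(vrec_width_u32(fpos_at_fail, fpos_next))  # 4294966615
print(fpos_at_fail - fpos_initial)              # 768
```

The second value matches the logged `cur_vrec_width`, which suggests that the .pgen file itself is fine and that the in-memory `var_fpos[6966]` entry was overwritten between the two dumps.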
Debug build 4: https://s3.amazonaws.com/plink2-assets/plink2_linux_avx2_20220809c.zip (or bff2145 )
Nr. 4:
PLINK v2.00a3.5LM AVX2 Intel (9 Aug 2022) www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.log.
Options in effect:
--debug
--memory 8000
--out /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k
--pmerge-list mergelist.txt
--threads 6
Start time: Tue Aug 9 19:16:15 2022
515572 MiB RAM detected; reserving 8000 MiB for main workspace.
Using up to 6 compute threads.
--pmerge-list: 2 filesets specified.
--pmerge-list: 200643 samples present.
--pmerge-list: Merged .psam written to
/s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.psam .
--pmerge-list: 2 .pvar files scanned, headers merged.
Concatenation job detected.
Concatenating... 0/34977 variants complete.DEBUG (PmergeConcat): fileset_idx=0 sample_ct=200643 sample_idx_increasing=0
DEBUG (PmergeConcat): pgfi_alloc addr=7f454eea7380 next=7f454eecf600
DEBUG (PgfiInitPhase2): vrtypes_iter addr=7f454eeabada
var_fpos[6966]=5449872 var_fpos[6967]=5449959
DEBUG (PmergeConcat): pgr_alloc addr=7f454eecf600 next=7f454ef29b80
DEBUG (PgrInit): pgr_alloc_iter addr=7f454ef29b80
Debug build 5: https://s3.amazonaws.com/plink2-assets/plink2_linux_avx2_20220809d.zip (or 8e828ea )
Nr. 5:
PLINK v2.00a3.5LM AVX2 Intel (9 Aug 2022) www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.log.
Options in effect:
--debug
--memory 8000
--out /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k
--pmerge-list mergelist.txt
--threads 6
Start time: Tue Aug 9 19:27:19 2022
515572 MiB RAM detected; reserving 8000 MiB for main workspace.
Using up to 6 compute threads.
--pmerge-list: 2 filesets specified.
--pmerge-list: 200643 samples present.
--pmerge-list: Merged .psam written to
/s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.psam .
--pmerge-list: 2 .pvar files scanned, headers merged.
Concatenation job detected.
Concatenating... 0/34977 variants complete.DEBUG (PmergeConcat): fileset_idx=0 sample_ct=200643 sample_idx_increasing=0
DEBUG (PmergeConcat): pgfi_alloc addr=7f430da78380 next=7f430daa0600
DEBUG (PgfiInitPhase2): vrtypes_iter addr=7f430da7cada
var_fpos[6966]=5449872 var_fpos[6967]=5449959
DEBUG (PmergeConcat): pgr_alloc addr=7f430daa0600 next=7f430dafab80
DEBUG (PgrInit): pgr_alloc_iter addr=7f430dafab80
ConcatPvariantPos cur_bp=97720149 rec_idx_start=5 after MergePgenVariantNoTmpLocked
ConcatPvariantPos cur_bp=97720149 rec_idx_start=7 before MergePvariant
ConcatPvariantPos cur_bp=97720149 rec_idx_start=7 after MergePvariant
ConcatPvariantPos cur_bp=97720149 rec_idx_start=7 before MergePgenVariantNoTmpLocked
ConcatPvariantPos cur_bp=97720149 rec_idx_start=7 after MergePgenVariantNoTmpLocked
var_fpos corrupt at read_variant_idx=6423 step 5
Debug build 6: https://s3.amazonaws.com/plink2-assets/plink2_linux_avx2_20220809e.zip (or bd8ebf4 )
Nr. 6:
PLINK v2.00a3.5LM AVX2 Intel (9 Aug 2022) www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.log.
Options in effect:
--debug
--memory 8000
--out /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k
--pmerge-list mergelist.txt
--threads 6
Start time: Tue Aug 9 19:59:56 2022
515572 MiB RAM detected; reserving 8000 MiB for main workspace.
Using up to 6 compute threads.
--pmerge-list: 2 filesets specified.
--pmerge-list: 200643 samples present.
--pmerge-list: Merged .psam written to
/s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.psam .
--pmerge-list: 2 .pvar files scanned, headers merged.
Concatenation job detected.
Concatenating... 0/34977 variants complete.DEBUG (PmergeConcat): fileset_idx=0 sample_ct=200643 sample_idx_increasing=0
DEBUG (PmergeConcat): pgfi_alloc addr=7f7fa95c5380 next=7f7fa95ed600
DEBUG (PgfiInitPhase2): vrtypes_iter addr=7f7fa95c9ada
var_fpos[6966]=5449872 var_fpos[6967]=5449959
DEBUG (PmergeConcat): pgr_alloc addr=7f7fa95ed600 next=7f7fa9647b80
DEBUG (PgrInit): pgr_alloc_iter addr=7f7fa9647b80
DEBUG (InitReadPtrs): fp_vidx was 6415, seeking to offset 5000130
DEBUG (InitReadPtrs): fp_vidx was 6418, seeking to offset 5013961
DEBUG (InitReadPtrs): fp_vidx was 6420, seeking to offset 4974035
DEBUG (InitReadPtrs): fp_vidx was 6417, seeking to offset 5014630
DEBUG (InitReadPtrs): fp_vidx was 6422, seeking to offset 5000130
DEBUG (InitReadPtrs): fp_vidx was 6418, seeking to offset 5014314
DEBUG (ConcatPvariantPos): merge_rec_ct=2 allele_ct=2 read_max_allele_ct=2
[1] 1077709 segmentation fault (core dumped) $TMP/plink2 --debug --pmerge-list mergelist.txt --out --threads 6 --memory
Debug build 7: https://s3.amazonaws.com/plink2-assets/plink2_linux_avx2_20220809f.zip (or 347f5f1 )
Nr. 7:
PLINK v2.00a3.5LM AVX2 Intel (9 Aug 2022) www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.log.
Options in effect:
--debug
--memory 8000
--out /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k
--pmerge-list mergelist.txt
--threads 6
Start time: Tue Aug 9 21:09:01 2022
515572 MiB RAM detected; reserving 8000 MiB for main workspace.
Using up to 6 compute threads.
--pmerge-list: 2 filesets specified.
--pmerge-list: 200643 samples present.
--pmerge-list: Merged .psam written to
/s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.psam .
--pmerge-list: 2 .pvar files scanned, headers merged.
Concatenation job detected.
Concatenating... 0/34977 variants complete.DEBUG (PmergeConcat): fileset_idx=0 sample_ct=200643 sample_idx_increasing=0
DEBUG (PmergeConcat): pgfi_alloc addr=7faece451380 next=7faece479600
DEBUG (PgfiInitPhase2): vrtypes_iter addr=7faece455ada
var_fpos[6966]=5449872 var_fpos[6967]=5449959
DEBUG (PmergeConcat): pgr_alloc addr=7faece479600 next=7faece4d3b80
DEBUG (PgrInit): pgr_alloc_iter addr=7faece4d3b80
DEBUG (InitReadPtrs): fp_vidx was 6415, seeking to offset 5000130
DEBUG (InitReadPtrs): fp_vidx was 6418, seeking to offset 5013961
DEBUG (InitReadPtrs): fp_vidx was 6420, seeking to offset 4974035
DEBUG (InitReadPtrs): fp_vidx was 6417, seeking to offset 5014630
DEBUG (InitReadPtrs): fp_vidx was 6422, seeking to offset 5000130
DEBUG (InitReadPtrs): fp_vidx was 6418, seeking to offset 5014314
DEBUG (ConcatPvariantPos): merge_rec_ct=2 allele_ct=2 read_max_allele_ct=2
step 26 rec_idx=1
Debug build 8: https://s3.amazonaws.com/plink2-assets/plink2_linux_avx2_20220809g.zip (or 963670f )
Nr. 8:
PLINK v2.00a3.5LM AVX2 Intel (9 Aug 2022) www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.log.
Options in effect:
--debug
--memory 8000
--out /s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k
--pmerge-list mergelist.txt
--threads 6
Start time: Tue Aug 9 21:27:01 2022
515572 MiB RAM detected; reserving 8000 MiB for main workspace.
Using up to 6 compute threads.
--pmerge-list: 2 filesets specified.
--pmerge-list: 200643 samples present.
--pmerge-list: Merged .psam written to
/s/project/uk_biobank/processed/WES_200K/ukbb_wes_200k.psam .
--pmerge-list: 2 .pvar files scanned, headers merged.
Concatenation job detected.
Concatenating... 0/34977 variants complete.DEBUG (PmergeConcat): fileset_idx=0 sample_ct=200643 sample_idx_increasing=0
DEBUG (PmergeConcat): pgfi_alloc addr=7f2df04c8380 next=7f2df04f0600
DEBUG (PgfiInitPhase2): vrtypes_iter addr=7f2df04ccada
var_fpos[6966]=5449872 var_fpos[6967]=5449959
DEBUG (PmergeConcat): pgr_alloc addr=7f2df04f0600 next=7f2df054ab80
DEBUG (PgrInit): pgr_alloc_iter addr=7f2df054ab80
DEBUG (InitReadPtrs): fp_vidx was 6415, seeking to offset 5000130
DEBUG (InitReadPtrs): fp_vidx was 6418, seeking to offset 5013961
DEBUG (InitReadPtrs): fp_vidx was 6420, seeking to offset 4974035
DEBUG (InitReadPtrs): fp_vidx was 6417, seeking to offset 5014630
DEBUG (InitReadPtrs): fp_vidx was 6422, seeking to offset 5000130
DEBUG (InitReadPtrs): fp_vidx was 6418, seeking to offset 5014314
DEBUG (ConcatPvariantPos): merge_rec_ct=2 allele_ct=2 read_max_allele_ct=2
step 25e rec_idx=1 widx=6270 new_sample_idx=1048772 new_word_idx=32774
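The last debug line is worth spelling out. With `sample_ct=200643`, a sample index of 1048772 is far out of range. The sketch below assumes 2-bit genotypes packed into 64-bit words (32 genotypes per word), which is consistent with the logged pair `new_sample_idx=1048772` / `new_word_idx=32774` but is not confirmed here from the plink2 source.

```python
BITS_PER_WORD = 64                    # assumption: 64-bit build
GENOS_PER_WORD = BITS_PER_WORD // 2   # assumption: 2 bits per genotype

sample_ct = 200643         # from the log
new_sample_idx = 1048772   # from the "step 25e" line

word_ct = -(-sample_ct // GENOS_PER_WORD)             # ceiling division
new_word_idx = new_sample_idx // GENOS_PER_WORD       # word the write would land in
bit_shift = 2 * (new_sample_idx % GENOS_PER_WORD)     # bit offset inside that word

print(word_ct)                  # 6271
print(new_word_idx)             # 32774
print(new_word_idx >= word_ct)  # True
print(bit_shift)                # 8
```

Under these assumptions, a write to word 32774 of a 6271-word buffer lands well past the end of the buffer. The bit offset of 8 is also consistent with the earlier dump, where `var_fpos[6966]` changed by 768, i.e. `3 << 8`.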
Ok, check if this fixes the bug: https://s3.amazonaws.com/plink2-assets/plink2_linux_avx2_20220809h.zip (or 927db87 )
Nice :tada: Thanks a lot @chrchang, this fixes the bug!
Hi, I'm trying to convert the UK Biobank 200k Whole Exome Sequencing dataset to a single plink2 dataset:
The error appears in a single pgen part that was generated like this:
This reproducibly fails with that error. What could be the cause?