chrchang / plink-ng

A comprehensive update to the PLINK association analysis toolset. Beta testing of the first new version (1.90), focused on speed and memory efficiency improvements, is finishing up. Development is now focused on building out support for multiallelic, phased, and dosage data in PLINK 2.0.
https://www.cog-genomics.org/plink/2.0/

`Error: Non-concatenating --pmerge[-list] is under development.` #232

Open jacorvar opened 1 year ago

jacorvar commented 1 year ago

Hi,

Running PLINK (v2.00a4LM AVX2 Intel) errors out when merging multiple datasets:

$ ../plink2 --debug --memory 8000 --threads 6 --pmerge-list input_sources.txt --out merged
PLINK v2.00a4LM AVX2 Intel (9 Jan 2023)        www.cog-genomics.org/plink/2.0/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to merged.log.
Options in effect:
  --debug
  --memory 8000
  --out merged
  --pmerge-list input_sources.txt
  --threads 6

Start time: Mon Jan 23 17:06:52 2023
385417 MiB RAM detected; reserving 8000 MiB for main workspace.
Using up to 6 compute threads.
--pmerge-list: 2 filesets specified.
--pmerge-list: 2 samples present.
--pmerge-list: Merged .psam written to merged.psam .
--pmerge-list: 2 .pvar files scanned, headers merged.
Error: Non-concatenating --pmerge[-list] is under development.

Contents of input_sources.txt:

$ cat input_sources.txt 
test3
test4

test3 and test4 have been generated from VCF files:

$ plink2 --vcf ../3.vcf.gz --out test3 --make-pgen
$ plink2 --vcf ../4.vcf.gz --out test4 --make-pgen

I'm a newbie with PLINK and suspect I'm doing something wrong, but after some digging I've found no clue.

System specs: CentOS 7.9, Intel(R) Xeon(R) Silver 4210R

chrchang commented 1 year ago

The error message means exactly what it says: this feature isn't implemented in plink2 yet. ("Concatenating" merge refers to the "bcftools concat" use case, though plink2's behavior differs a bit from bcftools's here.) Use e.g. bcftools or plink 1.9 to merge for now.

myz540 commented 1 year ago

> The error message means exactly what it says: this feature isn't implemented in plink2 yet. ("Concatenating" merge refers to the "bcftools concat" use case, though plink2's behavior differs a bit from bcftools's here.) Use e.g. bcftools or plink 1.9 to merge for now.

Are you sure? As of March 13th, we were able to use plink2 to concatenate datasets.

Here is a log of a working example:

PLINK v2.00a3.7LM AVX2 Intel (24 Oct 2022)
Options in effect:
  --out ukb24068_c5_merged_sample_filtered
  --pfile ukb24068_c5_b1_merged_sample_filtered
  --pmerge-list chr5_list

Hostname: 80b217465abd
Working directory: /home/ubuntu/exome_pgen
Start time: Mon Mar 13 15:49:53 2023

Random number seed: 1678722593
63628 MiB RAM detected; reserving 31814 MiB for main workspace.
Using up to 16 threads (change this with --threads).
--pmerge-list: 19 filesets specified (including main fileset).
--pmerge-list: 422625 samples present.
--pmerge-list: Merged .psam written to ukb24068_c5_merged_sample_filtered.psam
.
--pmerge-list: 19 .pvar files scanned, headers merged.
Concatenation job detected.
Concatenating... 747813/747813 variants complete.
Results written to ukb24068_c5_merged_sample_filtered.pgen +
ukb24068_c5_merged_sample_filtered.pvar .

End time: Mon Mar 13 15:51:11 2023

However, we see this same error for 2 of our chromosomes and are not sure why yet. The same code is run in a loop; the pvar and psam files are made, but the pgen file is not produced. Any ideas?

PLINK v2.00a3.7LM AVX2 Intel (24 Oct 2022)
Options in effect:
  --out ukb24068_c8_merged_sample_filtered
  --pfile ukb24068_c8_b1_merged_sample_filtered
  --pmerge-list chr8_list

Hostname: 80b217465abd
Working directory: /home/ubuntu/exome_pgen
Start time: Mon Mar 13 15:54:05 2023

Random number seed: 1678722845
63628 MiB RAM detected; reserving 31814 MiB for main workspace.
Using up to 16 threads (change this with --threads).
--pmerge-list: 15 filesets specified (including main fileset).
--pmerge-list: 422625 samples present.
--pmerge-list: Merged .psam written to ukb24068_c8_merged_sample_filtered.psam
.
--pmerge-list: 15 .pvar files scanned, headers merged.
Error: Non-concatenating --pmerge-list is under development.

End time: Mon Mar 13 15:54:10 2023

@gulumk for visibility

chrchang commented 1 year ago

When two variants share a position, --pmerge-list uses the --sort-vars setting (https://www.cog-genomics.org/plink/2.0/data#sort_vars ) to determine their output order. In particular, if the end of one .pvar and the beginning of the next have variants at the same position, and their IDs are in the wrong order, --pmerge-list can no longer "concatenate".

I will update the online documentation today to spell this out.
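The condition described above can be sketched as a small check on the two variants that meet at a file boundary. This is a minimal illustration, not PLINK's implementation: `can_concatenate` and `natural_key` are hypothetical helpers, and `natural_key` only approximates a natural-sort comparison of IDs (the --sort-vars documentation describes the actual ordering modes).

```python
import re


def natural_key(variant_id):
    """Split an ID into text and integer runs so that 'rs9' sorts before 'rs10'.

    Assumption: this approximates a natural sort; it is not PLINK's comparator.
    """
    return [(1, int(tok)) if tok.isdecimal() else (0, tok)
            for tok in re.split(r"(\d+)", variant_id) if tok]


def can_concatenate(prev_last, next_first):
    """Check one file boundary.

    Each argument is a (chrom, pos, variant_id) tuple: the last variant of one
    .pvar and the first variant of the next.
    """
    p_chrom, p_pos, p_id = prev_last
    n_chrom, n_pos, n_id = next_first
    if p_chrom != n_chrom:
        return True  # chromosome ordering is not checked in this sketch
    if p_pos != n_pos:
        return p_pos < n_pos
    # Same position: the IDs decide the order.
    return natural_key(p_id) <= natural_key(n_id)


# Same position, IDs in order: fine.
print(can_concatenate(("8", 100, "rs9"), ("8", 100, "rs10")))
# Same position, IDs in the wrong order: no longer a plain concatenation.
print(can_concatenate(("8", 100, "rs10"), ("8", 100, "rs9")))
```

The first call prints `True` and the second prints `False`; the second case corresponds to the situation that triggers the "Non-concatenating" error.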

myz540 commented 1 year ago

> When two variants share a position, --pmerge-list uses the --sort-vars setting (https://www.cog-genomics.org/plink/2.0/data#sort_vars ) to determine their output order. In particular, if the end of one .pvar and the beginning of the next have variants at the same position, and their IDs are in the wrong order, --pmerge-list can no longer "concatenate".
>
> I will update the online documentation today to spell this out.

I see, thank you for the quick reply. Would you say that inspecting the heads and tails of the pvar files is a good place to start? Is this issue strictly due to the pvar file or could issues in the pgen file throw this error as well?

chrchang commented 1 year ago

  1. Yes; if you don't want to resort to exporting to BCF and using "bcftools concat", one option is temporarily editing the offending leading/trailing variant IDs so that they no longer violate --sort-vars order.
  2. No, pgen file contents can't cause this.
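For inspecting the heads and tails of the .pvar files, a short script can pull out the first and last variant of each file so the boundaries can be compared by eye. This is a rough sketch under stated assumptions: the files are uncompressed text (not .pvar.zst), the first three columns are CHROM, POS, and ID, and `boundary_variants` is a hypothetical helper name.

```python
import sys


def boundary_variants(pvar_path):
    """Return the first and last variant of a .pvar as (chrom, pos, id) tuples.

    Assumptions: plain-text file, header lines start with '#', and the first
    three columns are CHROM, POS, ID.
    """
    first = last = None
    with open(pvar_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom, pos, variant_id = line.split()[:3]
            record = (chrom, int(pos), variant_id)
            if first is None:
                first = record
            last = record
    return first, last


if __name__ == "__main__":
    # Pass the .pvar files in the same order as they appear in the merge list.
    for path in sys.argv[1:]:
        head, tail = boundary_variants(path)
        print(path, "first:", head, "last:", tail)
```

Running it over the filesets in merge-list order shows each tail next to the following head, which is where a shared position with out-of-order IDs would appear.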

myz540 commented 1 year ago

Thanks @chrchang, we were able to resolve our issue.

123huynguyen commented 1 year ago

@myz540 Hi Mike, would you mind providing me with your code to address this issue, since I have the same issue as you? I really look forward to receiving your help.

vicentepese commented 1 month ago

Are there any updates on this?

myz540 commented 1 month ago

> @myz540 Hi Mike, would you mind providing me with your code to address this issue, since I have the same issue as you? I really look forward to receiving your help.

Hey @123huynguyen, I would love to help, but this was at an old job, so I no longer have access to the code base or the context required to provide you a solution. I believe the issue was in the sorting: when we inspected the head and tail of the pvar files, we saw that the chunks weren't sorted correctly. I can't be 100% sure that was the issue given how long it's been, but I hope this helps.
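A sorting problem inside a chunk can be looked for with a simple pass over the variant positions. The following is only a sketch of that kind of check, not the original code: `first_out_of_order` is a hypothetical helper, and it looks at positions only (ID order at shared positions is not examined).

```python
def first_out_of_order(records):
    """Find the first position that goes backwards within a chromosome.

    records: iterable of (chrom, pos) tuples in file order.
    Returns the 0-based index of the first record whose position is lower than
    that of the preceding record on the same chromosome, or None if no such
    record exists.
    """
    prev_chrom = prev_pos = None
    for index, (chrom, pos) in enumerate(records):
        if chrom == prev_chrom and pos < prev_pos:
            return index
        prev_chrom, prev_pos = chrom, pos
    return None


# The third record (index 2) steps back from 150 to 120 on chromosome 8.
print(first_out_of_order([("8", 100), ("8", 150), ("8", 120)]))
```

This prints `2`. A result of `None` means positions never decrease within a chromosome, in which case the boundaries between files are the next place to look.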