choishingwan / PRSice

A software package for calculating, applying, evaluating and plotting the results of polygenic risk scores
http://prsice.info
GNU General Public License v3.0

Error: Sample mismatch between bgen and phenotype file! #282

Closed isabelleazimm closed 2 years ago

isabelleazimm commented 3 years ago

Hi there,

I wonder if you can help with an error I am getting using PRSice 2.3.1.e, which I think is to do with using --keep. I have used the exact same script on the full sample with no errors, but it runs very slowly (over 3 days), so I wanted to test my script on a subsample using --keep. I have created a text file of sample IDs to keep (FID and IID columns), and have checked the IDs appear in the same order as the .sample file (though obviously with some missing). Since the run on the full sample hasn't finished I can't say if it works perfectly, but it does at least move beyond this particular error!
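The order check described above can be sketched roughly as follows. This is only an illustration, not part of PRSice: the helper names `read_ids` and `is_ordered_subset` are hypothetical, and it assumes both files are whitespace-delimited with FID and IID in the first two columns, that the keep file has one header line, and that the Oxford-style .sample file has two header lines.

```python
def read_ids(path, header_lines):
    """Return (FID, IID) tuples from a whitespace-delimited file.

    `header_lines` is the number of leading lines to skip; assumed to be 2
    for an Oxford .sample file and 1 for a keep file with a header.
    """
    ids = []
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            fields = line.split()
            if line_number < header_lines or len(fields) < 2:
                continue  # skip headers and blank or short lines
            ids.append((fields[0], fields[1]))
    return ids


def is_ordered_subset(keep_ids, sample_ids):
    """True if every keep ID occurs in sample_ids in the same relative order."""
    remaining = iter(sample_ids)
    # `wanted in remaining` consumes the iterator up to the match, so each
    # later ID has to be found after the position of the previous one.
    return all(wanted in remaining for wanted in keep_ids)
```

For example, `is_ordered_subset(read_ids("cb_response_ids_qc.txt", 1), read_ids("ukb_fullsample.sample", 2))` would return `False` if any kept ID is missing from the .sample file or appears out of order.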

I have copied the log file below (I think the error message included some sample IDs so I have changed them but kept the same format in case that helps), but please let me know if you need any additional information at all.

Thank you very much!

/lustre/home/zcbth29/software/PRSice_linux_2.3.1/PRSice_linux \
    --a1 A1 \
    --a2 A2 \
    --allow-inter \
    --bar-levels 0.001,0.01,0.05,0.1,0.2,0.5,1 \
    --base PGC3_SCZ_wave3_public.v2.tsv \
    --base-info INFO:0.9 \
    --binary-target T \
    --bp BP \
    --chr CHR \
    --clump-kb 250kb \
    --clump-p 1.000000 \
    --clump-r2 0.100000 \
    --extract scz_prs_01nov21.valid \
    --fastscore \
    --keep cb_response_ids_qc.txt \
    --no-regress \
    --num-auto 22 \
    --or \
    --out CBONLY_scz_prs_02nov21 \
    --pvalue P \
    --seed 3708952387 \
    --snp SNP \
    --stat OR \
    --target /lustre/projects/UKBiobank-500K-Full-Release-2018QC/UKB-QC/data_files/eur/C#_ukbb_v3_eur_indiv_variant_qc,ukb_fullsample.sample \
    --thread 36 \
    --type bgen

Initializing Genotype file: /lustre/projects/UKBiobank-500K-Full-Release-2018QC/UKB-QC/data_files/eur/C#_ukbb_v3_eur_indiv_variant_qc (bgen)
With external fam file: ukb_fullsample.sample

Start processing PGC3_SCZ_wave3_public.v2
==================================================

SNP extraction/exclusion list contains 5 columns, will assume first column contains the SNP ID

Base file: PGC3_SCZ_wave3_public.v2.tsv
Header of file is:

CHR SNP BP A1 A2 FRQ_A_67390 FRQ_U_94015 INFO OR SE P ngt Direction HetISqt HetDf HetPVa Nca Nco Neff

7585077 variant(s) observed in base file, with:
2400169 variant(s) excluded based on user input
5184908 total variant(s) included from base file

Loading Genotype info from target
==================================================

Assume phenotype file has header line: FID IID

408480 people (0 male(s), 0 female(s)) observed
131970 founder(s) included

Error: Sample mismatch between bgen and phenotype file! Name in BGEN file is :1234567 and in phentoype file is: 7654321 7654321. Please note that PRSice require the bgen file and the .sample (or phenotype file if sample file is not provided) to have sample in the same order. (We might be able to losen this requirement in future when we have more time)

choishingwan commented 3 years ago

bgen is inherently slow, so the long runtime is unfortunately expected. You can use chromosome 22 to test your script.

As for the error message, PRSice requires a perfect match between the sample information in your sample file and the sample names stored in your bgen. Trying something like --delim _ might work (2.3.5+). I am trying to figure out a way to speed things up, but for now, it is unfortunately going to take a very long time.
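To narrow down where the two orderings diverge, a small check along these lines may help. It is a sketch under stated assumptions, not PRSice's actual matching logic: `first_mismatch` is a hypothetical helper, the bgen sample names are assumed to have been extracted into a list already (by whatever tool you have available), and the bgen name is assumed to be either the IID alone or `FID<delim>IID`.

```python
def first_mismatch(bgen_ids, sample_rows, delim=None):
    """Return (index, bgen_id, (fid, iid)) for the first mismatch, or None.

    bgen_ids:    sample names in bgen order (assumed already extracted).
    sample_rows: (FID, IID) tuples in .sample file order.
    delim:       if given, the bgen name is assumed to be FID + delim + IID;
                 otherwise it is assumed to equal the IID.
    """
    for index, (bgen_id, (fid, iid)) in enumerate(zip(bgen_ids, sample_rows)):
        expected = iid if delim is None else f"{fid}{delim}{iid}"
        if bgen_id != expected:
            return index, bgen_id, (fid, iid)
    if len(bgen_ids) != len(sample_rows):
        # One list ran out first: report the position where it ended.
        return min(len(bgen_ids), len(sample_rows)), None, None
    return None
```

With the placeholder IDs from the log above, `first_mismatch(["1234567"], [("7654321", "7654321")])` reports position 0, which would point to an ordering or ID-format difference rather than a problem with --keep itself.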

isabelleazimm commented 3 years ago

Thank you for getting back so quickly! I have tested the script on Chr22 and it's working well and quickly. I am assuming it isn't possible to calculate the score separately per chromosome and then combine them somehow at the end?

Looks like my best option might be to create plink binary format versions of the data to speed it up in that case!

Thanks again.
