choishingwan / PRSice

A software package for calculating, applying, evaluating and plotting the results of polygenic risk scores
http://prsice.info
GNU General Public License v3.0
180 stars 85 forks source link

Question about --pheno file #221

Closed hershwin closed 3 years ago

hershwin commented 4 years ago

Hi Sam, I have a subgroup C from the UK biobank that I want to calculate PRSs for. If I only give phenotype file for C group versus for the entire cohort I get different PRS's. Do you know why this may be? Every other flag is the same. I also notice that if I only give the C group phenotype file then many of the PRSs are the same. Thanks!

hershwin commented 4 years ago

If I use the --keep flag I also get a different set of PRS. For example, here are the three sets of PRSs I get using different options:

ID PRS_keep PRS_subset PRS_full 1 0.0548214 4998803100 40.97292 2 0.0548214 8848812520 50.76872 3 0.0548214 9319369220 49.64093 4 0.0548214 9383536050 47.53339 5 0.0548214 9383536050 55.44406 6 0.0405525 9383536050 57.56381

Individuals 1-6 are all individuals in subgroup C (total in group C is about 2k) PRS_keep is when I use --keep flag on only the the subgroup but give the phenotype file for the entire UKB population PRS_subset is if I don't use the --keep flag but give only the phenotype file for subgroup C PRS_all if is if I don't use the --keep flag but give phenotype file for entire UKB population

Thanks for your help!

choishingwan commented 4 years ago

Hi,

Sorry, weekend is family time so I couldn't reply in time.

Could you please send me the full log for each of your runs? And did you use the same version of PRSice for each run? Your PRS_subset result doesn't make sense to me as we almost never expect to see such a large number. AS for your PRS_keep and PRS_full, it seems like one is done using --score avg and the other is using --score sum (respectively). While I do expect the PRS to differ when you keep different samples, I don't expect it to be so different even in situations where the data hasn't gone over QC.

I'll need much more information to see what's the problem. Sam

On Mon, Aug 24, 2020 at 5:41 AM hershwin notifications@github.com wrote:

If I use the --keep flag I also get a different set of PRS. For example, here are the three sets of PRSs I get using different options:

ID PRS_keep PRS_subset PRS_full 1 0.0548214 4998803100 40.97292 2 0.0548214 8848812520 50.76872 3 0.0548214 9319369220 49.64093 4 0.0548214 9383536050 47.53339 5 0.0548214 9383536050 55.44406 6 0.0405525 9383536050 57.56381

Individuals 1-6 are all individuals in subgroup C (total in group C is about 2k) PRS_keep is when I use --keep flag on only the the subgroup but give the phenotype file for the entire UKB population PRS_subset is if I don't use the --keep flag but give only the phenotype file for subgroup C PRS_all if is if I don't use the --keep flag but give phenotype file for entire UKB population

Thanks for your help!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/choishingwan/PRSice/issues/221#issuecomment-678828496, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJTRYQV7QMV3OLFB2JESA3SCGEHHANCNFSM4QIKC2MQ .

hershwin commented 4 years ago

Hi Sam, no worries thanks for your help :)

  1. I used PRSice v2.3.0 for PRS_subset and PRS_full (before the -keep flag was fixed), and v2.3.3 for PRS_keep.
  2. I use --score sum for all three cases
  3. Not sure if this is helpful, but the correlations between the three PRS are really low, r<0.1

Heres the log for PRS_keep (using chromosome 9 as an example): Rscript PRSice.R --dir . \ --prsice ./PRSice_linux \ --base fev1.stats.bgen \ --pvalue P_BOLT_LMM_INF \ --a1 ALLELE1 \ --a2 ALLELE0 \ --target ukb22881_imp_chr1_v3_s487324.sample \ --type bgen \ --extract chr9_bgen.valid \ --pheno /PRS/phenotypes_all.tab \ --pheno-col fev1_inst1 \ --keep /data_frames/C_ID_spiro.txt \ --cov /PRS/phenotypes_all.tab \ --cov-col @pc[1-40],f.50.0.0, assesment_center, f.31.0.0 \ --cov-factor assesment_center, f.31.0.0 \ --score sum \ --out C_chr9_bgen

Heres the log for PRS_subset Rscript PRSice.R --dir . \ --prsice ./PRSice_linux \ --base fev1.stats.bgen \ --pvalue P_BOLT_LMM_INF \ --a1 ALLELE1 \ --a2 ALLELE0 \ --target ukb22881_imp_chr1_v3_s487324.sample \ --type bgen \ --extract chr9_bgen.valid \ --pheno /PRS/phenotypes_C.tab \ --pheno-col fev1_inst1 \ --cov /PRS/phenotypes_C.tab \ --cov-col @pc[1-40],f.50.0.0, assesment_center, f.31.0.0 \ --cov-factor assesment_center, f.31.0.0 \ --score sum \ --out C_chr9_bgen

Heres the log for PRS_full, I then manually select individuals in group C

Rscript PRSice.R --dir . \ --prsice ./PRSice_linux \ --base fev1_inst1.stats.bgen \ --pvalue P_BOLT_LMM_INF \ --a1 ALLELE1 \ --a2 ALLELE0 \ --target ukb22881_imp_chr1_v3_s487324.sample \ --type bgen \ --extract chr9_bgen.valid \ --pheno /PRS/phenotypes_all.tab \ --pheno-col fev1_inst1 \ --cov /PRS/phenotypes_all.tab \ --cov-col @pc[1-40],f.50.0.0, assesment_center, f.31.0.0 \ --cov-factor assesment_center, f.31.0.0 \ --score sum \ --out chr9_all_bgen

Sorry if this is a little confusing, thanks again!

choishingwan commented 4 years ago

Oh, that’s why. 2.3.0 is problematic. Could you please try using 2.3.3? I should’ve put a flag on my website and ask people not to use 2.3.0

Sam

On Tue, 25 Aug 2020 at 2:34 AM, hershwin notifications@github.com wrote:

Hi Sam, no worries thanks for your help :)

  1. I used PRSice v2.3.0 for PRS_subset and PRS_full (before the -keep flag was fixed), and v2.3.3 for PRS_keep.

  2. I use --score sum for all three cases

  3. Not sure if this is helpful, but the correlations between the three PRS are really low, r<0.1

Heres the log for PRS_keep (using chromosome 9 as an example):

Rscript PRSice.R --dir .

--prsice ./PRSice_linux

--base fev1.stats.bgen

--pvalue P_BOLT_LMM_INF

--a1 ALLELE1

--a2 ALLELE0

--target ukb22881_imp_chr1_v3_s487324.sample

--type bgen

--extract chr9_bgen.valid

--pheno /PRS/phenotypes_all.tab

--pheno-col fev1_inst1

--keep /data_frames/C_ID_spiro.txt

--cov /PRS/phenotypes_all.tab

--cov-col @pc https://github.com/pc[1-40],f.50.0.0, assesment_center, f.31.0.0

--cov-factor assesment_center, f.31.0.0

--score sum

--out C_chr9_bgen

Heres the log for PRS_subset

Rscript PRSice.R --dir .

--prsice ./PRSice_linux

--base fev1.stats.bgen

--pvalue P_BOLT_LMM_INF

--a1 ALLELE1

--a2 ALLELE0

--target ukb22881_imp_chr1_v3_s487324.sample

--type bgen

--extract chr9_bgen.valid

--pheno /PRS/phenotypes_C.tab

--pheno-col fev1_inst1

--cov /PRS/phenotypes_C.tab

--cov-col @pc https://github.com/pc[1-40],f.50.0.0, assesment_center, f.31.0.0

--cov-factor assesment_center, f.31.0.0

--score sum

--out C_chr9_bgen

Heres the log for PRS_full, I then manually select individuals in group C

Rscript PRSice.R --dir .

--prsice ./PRSice_linux

--base fev1_inst1.stats.bgen

--pvalue P_BOLT_LMM_INF

--a1 ALLELE1

--a2 ALLELE0

--target ukb22881_imp_chr1_v3_s487324.sample

--type bgen

--extract chr9_bgen.valid

--pheno /PRS/phenotypes_all.tab

--pheno-col fev1_inst1

--cov /PRS/phenotypes_all.tab

--cov-col @pc https://github.com/pc[1-40],f.50.0.0, assesment_center, f.31.0.0

--cov-factor assesment_center, f.31.0.0

--score sum

--out chr9_all_bgen

Sorry if this is a little confusing, thanks again!

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/choishingwan/PRSice/issues/221#issuecomment-679294926, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJTRYX3JMSDM6CMHH4GELDSCKXCNANCNFSM4QIKC2MQ .

-- Dr Shing Wan Choi Postdoctoral Fellow Genetics and Genomic Sciences Icahn School of Medicine, Mount Sinai, NYC

hershwin commented 4 years ago

Okay, I will try that, thanks! I am also wondering how to combine the PRS from different chromosomes. I read your reply to a similar question in the google groups: use the --score sum flag and combine scores from the best threshold. How would I choose the best threshold? Should I use the --all-score flag, sum across every threshold, and use the one with the best R2? Thanks!

choishingwan commented 4 years ago

Given the speed of PRSice, it is usually not beneficial to perform the PRS on each chromosome separately as that will just increase the complication and chances for error.

If speed is a concern, doing --thread 22 --ultra should give you more or less equivalent speed as calculating the PRS on each chromosome separately. This is because, we can now perform multi-threaded clumping and --ultra will pre-load all genotype into the memory, thus reducing the IO penalty.

hershwin commented 4 years ago

Okay I see, I am calculating by chromosome because our UKB bgen files are separated into one per chromosome.

choishingwan commented 4 years ago

Assuming your bgen files are named ukb_chr1.bgen, ukb_chr2.bgen

you can do

--target ukb_chr#

PRSice will automatically substitute the 1-22 into #

For BGEN files, you can further speed up the analysis using --allow-inter, but be warned that you'll likely need a lot of disk space (because we will generate a large intermediate file (~100-200GB) for clumping ). Also, --ultra doesn't work with dosage scoring so you can just ignore that option, unless you use --hard

hershwin commented 4 years ago

great thank you for your help!

hershwin commented 4 years ago

Hi Sam, sorry to bother you again - I have been giving the jobs 200 GB with the flag --allow-inter and I am getting the error after running for 3-4 days:

Start performing clumping

std::bad_alloc

Error in fread(paste0(argv$out, ".summary"), data.table = F) : File '/PRSice_PRS/chrall_bgen.summary' does not exist or is non-readable. getwd()=='/PRSice_PRS'

Is this due to not enough memory? Thanks for your help.

choishingwan commented 4 years ago

Hi hershwin, what's the sample size that you are working with? How many SNPs were overlapped between the summary stat and the target? Are you working on 2.3.1?

It seems like PRSice crash at clumping stage. If it doesn't even presented the percentage, I doubt that is caused by memory (coz we haven't even started reading in SNPs)

hershwin commented 4 years ago

Hi Sam, I am using the most updated version 2.3.3. Here is part of the log file:

Loading Genotype info from target

==================================================

487409 people (223006 male(s), 264318 female(s)) observed 487409 founder(s) included

7402K SNPs processed in /ukb_imp_chr1_v3.bgen
8129K SNPs processed in /ukb_imp_chr2_v3.bgen
6696K SNPs processed in /ukb_imp_chr3_v3.bgen
6555K SNPs processed in /ukb_imp_chr4_v3.bgen
6070K SNPs processed in /ukb_imp_chr5_v3.bgen
5751K SNPs processed in /ukb_imp_chr6_v3.bgen
5405K SNPs processed in /ukb_imp_chr7_v3.bgen
5282K SNPs processed in /ukb_imp_chr8_v3.bgen
4066K SNPs processed in /ukb_imp_chr9_v3.bgen
4562K SNPs processed in /ukb_imp_chr10_v3.bgen
4628K SNPs processed in /ukb_imp_chr11_v3.bgen
4431K SNPs processed in /ukb_imp_chr12_v3.bgen
3270K SNPs processed in /ukb_imp_chr13_v3.bgen
3037K SNPs processed in /ukb_imp_chr14_v3.bgen
2767K SNPs processed in /ukb_imp_chr15_v3.bgen
3089K SNPs processed in /ukb_imp_chr16_v3.bgen
2660K SNPs processed in /ukb_imp_chr17_v3.bgen
2599K SNPs processed in /ukb_imp_chr18_v3.bgen
2087K SNPs processed in /ukb_imp_chr19_v3.bgen
2082K SNPs processed in /ukb_imp_chr20_v3.bgen
1261K SNPs processed in /ukb_imp_chr21_v3.bgen
1255K SNPs processed in /ukb_imp_chr22_v3.bgen
82165946 variant(s) not found in previous data 9323 variant(s) with mismatch information 10920354 variant(s) included

Calculate MAF and perform filtering on target SNPs ==================================================

Calculating allele frequencies: 100.00% 10920354 variant(s) included

Phenotype file: /PRSice_PRS/PRS/phenotypes_all.tab Column Name of Sample ID: FID+IID Note: If the phenotype file does not contain a header, the column name will be displayed as the Sample ID which is expected.

There are a total of 1 phenotype to process

Start performing clumping

std::bad_alloc

Error in fread(paste0(argv$out, ".summary"), data.table = F) : File '/PRSice_PRS/chrall_bgen.summary' does not exist or is non-readable. getwd()=='/PRSice_PRS'

so I believe 487,409 individuals and 10,920,354 variants.

I am running a similar job but for another phenotype with the same set of individuals and SNPs. That has been running since 2020-08-29T 11:02:20 and the log file shows clumping at 4.82%.

Would you recommend to split up the sample into smaller groups instead?

Thanks!

choishingwan commented 4 years ago

That might be a bug. Have replicated it and we’ll try to fix it On Wed, 2 Sep 2020 at 12:36 PM, hershwin notifications@github.com wrote:

Hi Sam, I am using the most updated version 2.3.3. Here is part of the log file:

Loading Genotype info from target

==================================================

487409 people (223006 male(s), 264318 female(s)) observed

487409 founder(s) included

7402K SNPs processed in /ukb_imp_chr1_v3.bgen

8129K SNPs processed in /ukb_imp_chr2_v3.bgen

6696K SNPs processed in /ukb_imp_chr3_v3.bgen

6555K SNPs processed in /ukb_imp_chr4_v3.bgen

6070K SNPs processed in /ukb_imp_chr5_v3.bgen

5751K SNPs processed in /ukb_imp_chr6_v3.bgen

5405K SNPs processed in /ukb_imp_chr7_v3.bgen

5282K SNPs processed in /ukb_imp_chr8_v3.bgen

4066K SNPs processed in /ukb_imp_chr9_v3.bgen

4562K SNPs processed in /ukb_imp_chr10_v3.bgen

4628K SNPs processed in /ukb_imp_chr11_v3.bgen

4431K SNPs processed in /ukb_imp_chr12_v3.bgen

3270K SNPs processed in /ukb_imp_chr13_v3.bgen

3037K SNPs processed in /ukb_imp_chr14_v3.bgen

2767K SNPs processed in /ukb_imp_chr15_v3.bgen

3089K SNPs processed in /ukb_imp_chr16_v3.bgen

2660K SNPs processed in /ukb_imp_chr17_v3.bgen

2599K SNPs processed in /ukb_imp_chr18_v3.bgen

2087K SNPs processed in /ukb_imp_chr19_v3.bgen

2082K SNPs processed in /ukb_imp_chr20_v3.bgen

1261K SNPs processed in /ukb_imp_chr21_v3.bgen

1255K SNPs processed in /ukb_imp_chr22_v3.bgen

82165946 variant(s) not found in previous data

9323 variant(s) with mismatch information

10920354 variant(s) included

Calculate MAF and perform filtering on target SNPs

Calculating allele frequencies: 100.00%

10920354 variant(s) included

Phenotype file:

/PRSice_PRS/PRS/phenotypes_all.tab

Column Name of Sample ID: FID+IID

Note: If the phenotype file does not contain a header, the

column name will be displayed as the Sample ID which is

expected.

There are a total of 1 phenotype to process

Start performing clumping

std::bad_alloc

Error in fread(paste0(argv$out, ".summary"), data.table = F) :

File '/PRSice_PRS/chrall_bgen.summary' does not exist or is non-readable. getwd()=='/PRSice_PRS'

so I believe 487,409 individuals and 10,920,354 variants.

I am running a similar job but for another phenotype with the same set of individuals and SNPs. That has been running since 2020-08-29T 11:02:20 and the log file shows clumping at 4.82%.

Would you recommend to split up the sample into smaller groups instead?

Thanks!

— You are receiving this because you modified the open/close state.

Reply to this email directly, view it on GitHub https://github.com/choishingwan/PRSice/issues/221#issuecomment-685289620, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJTRYQSCZRK4LHIGM4HJOTSDXDULANCNFSM4QIKC2MQ .

-- Dr Shing Wan Choi Postdoctoral Fellow Genetics and Genomic Sciences Icahn School of Medicine, Mount Sinai, NYC

choishingwan commented 4 years ago

Think I might have fixed it, but I haven't been able to run the full bgen analysis at the moment due to server loading. If you use --thread 22, that should temporarily solve your problem. Basically, when using only a single thread, PRSice will aggressively reserve # sample /2 (or 4, I can't remember) # SNPs retained 0.2 * 8 byte of data, which total up to around an unrealistic ~2TB of memory in your case. I have now fixed the PRSice memory reservation so that we will only reserve 0.2 of the biggest chromosome. In most case, that should drastically reduce the memory requirement to around 100Gb in your case (for single thread).

Once I manage to test the new build and am sure that has solved the problem, then I will publish it. As of now, try running PRSice with --thread 22, which might help.

hershwin commented 4 years ago

Okay thank you! I will give it a try and keep you updated.

hershwin commented 4 years ago

Hi Sam, I gave my job 200 GB of RAM and it ran out of memory. The clumping progress was at 0.01%. Would it help if I separate my job to smaller cohorts? ie split up my sample into 5 groups of ~100K individuals? Would this take up less storage?

choishingwan commented 4 years ago

never anticipate that it'd use that much memory. I will put it down in my todo list to hopefully reduce the memory footprint in the future. For now, the best way to reduce memory and speed things up will be to use plink2 to convert bgen into bed and supply that to PRSice using the --ld option.

You can randomly select 1000 samples and that should work well enough.

On Thu, Sep 10, 2020 at 9:03 PM hershwin notifications@github.com wrote:

Hi Sam, I gave my job 200 GB of RAM and it ran out of memory. The clumping progress was at 0.01%. Would it help if I separate my job to smaller cohorts? ie split up my sample into 5 groups of ~100K individuals? Would this take up less storage?

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/choishingwan/PRSice/issues/221#issuecomment-690271004, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJTRYQUKSTIU5SHKDSLZJDSFDFAVANCNFSM4QIKC2MQ .

hershwin commented 3 years ago

Hi Sam, okay I will do that then - I'm just curious, how were you able to calculate height and BMI PRS as mentioned in the your paper using imputed data with 2 million SNPs? Did you run into the same problems? Thanks!

choishingwan commented 3 years ago

That was an older version. What I did in this version is that I cache the genotype when doing clumping, which speed up prsice and especially PRSet but with the price of increased memory usage

It wasn’t too much of an issue for genotype data but is problematic for the imputed data which I should really handle once I’ve got the time

On Tue, 6 Oct 2020 at 1:06 PM, hershwin notifications@github.com wrote:

Hi Sam, okay I will do that then - I'm just curious, how were you able to calculate height and BMI PRS as mentioned in the your paper using imputed data with 2 million SNPs? Did you run into the same problems? Thanks!

— You are receiving this because you modified the open/close state.

Reply to this email directly, view it on GitHub https://github.com/choishingwan/PRSice/issues/221#issuecomment-704029615, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJTRYQVCXKC5IJKHY4SAULSJKQURANCNFSM4QIKC2MQ .

-- Dr Shing Wan Choi Postdoctoral Fellow Genetics and Genomic Sciences Icahn School of Medicine, Mount Sinai, NYC

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.