choishingwan / PRSice

A software package for calculating, applying, evaluating and plotting the results of polygenic risk scores
http://prsice.info
GNU General Public License v3.0

Obtaining PRS from a single individual using PGS Catalog files #302

Open aheritas opened 1 year ago

aheritas commented 1 year ago

Hi! I have been reading the documentation and some of your answers in forums. I understand that it is possible to calculate the PRS for a single individual using the PGS Catalog files. I have tried to do so using PRSice-2, but I have been unsuccessful. I am sharing my detailed steps here and would be grateful if you could guide me on how to troubleshoot this.

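I want to calculate the PRS for breast cancer (PGS000004, https://www.pgscatalog.org/score/PGS000004/). I am aware that the scoring file for this particular PRS does not include rsIDs, only genomic positions (I am using the harmonized file for GRCh37, https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/PGS000004/ScoringFiles/Harmonized/PGS000004_hmPOS_GRCh37.txt.gz). According to some of your answers in a forum (https://www.biostars.org/p/9463113/), when using PGS Catalog files we should add an additional column of p-values set to all 1 or all 0. I modified the file to include a new column, named p_value, that contains 1 in every row, resulting in PGS000004_withpval.txt (https://github.com/choishingwan/PRSice/files/9549230/PGS000004_withpval.txt).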
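For reference, one way to add such a constant p-value column is sketched below. This is only a minimal sketch and not necessarily how PGS000004_withpval.txt was produced: the function name add_pvalue_column is hypothetical, the input is assumed to be a decompressed, tab-delimited scoring file whose metadata lines start with #, and dropping those metadata lines is an assumption about what the base file should look like.

```shell
# add_pvalue_column IN OUT  (hypothetical helper, not part of PRSice or PLINK)
# Assumes IN is a decompressed, tab-delimited PGS Catalog scoring file.
add_pvalue_column() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { next }                                   # skip "#" metadata lines (assumption)
       !seen_header { print $0, "p_value"; seen_header = 1; next }  # first remaining line is the column header
       { print $0, 1 }                                 # every variant gets p-value 1
      ' "$1" > "$2"
}

# Possible usage, with the file names from this post:
#   gunzip -c PGS000004_hmPOS_GRCh37.txt.gz > PGS000004_hmPOS_GRCh37.txt
#   add_pvalue_column PGS000004_hmPOS_GRCh37.txt PGS000004_withpval.txt
```

Setting every p-value to 1 means the variants are only all included at the p-value threshold of 1.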

My input file is a VCF obtained from imputation software, containing approx. 80M variants. The first thing I do is normalize this VCF using bcftools, so that there is a single row per genomic position.

bcftools norm -m +any -O z -o NORMALIZED_VCF /home/user/data/ORIGINALFILEVCF_imputed.vcf.gz

Then, I transform this file into the necessary input files for PRSice (.bed, .bim, .fam) using PLINK v1.9.

plink --vcf /home/user/data/NORMALIZED_VCF.vcf.gz --snps-only --make-bed --out NORM_PLINK_VCF

Finally, I run PRSice, with the following parameters:

Rscript /home/user/data/PRSice.R --prsice /home/user/data/PRSice_linux --base /home/user/data/PGS000004_withpval.txt --a1 effect_allele --a2 other_allele --stat effect_weight --pvalue p_value --beta --bp chr_position --chr chr_name --chr-id c:l-ab --target NORM_PLINK_VCF --no-clump --out Output_NORM_PLINK_VCF_PRSice

The script runs, but I get the following error:

81192144 variant(s) not found in previous data 
237 variant(s) included 

There are a total of 1 phenotype to process 

Processing the 1 th phenotype 

Phenotype is a continuous phenotype 

Only one phenotype value detected and they are all -9. Not enough valid phenotype

So, I understand there is a problem with the phenotype file. The phenotype in this file is unknown, which is why I want to calculate the PRS, but perhaps I am incorrectly adding some extra parameters that are not necessary. Could you guide me through this calculation? Thank you very much!

choishingwan commented 1 year ago

You also want to add --no-regress, as you don't need to run the regression to optimize the parameters.
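Applied to the command from the question, that suggestion would look roughly like the sketch below. Every path and file name is copied from the original post and will differ on other systems, the wrapper function name run_prsice_no_regress is hypothetical, and the only change to the original command is the added --no-regress flag.

```shell
# Hypothetical wrapper around the command from the question.
# All paths and file names come from the original post; adjust them to your setup.
run_prsice_no_regress() {
  Rscript /home/user/data/PRSice.R \
    --prsice /home/user/data/PRSice_linux \
    --base /home/user/data/PGS000004_withpval.txt \
    --a1 effect_allele \
    --a2 other_allele \
    --stat effect_weight \
    --pvalue p_value \
    --beta \
    --bp chr_position \
    --chr chr_name \
    --chr-id c:l-ab \
    --target NORM_PLINK_VCF \
    --no-clump \
    --no-regress \
    --out Output_NORM_PLINK_VCF_PRSice
}

# Run with:
#   run_prsice_no_regress
```

With the regression skipped, no valid phenotype should be needed, so the all -9 phenotype column should no longer stop the run.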
