choishingwan / PRSice

A software package for calculating, applying, evaluating and plotting the results of polygenic risk scores
http://prsice.info
GNU General Public License v3.0

[BUG] related to unknown/sex chromosome? #280

Closed asthakhatiwada closed 2 years ago

asthakhatiwada commented 3 years ago

Bug: this seems related to processing the base data, which contains SNP, CHR, BP, A1, OR, P. All SNPs are excluded as they are "on unknown/sex chromosome". This seems irrelevant for the base data, but my target data doesn't include sex information. Is that what's causing the issue? Does PRSice not work when sex information is not present in the target data?

Error Log

PRSice 2.3.3 (2020-08-05)
https://github.com/choishingwan/PRSice
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2021-10-14 17:00:40
./PRSice_mac \
    --a1 A1 \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base AA_base.txt.gz \
    --binary-target T \
    --bp BP \
    --chr CHR \
    --clump-kb 250kb \
    --clump-p 1.000000 \
    --clump-r2 0.100000 \
    --interval 5e-05 \
    --lower 5e-08 \
    --num-auto 22 \
    --or \
    --out AA_PRS \
    --pvalue P \
    --seed 2742174452 \
    --snp SNP \
    --stat OR \
    --target topmed_imputed_allchr_qc_phenotype_onlyAA \
    --thread 1 \
    --upper 0.5

Initializing Genotype file: topmed_imputed_allchr_qc_phenotype_onlyAA (bed)

Start processing AA_base.txt
Base file: AA_base.txt.gz
GZ file detected.
Header of file is:
SNP CHR BP A1 OR P

Reading 100.00%
789303 variant(s) observed in base file, with:
789303 variant(s) excluded as they are on unknown/sex chromosome
0 total variant(s) included from base file

Error: No valid variant remaining

Error: Execution halted

choishingwan commented 3 years ago

What does your chromosome encoding look like?


asthakhatiwada commented 3 years ago

In both target and base file they are numeric between 1 and 22.

choishingwan commented 3 years ago

Would you mind showing the first 2-3 lines of your summary statistics file? That is where the log suggests the problem lies.

choishingwan commented 3 years ago

To clarify, this error is complaining that all chromosomes in your base file either fall on the X / Y chromosome, or use an encoding that we don't recognize. This has nothing to do with the sex information in your target file.
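One way to see what sits in the CHR position of each data row is to tabulate those values directly. The following is a minimal sketch, not PRSice's own parsing logic: the function name `summarize_chr_field` is hypothetical, the file is assumed to be whitespace-delimited with a single header line, and treating a leading "chr" prefix as acceptable is an assumption made for illustration.

```python
from collections import Counter


def summarize_chr_field(lines, chr_name="CHR", num_auto=22):
    """Count data rows whose value in the CHR position is an autosome code.

    `lines` is any iterable of text lines (an open file handle works).
    The first line must be the whitespace-delimited header; a ValueError
    is raised if `chr_name` is not in it.
    Illustrative only: this is not how PRSice itself classifies chromosomes.
    """
    autosomes = {str(i) for i in range(1, num_auto + 1)}
    rows = iter(lines)
    header = next(rows, "").split()
    chr_idx = header.index(chr_name)  # position of CHR according to the header

    counts = Counter()
    examples = []  # a few offending values, to show what was actually read
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        value = fields[chr_idx] if chr_idx < len(fields) else ""
        # Assumption: accept an optional "chr" prefix such as "chr1".
        code = value[3:] if value.lower().startswith("chr") else value
        if code in autosomes:
            counts["autosome"] += 1
        else:
            counts["other"] += 1
            if len(examples) < 5:
                examples.append(value)
    return counts, examples
```

To run it on the gzipped base file, open the file in text mode with `gzip.open("AA_base.txt.gz", "rt")` and pass the handle as `lines`. If every row lands in "other", the returned examples show which values were found where a chromosome code was expected.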

asthakhatiwada commented 3 years ago

below are the first few lines from my base file..

head(aa_base)

A tibble: 6 × 6

  SNP            CHR     BP A1       OR      P
1 rs1233625550     1 776158 T     0.747 0.0246
2 rs188068004      1 824504 A     1.78  0.0135
3 rs189710781      1 825811 T     0.644 0.0184
4 rs115451476      1 841171 A     1.78  0.0134
5 rs143117458      1 852093 T     1.78  0.0134
6 rs200103839      1 855530 G     1.78  0.0134

choishingwan commented 3 years ago

The reason for this error is that your base file has an extra column at the front of each data row. As a result, a frame shift occurs in your data, i.e.

Your SNP IDs are now 1,2,3,4,5,6 etc. and your chromosome numbers are rs1233XXXX, which we cannot convert into chromosome information.

asthakhatiwada commented 3 years ago

I don't think that's true. If you look at the tibble dimension, it says it's 6 × 6, and the first column is the SNP column. I copied the output from R, which shows 1, 2, ..., 6 as row names only for line reference; that column is not in the dataset. I hope this makes sense.

asthakhatiwada commented 3 years ago

Also, if you look at the error log I posted yesterday, it is reading in the correct columns. I have copied part of the error log below:

--bp BP
--chr CHR
--clump-kb 250kb
--clump-p 1.000000
--clump-r2 0.100000
--interval 5e-05
--lower 5e-08
--num-auto 22
--or
--out AA_PRS
--pvalue P
--seed 2742174452
--snp SNP
--stat OR

choishingwan commented 3 years ago

It read the correct header, but your data rows start with a row index that the header does not account for (at least based on the data you showed me).

What this means is that the first column is correctly identified as SNP in the header, but from the second row onward, the first column is the row index, not the SNP ID.
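A quick check for this kind of shift is to compare the number of fields in the header with the number of fields in the data rows of the file on disk, rather than with what R prints. This is a rough sketch under the assumption of a whitespace-delimited file with one header line; `find_field_count_mismatch` is a hypothetical helper, not part of PRSice.

```python
def find_field_count_mismatch(lines, max_rows=1000):
    """Return (line_number, header_fields, row_fields) for the first data row
    whose field count differs from the header, or None if none is found.

    `lines` is any iterable of text lines (an open file handle works).
    Only the first `max_rows` data rows are inspected.
    """
    rows = iter(lines)
    header_line = next(rows, None)
    if header_line is None:
        return None  # empty input: nothing to compare
    n_header = len(header_line.split())

    # Data rows start on line 2 of the file.
    for line_no, line in enumerate(rows, start=2):
        if line_no > max_rows + 1:
            break
        n_fields = len(line.split())
        if n_fields and n_fields != n_header:
            return line_no, n_header, n_fields
    return None
```

A header with 6 fields followed by rows with 7 fields is the signature of row names written without a matching header entry; a result of `None` means the inspected rows line up with the header.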

asthakhatiwada commented 3 years ago

You are correct: row names had been added to my data, which is why I was seeing that error. I was able to resolve the issue by removing those row names. Thanks for your help!
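For anyone hitting the same error, the row names can be stripped from an existing file without regenerating it. The sketch below assumes a whitespace-delimited file in which the affected rows have exactly one more field than the header; `drop_row_names` is a hypothetical helper, and the output is written tab-delimited as a choice made for this example.

```python
def drop_row_names(lines):
    """Yield cleaned lines with a leading row-name field removed.

    `lines` is any iterable of text lines (an open file handle works).
    A row is trimmed only when it has exactly one more field than the
    header; all other rows are passed through with tab separators.
    """
    rows = iter(lines)
    header_line = next(rows, None)
    if header_line is None:
        return  # empty input: nothing to yield
    header = header_line.split()
    yield "\t".join(header)

    for line in rows:
        fields = line.split()
        if not fields:
            continue  # drop blank lines
        if len(fields) == len(header) + 1:
            fields = fields[1:]  # discard the row-name field
        yield "\t".join(fields)
```

Each yielded line lacks a trailing newline, so add one when writing the result to a new file. If the file was produced with R's `write.table`, passing `row.names = FALSE` when writing avoids the extra column in the first place.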