getian107 / PRScsx

Cross-population polygenic prediction
MIT License
65 stars 20 forks

Index Error occurred #12

Closed BogyeomKim closed 2 years ago

BogyeomKim commented 3 years ago

Dear Dr. Tian,

I encountered the following error while running PRS-CSx:

`--ref_dir=/work2/07939/tg872382/stampede2/connectome/stampede2/PRScsx/ld_ref
--bim_prefix=../ABCD_genotype2021/ABCD_QCed_2021_PCair_8620+1579_PRScs/ABCD_QCed_2021_PCair_8620+1579_PRScs_SNPrsid_final
--sst_file=['../ABCD_summarystats/final_ASD_forPRScsx.txt']
--a=1
--b=0.5
--phi=1.0
--n_gwas=[46350]
--pop=['EUR']
--n_iter=1000
--n_burnin=500
--thin=5
--out_dir=/work2/08170/amyk01/stampede2/ABCD_PRScsx/ASD/prscsx_output
--out_name=ABCD_ASD_csx
--chrom=['22']
--meta=FALSE
--seed=None
process chromosome 22
... parse reference file: /work2/07939/tg872382/stampede2/connectome/stampede2/PRScsx/ld_ref/snpinfo_mult_1kg_hm3 ...
... 18944 SNPs on chromosome 22 read from /work2/07939/tg872382/stampede2/connectome/stampede2/PRScsx/ld_ref/snpinfo_mult_1kg_hm3 ...
... parse bim file: ../ABCD_genotype2021/ABCD_QCed_2021_PCair_8620+1579_PRScs/ABCD_QCed_2021_PCair_8620+1579_PRScs_SNPrsid_final.bim ...
... 154308 SNPs on chromosome 22 read from ../ABCD_genotype2021/ABCD_QCed_2021_PCair_8620+1579_PRScs/ABCD_QCed_2021_PCair_8620+1579_PRScs_SNPrsid_final.bim ...
... parse EUR sumstats file: ../ABCD_summarystats/final_ASD_forPRScsx.txt ...
Traceback (most recent call last):
  File "/work2/07939/tg872382/stampede2/connectome/stampede2/PRScsx/PRScsx.py", line 204, in <module>
    main()
  File "/work2/07939/tg872382/stampede2/connectome/stampede2/PRScsx/PRScsx.py", line 187, in main
    sst_dict[pp] = parse_genet.parse_sumstats(ref_dict, vld_dict, param_dict['sst_file'][pp], param_dict['pop'][pp], param_dict['n_gwas'][pp])
  File "/work2/07939/tg872382/stampede2/connectome/stampede2/PRScsx/parse_genet.py", line 73, in parse_sumstats
    if ll[1] in ATGC and ll[2] in ATGC:
IndexError: list index out of range`

The command I ran was:

`python3 /work2/07939/tg872382/stampede2/connectome/stampede2/PRScsx/PRScsx.py --ref_dir=/work2/07939/tg872382/stampede2/connectome/stampede2/PRScsx/ld_ref --bim_prefix=../ABCD_genotype2021/ABCD_QCed_2021_PCair_8620+1579_PRScs/ABCD_QCed_2021_PCair_8620+1579_PRScs_SNPrsid_final --sst_file=../ABCD_summarystats/final_ASD_forPRScsx.txt --n_gwas=46350 --pop=EUR --chrom=22 --phi=1 --out_dir=/work2/08170/amyk01/stampede2/ABCD_PRScsx/ASD/prscsx_output --out_name=ABCD_ASD_csx`

I think the error might come from indels in the summary statistics file. The file contains some variants whose A1 or A2 is an indel allele (e.g., A1 = ATG, ATTTT). After I removed the indels, the code ran without error.

I wonder whether parse_genet.py can only handle 'A', 'T', 'G', 'C' as A1 or A2, because about 10% of the variants were lost when I excluded the indels.

Best regards, Bogyeom Kim.

getian107 commented 3 years ago

Hi Bogyeom- the program only uses ATGC alleles in prediction but it automatically removes indels. I think the error is likely caused by format issues for the indels in the summary statistics file. Do they also have 5 columns separated by the same delimiter?

BogyeomKim commented 3 years ago

Yes, I think so. They are separated by a tab or space, but some alleles contain other special characters. Could those special characters cause this kind of error? For example:

rs1835369049 GA G 1.03624 0.1732
rs1588375638 C CAAAAAAAAAAAA+1 0.98462 0.2982
rs1238492479 CAA CA 1.01167 0.7432
rs551065795 T TGA 1.14191 0.06265
rs1201441772 C CAAAA 0.97971 0.1843
rs138725384 CATCT CATCTATCT 0.99332 0.6796
rs1831598329 A G 1.00150 0.9704
rs542437283 T C 1.03873 0.3628
rs573130529 T C 0.96271 0.3628
rs1005563844 CTT CT 0.99263 0.66
rs377432397 A ACACCT 0.96127 0.1399
rs1832552750 GTTTTTT G 0.99581 0.8282
rs1363260291 T TACACACACACAC 1.02327 0.7241
rs560859299 T G 1.11405 0.08013
rs1564281434 A 1.01136 0.4879
rs1554763846 G GAAAAA 1.00753 0.6193
rs533825754 CAT C 1.01096 0.4729
rs577175383 A G 0.90620 0.2593
rs533544459 A G 1.09341 0.3175
rs1830883390 T C 0.99293 0.8013
rs577656066 A T 0.97707 0.15
rs543096513 A C 0.97785 0.1615
rs571778589 T C 1.02224 0.3198
rs146397837 ACTATCTAT ACTATCTATCTAT 0.99820 0.9106
rs1554774335 G GT 0.97025 0.03794
rs1312402150 G 1.02562 0.2297
rs1588350854 A AT 0.98010 0.1575
rs560753750 A G 0.88816 0.08799
rs1201461609 T G 1.05802 0.02635
rs1161174426 T TA 1.04300 0.3379
rs111255034 A C 0.94290 0.06982
rs1588371417 T C 1.01908 0.761
rs371317046 GGT G 1.03562 0.1578
rs1433195122 T 1.01633 0.4812
rs5782635 GTT GT 0.97599 0.1879
rs547618855 T C 0.96802 0.5546
rs368792767 GCCCAC G 0.98590 0.7689
rs551388264 A AGT 1.02532 0.7203
rs1301498717 GA G 1.01908 0.2355

On Thu, Aug 26, 2021 at 11:53 PM, Tian Ge @.***> wrote:

Hi Bogyeom- the program only uses ATGC alleles in prediction but it automatically removes indels. I think the error is likely caused by format issues for the indels in the summary statistics file. Do they also have 5 columns separated by the same delimiter?
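One way to answer the five-column question is to scan the summary statistics file and report any row that does not split into exactly five whitespace-separated fields, or whose alleles contain characters other than A, T, G, C. The sketch below is not part of PRS-CSx. The function name `find_malformed_rows`, the header line, and the assumption that fields are split on whitespace are all illustrative.

```python
VALID_BASES = set("ATGC")


def find_malformed_rows(lines, expected_fields=5, has_header=True):
    """Return (line_number, reason, text) for rows that look problematic.

    Hypothetical helper: assumes whitespace-delimited columns in the
    order SNP, A1, A2, effect, P.
    """
    problems = []
    for line_no, line in enumerate(lines, start=1):
        if has_header and line_no == 1:
            continue  # skip the header row
        text = line.rstrip("\n")
        fields = text.split()
        if len(fields) != expected_fields:
            problems.append((line_no, f"{len(fields)} fields", text))
            continue
        for allele in fields[1:3]:
            # Flag any character outside A/T/G/C, such as '+' or digits.
            if not set(allele.upper()) <= VALID_BASES:
                problems.append((line_no, f"unexpected characters in {allele!r}", text))
                break
    return problems


# A few rows in the same layout as the sample posted in this thread
# (the header line is an assumption).
sample = [
    "SNP A1 A2 OR P",
    "rs1588375638 C CAAAAAAAAAAAA+1 0.98462 0.2982",
    "rs1564281434 A 1.01136 0.4879",
    "rs1831598329 A G 1.00150 0.9704",
]
problems = find_malformed_rows(sample)
for line_no, reason, text in problems:
    print(f"line {line_no}: {reason}: {text}")
```

To check a real file, pass an open file handle instead of the list, for example `find_malformed_rows(open(path))`. Rows reported with fewer than five fields would be the first candidates to inspect.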


getian107 commented 3 years ago

Yes I think these special characters might be the cause. If it's not too annoying you can remove indels from the summary stats before running the algorithm. This wouldn't reduce prediction power since PRS-CS(x) only uses SNPs for prediction.
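A minimal sketch of that pre-filtering step, assuming a whitespace-delimited file with a header line and the columns SNP, A1, A2, effect, P. The helper name `keep_snps_only` is hypothetical and this is not code from PRS-CSx. It keeps only rows with exactly five fields in which both alleles are a single A, T, G, or C.

```python
VALID_BASES = {"A", "T", "G", "C"}


def keep_snps_only(lines, has_header=True):
    """Return the header (if any) plus rows whose A1 and A2 are single bases."""
    kept = []
    for index, line in enumerate(lines):
        if has_header and index == 0:
            kept.append(line)  # pass the header through unchanged
            continue
        fields = line.split()
        if len(fields) != 5:
            continue  # drop rows with missing or extra columns
        a1, a2 = fields[1].upper(), fields[2].upper()
        # Multi-base alleles such as "GA" are not members of VALID_BASES.
        if a1 in VALID_BASES and a2 in VALID_BASES:
            kept.append(line)
    return kept


# Rows in the layout of the sample posted above (header is an assumption).
sample = [
    "SNP A1 A2 OR P",
    "rs1835369049 GA G 1.03624 0.1732",
    "rs1588375638 C CAAAAAAAAAAAA+1 0.98462 0.2982",
    "rs1564281434 A 1.01136 0.4879",
    "rs1831598329 A G 1.00150 0.9704",
    "rs542437283 T C 1.03873 0.3628",
]
filtered = keep_snps_only(sample)
print("\n".join(filtered))
```

For a file on disk, the same function can be applied to the lines read from the input and the result written to a new file, which is then passed to `--sst_file`.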