getian107 / PRScsx

Cross-population polygenic prediction
MIT License
65 stars 20 forks source link

KeyError: '\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1aC' #47

Closed sahwa closed 5 months ago

sahwa commented 6 months ago

Hi,

I am encountering strange error when trying to run chromosome 1 on PRScsx. When we try run any other chromosome 2-22, then it works fine.

Here is the command:

(prscsx) [16:04:27] aey472@compa008/well/ckb/users/aey472/projects/pgs_subtype/scripts$ python ${prscsx} \
>   --ref_dir=/well/ckb/users/aey472/projects/pgs_subtype/data/LDref \
>   --bim_prefix=/gpfs3/well/ckb/users/dma206/prs/data/ckb_rs/chr${chr}_rs \
>   --sst_file=${ldscdir}/${bname}.ACTG.txt \
>   --n_gwas=${neff} \
>   --pop=${pop} \
>   --out_dir=${ldscdir} \
>   --out_name=${bname}.ACTG \
>   --chrom=${chr}

--ref_dir=/well/ckb/users/aey472/projects/pgs_subtype/data/LDref
--bim_prefix=/gpfs3/well/ckb/users/dma206/prs/data/ckb_rs/chr1_rs
--sst_file=['/well/ckb/users/aey472/projects/pgs_subtype/data/sumstats/ldsc/GCST90002346_mean_platelet_volume_EUR.ACTG.txt']
--a=1
--b=0.5
--phi=None
--n_gwas=[562243]
--pop=['EUR']
--n_iter=1000
--n_burnin=500
--thin=5
--out_dir=/well/ckb/users/aey472/projects/pgs_subtype/data/sumstats/ldsc
--out_name=GCST90002346_mean_platelet_volume_EUR.ACTG
--chrom=['1']
--meta=FALSE
--seed=None

*** only 1 discovery population detected ***

##### process chromosome 1 #####
... parse reference file: /well/ckb/users/aey472/projects/pgs_subtype/data/LDref/snpinfo_mult_1kg_hm3 ...
... 109168 SNPs on chromosome 1 read from /well/ckb/users/aey472/projects/pgs_subtype/data/LDref/snpinfo_mult_1kg_hm3 ...
... parse bim file: /gpfs3/well/ckb/users/dma206/prs/data/ckb_rs/chr1_rs.bim ...
... 3548427 SNPs on chromosome 1 read from /gpfs3/well/ckb/users/dma206/prs/data/ckb_rs/chr1_rs.bim ...
... parse EUR sumstats file: /well/ckb/users/aey472/projects/pgs_subtype/data/sumstats/ldsc/GCST90002346_mean_platelet_volume_EUR.ACTG.txt ...
... 46647143 SNPs read from /well/ckb/users/aey472/projects/pgs_subtype/data/sumstats/ldsc/GCST90002346_mean_platelet_volume_EUR.ACTG.txt ...
Traceback (most recent call last):
  File "/well/ckb/users/aey472/projects/pgs_subtype/programs/PRScsx/PRScsx.py", line 154, in <module>
    main()
  File "/well/ckb/users/aey472/projects/pgs_subtype/programs/PRScsx/PRScsx.py", line 137, in main
    sst_dict[pp] = parse_genet.parse_sumstats(ref_dict, vld_dict, param_dict['sst_file'][pp], param_dict['pop'][pp], param_dict['n_gwas'][pp])
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs3/well/ckb/users/aey472/projects/pgs_subtype/programs/PRScsx/parse_genet.py", line 92, in parse_sumstats
    set(zip(snp_ref, [mapping[aa] for aa in a1_ref], [mapping[aa] for aa in a2_ref])) | \
                                                      ~~~~~~~^^^^
KeyError: '\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1a\x1aC'

The sumstats file is like this:

(prscsx) [16:05:15] aey472@compa008/well/ckb/users/aey472/projects/pgs_subtype/scripts$ head /well/ckb/users/aey472/projects/pgs_subtype/data/sumstats/ldsc/GCST90002346_mean_platelet_volume_EUR.ACTG.txt
SNP     A1      A2      BETA    SE
rs534229142     A       G       -0.049934       0.041214
rs537182016     A       C       0.053095        0.13105
rs558604819     A       G       -0.09791        0.087744
rs575272151     G       C       -0.00722        0.004933
rs544419019     G       C       -0.006777       0.004943
rs561109771     G       T       -0.788577       0.337534
rs574746232     G       T       -0.30433        0.277469
rs540538026     A       G       0.010248        0.019571
rs62635286      G       T       0.002588        0.003859

And the bim file is:

(prscsx) [16:08:32] aey472@compa008/well/ckb/users/aey472/projects/pgs_subtype/scripts$ head /gpfs3/well/ckb/users/dma206/prs/data/ckb_rs/chr1_rs.bim
1       rs1250970180    0       13569   C       T
1       rs1397467439    0       19994   G       C
1       rs1231347385    0       49309   AACTTGTAAACACAAAGAAGGAAACAACAGGC        A
1       rs1237444189    0       50491   T       C
1       rs558396535     0       59032   A       G
1       rs573497853     0       62451   C       A
1       rs1160552650    0       62589   A       G
1       rs1256219665    0       64057   AT      A
1       rs1463026327    0       64860   T       C
1       rs1487991286    0       68807   A       G

and we used the LD references downloaded from the link on your github page.

Any idea what the error might be caused by?

Thanks,

Sam

getian107 commented 6 months ago

Hi Sam -- This is indeed a strange error that we have never seen before. I'd suggest you double check the chr1 reference panel and the snpinfo_mult_1kg_hm3 file to make sure that the files are intact and haven't been modified after downloading. Given that you have only input from one population, you can also try PRS-CS and see if you get the same error (in theory PRS-CSx with input from one population reduces to the PRS-CS model).

fazilaliev commented 6 months ago

Hi Sam, When looking at your .bim file I suspected that all rs names are long (like rs1463026327 which has 10 digits after rs) and on NCBI (link https://www.ncbi.nlm.nih.gov/snp/) they look like are TOPMED SNPs. You are using 1000G panel where almost (maybe all) SNP names have <=9 digits in rs names. Possibly, intersection of bim, sumstat and snpinfo files is empty. Check if you have any SNP left after intersecting the three files. Thanks, Fazil

sahwa commented 6 months ago

Hi both, thanks for your reply.

I re-downloaded the LD reference panel and other files but the error still persisted. I ended up trying on another computing cluster and it worked fine, so I'm not sure what the issues is! I'll leave it open.

You are right that these are topmed imputed SNPs, and there is a fairly low overlap between the panel and sumstats, but all the other chromosomes completely successfully - where the overlap was zero, the program continued and produced an empty output.

getian107 commented 6 months ago

It sounds that the error was due to an issue of the computing cluster not the algorithm but let me know if I can help in any ways.

Regarding the SNP identifiers, it would be important to have the same rs ID as the 1000 genomes project - otherwise the overlap between the reference panel, summary stats and target dataset will be small which may significantly reduce prediction accuracy.

violet229 commented 4 months ago

thanks for sharing, I have the same problem and it solved now CHR SNP BP A1 A2 FRQ_AFR FRQ_AMR FRQ_EAS FRQ_EUR FRQ_SAS FLP_AFR FLP_AMR FLP_EAS FLP_EUR FLP_SAS 1 rs2185539 566875 T C 0.124800 0.000000 0.000000 0.000000 0.000000 1 > 1 rs6681105 592075 C T 0.021940 0.000000 0.000000 0.000000 0.000000 1 > 1 rs3131972 752721 A G 0.709500 0.263700 0.234100 0.161000 0.221900 -1 > 1 rs3131969 754182 A G 0.647500 0.243500 0.266900 0.128200 0.191200 -1 > 1 rs1048488 760912 C T 0.555200 0.194500 0.116100 0.160000 0.189200 -1 > 1 rs12562034 768448 A G 0.085480 0.089340 0.395800 0.092450 0.300600 1 > 1 rs4040617 779322 G A 0.472800 0.155600 0.112100 0.118300 0.172800 1 > 1 rs2905036 792480 C T 0.078670 0.010090 0.000000 0.000000 0.000000 1 > 1 rs4245756 799463 T C 0.098340 0.010090 0.000000 0.000000 0.000000 1 > 1 rs4970383 838555 A C 0.366900 0.276700 0.392900 0.238600 0.379300 1 > 1 rs4475691 846808 T C 0.403200 0.276700 0.108100 0.183900 0.262800 1 > 1 rs28587382 852772 A G 0.016640 0.000000 0.000000 0.000000 0.000000 1 > 1 rs1806509 853954 A C 0.311600 0.455300 0.586300 0.603400 0.568500 1 > 1 rs7537756 854250 G A 0.251100 0.253600 0.113100 0.192800 0.269900 1 > 1 rs6694982 857828 G A 0.040090 0.000000 0.000000 0.000000 0.000000 1 > 1 rs13302982 861808 A G 0.658100 0.171500 0.452400 0.024850 0.123700 -1 > 1 rs4040604 863124 G T 0.665700 0.171500 0.452400 0.024850 0.124700 -1 > 1 rs2340587 864938 G A 0.778400 0.402000 0.514900 0.235600 0.385500 -1 > 1 rs28576697 870645 C T 0.033280 0.269500 0.125000 0.292200 0.316000 1 > 1 rs1110052 873558 T G 0.203500 0.582100 0.489100 0.711700 0.608400 1 > 1 rs7523549 879317 T C 0.215600 0.077810 0.066470 0.039760 0.011250 1 > 1 rs3748592 880238 A G 0.093800 0.080690 0.086310 0.047710 0.082820 1 > 1 rs3748593 880390 A C 0.192100 0.073490 0.067460 0.037770 0.011250 1 > 1 rs2272756 882033 A G 0.017400 0.247800 0.077380 0.249500 0.164600 1 > 1 rs2340582 882803 A G 0.093800 0.080690 0.080360 0.047710 0.082820 1 > 1 rs4246503 884815 A G 0.102900 0.080690 0.080360 0.047710 0.082820 1 > 1 rs3748594 886384 A G 0.216300 0.076370 0.070440 0.038770 0.010220 1 > 1 rs17160698 887162 C T 0.111200 0.020170 0.000000 0.000000 0.000000 1 > 1 rs3748595 887560 A C 0.171000 0.090780 0.079370 0.051690 0.082820 1 > 1 rs3748597 888659 T ^Z^Z^Z^Z^Z^Z^Z^Z^Z^Z^Z^Z^ZC 0.093040 0.080690 0.079370 0.046720