cuelee / pleio


ValueError: Found Allele mismatch #19

Open SylviaXJY opened 1 year ago

SylviaXJY commented 1 year ago

When I call ./ldsc_preprocess.py -h, it runs and I successfully get "sg.txt.gz" and "ce.txt.gz". However, when I check the log file, I see the following error:

ValueError: Found Allele mismatch: ['rs7299872' 'rs7299873' 'rs7299874' ... 'rs10943760' 'rs7254116' 'rs11954743']

And I am sure I have already adjusted my VCF files to match the reference. What am I missing here? Thanks a lot!


cuelee commented 1 year ago

Hello, thanks for reporting this issue. The error indicates that there is an allele flip in your input data. If this is not the case, there may be a bug in the code.

Could you please provide me with a dataset that reproduces the error? I'll check it against the code.

Best Regards, Cue

SylviaXJY commented 1 year ago

Thank you very much for your prompt response! Here is one of my datasets (MS & TIA): MS GWAS, TIA GWAS

error log file: 31cf996d96db86aecd985385d6493b8

In another test (TIA & IBD), this error did not occur: IBD GWAS, input.txt, log file

cuelee commented 1 year ago

Hello, SylviaXJY

I found what caused the error.

"ldsc_preprocess.py" does not allow allele mismatches: the risk (A1) and reference (A2) alleles must match across all .sumstats files.

In the example dataset, I found several allele mismatches. One of them is SNP 'rs12478753':

ms: rs12478753 G(A1) A(A2) -1.762 115803.000
TIA: rs12478753 A(A1) G(A2) 447388 0.752
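To make the rule concrete, here is a minimal sketch of how such mismatches could be listed before running the script. It is an illustration only, not the check implemented in ldsc_preprocess.py: the function name `find_allele_mismatches` and the dictionary layout (SNP ID mapped to an (A1, A2) tuple) are assumptions made for the example, and `rs0000001` is a made-up placeholder SNP.

```python
def find_allele_mismatches(trait_a, trait_b):
    """Return SNP IDs present in both traits whose (A1, A2) pairs differ.

    Each argument maps a SNP ID to an (A1, A2) tuple. This layout is a
    hypothetical simplification, not the .sumstats format PLEIO reads.
    """
    shared = trait_a.keys() & trait_b.keys()
    return sorted(snp for snp in shared if trait_a[snp] != trait_b[snp])


# rs12478753 and its alleles come from the thread; rs0000001 is a placeholder.
ms = {"rs12478753": ("G", "A"), "rs0000001": ("C", "T")}
tia = {"rs12478753": ("A", "G"), "rs0000001": ("C", "T")}

mismatched = find_allele_mismatches(ms, tia)
print(mismatched)
```

With the two rows quoted above, only rs12478753 is reported, because its A1 and A2 are swapped between the two traits.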

Please do the following QC for your summary statistics before you analyze them using PLEIO. I will elaborate on the details of the QC process for PLEIO analysis in the wiki later:
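The specific QC steps are not listed in this comment. One common way to resolve swapped alleles, offered here as an assumption rather than as PLEIO's documented procedure, is to align every trait to a single reference allele order: when A1 and A2 are exchanged, swap them back and negate the effect statistic. The sketch below assumes a hypothetical layout (SNP ID mapped to (A1, A2, Z)) and treats the value -1.762 from the thread as a Z-score, which is also an assumption. `rs0000002` is a made-up placeholder. It does not handle strand flips or palindromic SNPs.

```python
def harmonize(records, reference):
    """Align records to the reference allele order.

    records:   SNP ID -> (A1, A2, Z)   (hypothetical layout)
    reference: SNP ID -> (A1, A2)
    Returns (aligned, dropped): aligned has the same layout as records,
    dropped is a sorted list of SNP IDs that could not be aligned.
    """
    aligned, dropped = {}, []
    for snp, (a1, a2, z) in records.items():
        ref = reference.get(snp)
        if ref == (a1, a2):
            # Already in the reference order: keep unchanged.
            aligned[snp] = (a1, a2, z)
        elif ref == (a2, a1):
            # Alleles are swapped: exchange them and flip the sign of Z.
            aligned[snp] = (a2, a1, -z)
        else:
            # Missing from the reference or different alleles: drop.
            dropped.append(snp)
    return aligned, sorted(dropped)


ms = {"rs12478753": ("G", "A", -1.762), "rs0000002": ("C", "T", 0.5)}
reference = {"rs12478753": ("A", "G"), "rs0000002": ("C", "G")}

aligned, dropped = harmonize(ms, reference)
print(aligned)
print(dropped)
```

Dropping SNPs that cannot be aligned is the conservative choice here; whether that is appropriate depends on how many variants are affected.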

P.S. During the investigation, I found a minor bug and fixed it.

Best Regards, Cue

cuelee commented 1 year ago

I also found a difference between the log file I generated from your data and the one you generated. I don't yet know what causes this difference. Let me know if you have any other questions related to this.

Below is the .log I got from the ldsc_preprocess:


Beginning analysis at Sat Nov 12 16:24:56 2022
Read 2 traits from input
Failed to create a directory at : output
Failed to create a directory at : output/temp
Dividing z-scores with the correction factor (the squared root of the LDSC h2 analysis intercept value).
Dividing z-scores with the correction factor (the squared root of the LDSC h2 analysis intercept value).
Generate input files (sumstats.txt.gz) for LDSC --rg analysis
Generated output/temp/transient_ischemic_attack.sumstats.tsv.sumstats.gz
Generated output/temp/ms.sumstats.tsv.sumstats.gz
Number of variants in common: 1051555
Found 0 duplicated variants
Traceback (most recent call last):
  File "./ldsc_preprocess.py", line 482, in <module>
    preprocess(args,log)
  File "./ldsc_preprocess.py", line 349, in preprocess
    sumstat_data = generate_sumstat_data(data)
  File "./ldsc_preprocess.py", line 162, in __init__
    self.check_sumstats()
  File "./ldsc_preprocess.py", line 297, in check_sumstats
    check_allele_mismatch(A1)
  File "./ldsc_preprocess.py", line 271, in check_allele_mismatch
    raise ValueError('Found Allele mismatch: {}'.format(s))
ValueError: Found Allele mismatch: ['rs12478753' 'rs61826221' 'rs7607542' ... 'rs4663010' 'rs4663011' 'rs1910866']

Analysis finished at Sat Nov 12 16:26:27 2022
Total time elapsed: 1.0m:31.29s
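For readers unfamiliar with the "correction factor" step named in the log, it amounts to dividing each z-score by the square root of the LDSC h2 intercept. The sketch below only illustrates that arithmetic; the function name `correct_z_scores` and the example values are made up, and this is not the code from ldsc_preprocess.py.

```python
import math


def correct_z_scores(z_scores, ldsc_intercept):
    """Divide each z-score by sqrt(intercept), as the log line describes."""
    factor = math.sqrt(ldsc_intercept)
    return [z / factor for z in z_scores]


# Made-up z-scores and intercept, chosen only to show the arithmetic.
corrected = correct_z_scores([2.2, -1.1], 1.21)
print(corrected)
```

An intercept of exactly 1 leaves the z-scores unchanged, while an intercept above 1 shrinks them toward zero.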

SylviaXJY commented 1 year ago

Thank you for your prompt response, and thank you for the solution. I will try it!

Best, Xiongjy