JonJala / mama

MIT License

All SNPs are being dropped due to non-positive-semi-definiteness of omega #23

Closed kapoormanav closed 2 years ago

kapoormanav commented 3 years ago

Hi Grant and Jon,

All SNPs from my analysis are being dropped due to "non-positive-semi-definiteness of omega". Do you have any suggestions for fixing this problem?

Thanks,

Manav

#########################################
Running LD Score regression.
Regression coefficients (LD):
[[-1.45883266e-03 -1.27498494e-03]
 [-1.27498494e-03 -9.65617573e-05]]
Regression coefficients (Intercept):
[[1.08435618 1.04543503]
 [1.04543503 1.00893974]]
Regression coefficients (SE^2):
[[-0.04235656 -2.6532606 ]
 [-2.6532606  -3.08265681]]

Creating omega and sigma matrices.
Average Omega (including dropped slices) =
[[-0.01087047 -0.00785719]
 [-0.00785719 -0.00084023]]
Average Sigma (including dropped slices) =
[[1.08297402 1.04543503]
 [1.04543503 1.00471217]]

Adjusted 0 SNPs to make omega positive semi-definite.

Dropped 4730 SNPs due to non-positive-semi-definiteness of omega.
Dropped 4730 SNPs due to non-positive-definiteness of sigma.
Dropped 4730 total SNPs due to non-positive-(semi)-definiteness of omega / sigma.

Running main MAMA method.

Preparing results for output.

    Population 0: ('SAS', 'T2D')

/mnt/efs/users/manav.kapoor/mama-mainline/mama_pipeline.py:552: RuntimeWarning: Mean of empty slice.
  mean_chi_2 = np.square(new_df[Z_COL].to_numpy()).mean()
ERROR: Received Numpy error: invalid value (8)
        Mean Chi^2 for ('SAS', 'T2D') = nan
    Population 1: ('EUR', 'T2D')
ERROR: Received Numpy error: invalid value (8)
        Mean Chi^2 for ('EUR', 'T2D') = nan

JonJala commented 3 years ago

If there are negative diagonal elements in the LD score regression coefficients, that's going to torpedo your results. That might happen with a very small sample size and/or mean chi squared < 1. Mind including your full log with the logging turned up so that we can have a little more information?
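For intuition only (this is a generic numpy sketch, not MAMA's actual code): a symmetric matrix is positive semi-definite exactly when all of its eigenvalues are non-negative, and a negative diagonal entry alone already rules that out, since the corresponding basis vector exhibits negative curvature. The matrix below reuses the LD coefficient values from the log above purely as an example.

```python
import numpy as np

def is_psd(m, tol=1e-12):
    """Symmetric matrix is positive semi-definite iff all eigenvalues >= 0."""
    return bool(np.all(np.linalg.eigvalsh(m) >= -tol))

# A per-SNP omega slice built from coefficients with negative diagonal
# entries (like the LD coefficients logged above) can never pass this check:
# e_i^T M e_i = M[i, i] < 0 is already a direction of negative curvature.
omega_slice = np.array([[-1.45883266e-03, -1.27498494e-03],
                        [-1.27498494e-03, -9.65617573e-05]])
print(is_psd(omega_slice))   # False
print(is_psd(np.eye(2)))     # True
```

This is why every SNP gets dropped once the regression coefficients themselves go negative on the diagonal: the per-SNP omega slices inherit the problem.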

kapoormanav commented 3 years ago

2021-06-07 14:07:59,337 Reading in and running QC on LD Scores
2021-06-07 14:07:59,337 Reading in LD Scores file: ./GGC_Freeze_Three_merged_with_ukb_500_random_eur_exc_mhc_ldsc.1.l2.ldscore.gz
2021-06-07 14:07:59,568 Reading in summary statistics.
2021-06-07 14:07:59,568 Reading in ('SAS', 'T2D') sumstats file: ./type_2_diabetes_UKB_450_SAS.txt
2021-06-07 14:08:13,402 Running QC on ('SAS', 'T2D') summary statistics
2021-06-07 14:08:13,403 Initial number of SNPs / rows = 1402142
2021-06-07 14:08:18,965 Filtered out 0 SNPs with "NO NAN" (Filters out SNPs with any NaN values in required columns {'FREQ', 'SNP', 'A2', 'BETA', 'CHR', 'BP', 'A1', 'SE', 'P'})
2021-06-07 14:08:18,965 Filtered out 775365 SNPs with "FREQ BOUNDS" (Filters out SNPs with FREQ values outside of [0.01, 0.99])
2021-06-07 14:08:18,965 Filtered out 0 SNPs with "SE BOUNDS" (Filters out SNPs with non-positive SE values)
2021-06-07 14:08:18,965 Filtered out 7193 SNPs with "CHR VALUES" (Filters out SNPs with listed chromosomes not in ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'])
2021-06-07 14:08:18,965 Filtered out 0 SNPs with "DUPLICATE ALLELE SNPS" (Filters out SNPs with major allele = minor allele)
2021-06-07 14:08:18,965 Filtered out 148443 SNPs with "PALINDROMIC SNPS" (Filters out SNPs where major / minor alleles are a base pair)
2021-06-07 14:08:18,965 Filtered out 77665 SNPs with "INVALID ALLELES" (Filters out SNPs with alleles not in {'G', 'A', 'T', 'C'})
2021-06-07 14:08:18,965 Filtered out 0 SNPs with "NEGATIVE GWAS P" (Filters out SNPs with negative GWAS P values)
2021-06-07 14:08:18,967 Filtered out 845055 SNPs in total (as the union of drops, this may be less than the total of all the per-filter drops)
2021-06-07 14:08:18,967 Additionally dropped 0 duplicate SNPs
2021-06-07 14:08:18,992 Reading in ('EUR', 'T2D') sumstats file: ./EUR_T2d.txt
2021-06-07 14:08:22,889 Running QC on ('EUR', 'T2D') summary statistics
2021-06-07 14:08:22,890 Initial number of SNPs / rows = 383358
2021-06-07 14:08:24,666 Filtered out 0 SNPs with "NO NAN" (Filters out SNPs with any NaN values in required columns {'FREQ', 'SNP', 'A2', 'BETA', 'CHR', 'BP', 'A1', 'SE', 'P'})
2021-06-07 14:08:24,666 Filtered out 0 SNPs with "FREQ BOUNDS" (Filters out SNPs with FREQ values outside of [0.01, 0.99])
2021-06-07 14:08:24,666 Filtered out 0 SNPs with "SE BOUNDS" (Filters out SNPs with non-positive SE values)
2021-06-07 14:08:24,666 Filtered out 6 SNPs with "CHR VALUES" (Filters out SNPs with listed chromosomes not in ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'])
2021-06-07 14:08:24,666 Filtered out 0 SNPs with "DUPLICATE ALLELE SNPS" (Filters out SNPs with major allele = minor allele)
2021-06-07 14:08:24,666 Filtered out 27323 SNPs with "PALINDROMIC SNPS" (Filters out SNPs where major / minor alleles are a base pair)
2021-06-07 14:08:24,666 Filtered out 5697 SNPs with "INVALID ALLELES" (Filters out SNPs with alleles not in {'G', 'A', 'T', 'C'})
2021-06-07 14:08:24,666 Filtered out 0 SNPs with "NEGATIVE GWAS P" (Filters out SNPs with negative GWAS P values)
2021-06-07 14:08:24,667 Filtered out 33022 SNPs in total (as the union of drops, this may be less than the total of all the per-filter drops)
2021-06-07 14:08:24,667 Additionally dropped 0 duplicate SNPs
2021-06-07 14:08:24,780 Number of SNPS in initial intersection of all sources: 9350
2021-06-07 14:08:25,947 Standardizing reference alleles in summary statistics.
2021-06-07 14:08:25,986 Standardized to population: ('SAS', 'T2D')
2021-06-07 14:08:25,986 Dropped 0 SNPs during reference allele standardization.
2021-06-07 14:08:25,993 Running LD Score regression.
2021-06-07 14:08:25,995 Regression coefficients (LD): [[-0.00218478 -0.00255036] [-0.00255036 -0.00026218]]
2021-06-07 14:08:25,996 Regression coefficients (Intercept): [[1.05870677 1.03990404] [1.03990404 1.01028588]]
2021-06-07 14:08:25,996 Regression coefficients (SE^2): [[ 0.55252289 -1.37263668] [-1.37263668 -3.60841569]]
2021-06-07 14:08:25,996 Creating omega and sigma matrices.
2021-06-07 14:08:26,274 Average Omega (including dropped slices) = [[-0.01974345 -0.01886878] [-0.01886878 -0.00278857]]
2021-06-07 14:08:26,275 Average Sigma (including dropped slices) = [[1.07551735 1.03990404] [1.03990404 1.00548116]]
2021-06-07 14:08:26,275 Adjusted 0 SNPs to make omega positive semi-definite.
2021-06-07 14:08:26,275 Dropped 9350 SNPs due to non-positive-semi-definiteness of omega.
2021-06-07 14:08:26,275 Dropped 7118 SNPs due to non-positive-definiteness of sigma.
2021-06-07 14:08:26,275 Dropped 9350 total SNPs due to non-positive-(semi)-definiteness of omega / sigma.
2021-06-07 14:08:26,275 Running main MAMA method.
2021-06-07 14:08:26,372 Preparing results for output.
2021-06-07 14:08:26,372 Population 0: ('SAS', 'T2D')
2021-06-07 14:08:26,410 Received Numpy error: invalid value (8)
2021-06-07 14:08:26,410 Mean Chi^2 for ('SAS', 'T2D') = nan
2021-06-07 14:08:26,410 Population 1: ('EUR', 'T2D')
2021-06-07 14:08:26,443 Received Numpy error: invalid value (8)
2021-06-07 14:08:26,443 Mean Chi^2 for ('EUR', 'T2D') = nan
2021-06-07 14:08:26,450 Final SNP count = 0
2021-06-07 14:08:26,488 Writing results to disk.
2021-06-07 14:08:26,488 ./T2D_MAMA_SAS_T2D.res
2021-06-07 14:08:26,517 ./T2D_MAMA_EUR_T2D.res
2021-06-07 14:08:26,542 Execution complete.

JonJala commented 3 years ago

It looks like you're losing a ton of SNPs when the data sets are being intersected / merged (it gets pared down to less than 10k). Are you expecting such a small overlap? Any chance anything happened to the rsID columns of your data, or that they're somehow formatted very differently from each other, or something like that?

ggoldman1 commented 3 years ago

On top of what Jon said, it looks like over half of your SAS SNPs are being dropped due to allele frequency. You can use the --freq-bounds MIN MAX flag to change that, if you want. Only about 9% (33,022 / 383,358) of your EUR SNPs were dropped in QC, and you can loosen the QC that MAMA applies, but I think it's a bit of a concern that you're only starting with under 400K SNPs in EUR. Is this low number due to some additional QC you're applying in preprocessing?
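For illustration, the --freq-bounds flag Grant mentions takes a MIN and MAX; an invocation relaxing the default [0.01, 0.99] bounds might look something like this (the bound values and elided arguments are hypothetical, not a recommendation):

```shell
# Widen the allele-frequency filter from the default [0.01, 0.99];
# the 0.005 / 0.995 values here are purely illustrative
python3 mama.py \
    --sumstats ... \
    --ld-scores ... \
    --freq-bounds 0.005 0.995 \
    --out ...
```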

kapoormanav commented 3 years ago

Hi Grant,

The data is from exome sequencing of SAS and UKB cohorts and there is not much overlap for the common SNPs. I already filtered the EUR data for the common SNPs (AF > 0.05%). I can use the imputed data to see if the overlap increases between the data. Unfortunately there is no array data for our SAS group and I wanted to avoid the imputation of SAS.

Thanks,

Manav

paturley commented 3 years ago

That's tricky. MAMA essentially calculates the heritability and genetic correlation between the populations using LD score regression, and if there aren't enough SNPs to get reliable estimates, then you end up with crazy results like the ones you are getting. One option is to use the imputed SAS results for the MAMA step, but drop the imputed variants after the MAMA step. The imputed variants may be reliable enough for LD score regression even if you think the final results are suspect.

That said, you'd need to make it clear that is what you did when you write up your results, and reviewers may get nervous if it looks too unusual.


kapoormanav commented 3 years ago

Thanks Patrick. I will try that and let you people know. I will make sure to describe the process for reviewers.

JonJala commented 3 years ago

Any luck with trying that out?

kapoormanav commented 3 years ago

Hi, We are reimputing our data to 1000 Genomes, as it has somewhat better representation of whole genomes for SAS subjects. Meanwhile, I found one issue that was due to my oversight: I hadn't converted the odds ratios to betas. After converting to betas, the analysis worked for chromosomes with a larger proportion of SNPs (chr1 and chr2), but it was still wonky; the test statistics were highly inflated, with P values on the order of 1e-300. We will have the latest imputed data by next week and I will rerun the pipeline then. Thanks!
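For reference, the odds-ratio-to-beta conversion mentioned here is just a natural log on the effect estimate. A generic sketch (not MAMA-specific code; the values are made up):

```python
import numpy as np

# Logistic-regression GWAS output reports odds ratios (OR), while
# beta-based pipelines expect effects on the log-odds scale: BETA = ln(OR)
odds_ratio = np.array([0.95, 1.00, 1.20])
beta = np.log(odds_ratio)

# Note: if the reported SE is already on the log-odds scale (the usual
# convention for GWAS software), it can be kept as-is alongside ln(OR).
```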

JonJala commented 3 years ago

Ok, sounds good!

samkleeman1 commented 3 years ago

Hi,

I am having similar issues, with almost all my SNPs being removed by the 'non-positive-semi-definiteness of omega' filter.

I am using UKB data, imputed by UKB using HRC (the standard data provided) for EUR, AFR and CSA population subgroups. I selected SNPs with INFO score > 0.8 derived from 1000 randomly selected subjects in each population. This gives trans-ancestral LD scores spanning approximately 6.3 million SNPs. Then when I try to merge summary statistics in each population (also from UKB, so covering the same SNP set and imputation strategy), I am getting almost all SNPs removed by 'non-positive-semi-definiteness of omega'.

I am slightly confused, as I would have thought this is a routine application of this approach.

I have copied the output below.

Kind regards,

Sam Kleeman PhD Student, CSHL, NY, USA

<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<>
<> MAMA: Multi-Ancestry Meta-Analysis
<> Version: 1.0.0
<> (C) 2020 Social Science Genetic Association Consortium (SSGAC)
<> MIT License
<>
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<> Software-related correspondence: grantgoldman0@gmail.com or jjala.ssgac@gmail.com
<> All other correspondence: paturley@broadinstitute.org
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>

See full log at: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/cystatinc_mama_merged.log

Program executed via:
/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/mama/mama.py  \ 
    --sumstats /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/AFR/summ_SEM.tsv,AFR,cyc /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/CSA/summ_SEM.tsv,CSA,cyc /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/EUR/summ_SEM_cystatin.tsv,EUR,cyc  \ 
    --ld-scores /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr*.l2.ldscore.gz  \ 
    --out ../cystatinc_mama_merged  \ 
    --add-a1-col-match A1  \ 
    --add-a2-col-match A2  \ 
    --add-snp-col-match SNP  \ 
    --add-beta-col-match est  \ 
    --add-freq-col-match MAF  \ 
    --add-se-col-match SE  \ 
    --add-p-col-match Pval_Estimate  \ 
    --allow-palindromic-snps

Reading in and running QC on LD Scores
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr7.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr22.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr19.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr11.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr12.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr5.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr21.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr2.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr3.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr17.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr8.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr16.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr9.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr1.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr6.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr14.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr10.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr18.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr13.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr20.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr15.l2.ldscore.gz
Reading in LD Scores file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr4.l2.ldscore.gz

Reading in summary statistics.

Reading in ('AFR', 'cyc') sumstats file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/AFR/summ_SEM.tsv

Running QC on ('AFR', 'cyc') summary statistics

Initial number of SNPs / rows = 7184911

Filtered out 0 SNPs with "NO NAN" (Filters out SNPs with any NaN values in required columns {'A1', 'P', 'CHR', 'BP', 'SE', 'SNP', 'A2', 'FREQ', 'BETA'})
Filtered out 396147 SNPs with "FREQ BOUNDS" (Filters out SNPs with FREQ values outside of [0.01, 0.99])
Filtered out 0 SNPs with "SE BOUNDS" (Filters out SNPs with non-positive SE values)
Filtered out 0 SNPs with "CHR VALUES" (Filters out SNPs with listed chromosomes not in ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'])
Filtered out 0 SNPs with "DUPLICATE ALLELE SNPS" (Filters out SNPs with major allele = minor allele)
Filtered out 0 SNPs with "INVALID ALLELES" (Filters out SNPs with alleles not in {'A', 'G', 'C', 'T'})
Filtered out 0 SNPs with "NEGATIVE GWAS P" (Filters out SNPs with negative GWAS P values)

Filtered out 396147 SNPs in total (as the union of drops, this may be less than the total of all the per-filter drops)
Additionally dropped 0 duplicate SNPs

Reading in ('CSA', 'cyc') sumstats file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/CSA/summ_SEM.tsv

Running QC on ('CSA', 'cyc') summary statistics

Initial number of SNPs / rows = 8048719

Filtered out 0 SNPs with "NO NAN" (Filters out SNPs with any NaN values in required columns {'A1', 'P', 'CHR', 'BP', 'SE', 'SNP', 'A2', 'FREQ', 'BETA'})
Filtered out 464820 SNPs with "FREQ BOUNDS" (Filters out SNPs with FREQ values outside of [0.01, 0.99])
Filtered out 0 SNPs with "SE BOUNDS" (Filters out SNPs with non-positive SE values)
Filtered out 0 SNPs with "CHR VALUES" (Filters out SNPs with listed chromosomes not in ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'])
Filtered out 0 SNPs with "DUPLICATE ALLELE SNPS" (Filters out SNPs with major allele = minor allele)
Filtered out 0 SNPs with "INVALID ALLELES" (Filters out SNPs with alleles not in {'A', 'G', 'C', 'T'})
Filtered out 0 SNPs with "NEGATIVE GWAS P" (Filters out SNPs with negative GWAS P values)

Filtered out 464820 SNPs in total (as the union of drops, this may be less than the total of all the per-filter drops)
Additionally dropped 0 duplicate SNPs

Reading in ('EUR', 'cyc') sumstats file: /mnt/grid/ukbiobank/data/Application58510/skleeman/gwas_cystatinc/PRS/EUR/summ_SEM_cystatin.tsv

Running QC on ('EUR', 'cyc') summary statistics

Initial number of SNPs / rows = 8408363

Filtered out 0 SNPs with "NO NAN" (Filters out SNPs with any NaN values in required columns {'A1', 'P', 'CHR', 'BP', 'SE', 'SNP', 'A2', 'FREQ', 'BETA'})
Filtered out 106550 SNPs with "FREQ BOUNDS" (Filters out SNPs with FREQ values outside of [0.01, 0.99])
Filtered out 0 SNPs with "SE BOUNDS" (Filters out SNPs with non-positive SE values)
Filtered out 0 SNPs with "CHR VALUES" (Filters out SNPs with listed chromosomes not in ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'])
Filtered out 0 SNPs with "DUPLICATE ALLELE SNPS" (Filters out SNPs with major allele = minor allele)
Filtered out 0 SNPs with "INVALID ALLELES" (Filters out SNPs with alleles not in {'A', 'G', 'C', 'T'})
Filtered out 0 SNPs with "NEGATIVE GWAS P" (Filters out SNPs with negative GWAS P values)

Filtered out 106550 SNPs in total (as the union of drops, this may be less than the total of all the per-filter drops)
Additionally dropped 0 duplicate SNPs

Number of SNPS in initial intersection of all sources: 6371548

Standardizing reference alleles in summary statistics.
Standardized to population: ('AFR', 'cyc')
Dropped 0 SNPs during reference allele standardization.

Running LD Score regression.
Regression coefficients (LD):
[[-3.78167042e-006 -4.16683593e-052 -3.08677069e-006]
 [-4.16683593e-052  3.08958396e-108  3.47087472e-051]
 [-3.08677069e-006  3.47087472e-051  2.84505267e-007]]
Regression coefficients (Intercept):
[[ 4.63763260e-03  0.00000000e+00  2.08242133e-04]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 2.08242133e-04  0.00000000e+00 -7.33317158e-06]]
Regression coefficients (SE^2):
[[ 2.13826844e-08 -1.58295174e-28  3.55113200e-04]
 [-1.58295174e-28  2.30266760e-56  2.45580980e-28]
 [ 3.55113200e-04  2.45580980e-28  1.82818718e+00]]

Creating omega and sigma matrices.
Average Omega (including dropped slices) =
[[-5.53372417e-004 -4.47908790e-050 -3.47376702e-004]
 [-4.47908790e-050  5.68809103e-106  6.12977954e-049]
 [-3.47376702e-004  6.12977954e-049  6.12108703e-005]]
Average Sigma (including dropped slices) =
[[5.38197663e-03 0.00000000e+00 2.08242133e-04]
 [0.00000000e+00 3.56580622e-09 0.00000000e+00]
 [2.08242133e-04 0.00000000e+00 1.18942935e-04]]

Adjusted 13859 SNPs to make omega positive semi-definite.

Dropped 6356207 SNPs due to non-positive-semi-definiteness of omega.
Dropped 0 SNPs due to non-positive-definiteness of sigma.
Dropped 6356207 total SNPs due to non-positive-(semi)-definiteness of omega / sigma.

Running main MAMA method.

Preparing results for output.

    Population 0: ('AFR', 'cyc')
        Mean Chi^2 for ('AFR', 'cyc') = 1.4034997256374622e+19
    Population 1: ('CSA', 'cyc')
        Mean Chi^2 for ('CSA', 'cyc') = nan
    Population 2: ('EUR', 'cyc')
        Mean Chi^2 for ('EUR', 'cyc') = 2.494013823014077e+20

Final SNP count = 15341
Writing results to disk.
    ../cystatinc_mama_merged_AFR_cyc.res
    ../cystatinc_mama_merged_CSA_cyc.res
    ../cystatinc_mama_merged_EUR_cyc.res

Execution complete.

paturley commented 3 years ago

It looks like you have negative diagonal coefficients on your LD score regressions. Maybe try the following option, which will set the intercept to zero: --reg-int-zero and see if that helps.

@ggoldman1 @JonJala Maybe we should make it so the software throws an error in cases like this since the results will never make sense if the diagonal entries of the LD score regression coefficient matrices are negative.
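A guard along the lines Patrick describes could be sketched as follows (illustrative only; MAMA's internal variable names and error handling will differ):

```python
import numpy as np

def assert_valid_ld_coefs(coef_ld):
    """Illustrative guard: the diagonal of the LD-score regression
    coefficient matrix estimates per-population heritability terms,
    so negative diagonal entries signal an unusable fit."""
    diag = np.diag(coef_ld)
    if np.any(diag < 0):
        raise ValueError(f"Negative diagonal LD regression coefficients: {diag}")

assert_valid_ld_coefs(np.array([[0.3, 0.1],
                                [0.1, 0.2]]))   # passes silently
# A matrix like [[-0.002, 0.0], [0.0, 0.001]] would raise ValueError.
```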

samkleeman1 commented 3 years ago

I tried using --reg-int-zero - it runs across all SNPs but I get crazy p values in the region of 8.169262804980647e-291088526. Why do I see negative diagonal coefficients in LD score regressions?

This is the script I use to compute LD scores. IID ancestry file and SNP ancestry file were generated as per the Jupyter Notebook. SNPs are from UK Biobank with INFO > 0.8, randomly selected 1000 subjects per population.

source mama/mama_env/bin/activate 

python3 ./mama/mama_ldscores.py --ances-path "./iid_ances_file" \
                            --snp-ances "./snp_ances_file" \
                            --ld-wind-cm 1 \
                            --stream-stdout \
                            --bfile-merged-path "./ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8" \
                            --out "ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld"

ggoldman1 commented 3 years ago

Hey Sam,

Did you see the most recent response at #15? I see you're using a centimorgan-based window, but from that thread it seems your data doesn't have cM reported. If that's the case, then your LD scores will likely be wrong, which could cause issues.

Grant

samkleeman1 commented 3 years ago

Sorry, the script above was an old version; I used the following (with the ld-wind-kb parameter):

source ../mama/mama_env/bin/activate 

python3 ../mama/mama_ldscores.py --ances-path "../iid_ances_file" \
                            --snp-ances "../snp_ances_file" \
                            --ld-wind-kb 1000 \
                            --stream-stdout \
                            --bfile-merged-path "../ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_${chr}" \
                            --out "../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr${chr}"

If I add in centimorgans, will that fix the issue?

ggoldman1 commented 3 years ago

No, not unless your .bim files have a non-zero third column. Otherwise, leaving it as you have it is fine.

Out of curiosity, what is the mean chi^2 in each of your input GWAS? Per #26, if you rerun with --debug passed then it'll tell you in the log file.

samkleeman1 commented 2 years ago

I can see the following:

Do you have any suggestions about how we can resolve this? We are really keen to use this approach in our work.

    Population 0: ('AFR', 'cyc')
        Mean Chi^2 for ('AFR', 'cyc') = 1.0611160357792465e+20
    Population 1: ('CSA', 'cyc')
        Mean Chi^2 for ('CSA', 'cyc') = nan
    Population 2: ('EUR', 'cyc')
        Mean Chi^2 for ('EUR', 'cyc') = 1.292007853733898e+21

JonJala commented 2 years ago

That looks like it might be the mean chi squared of the outputs? If you pull the latest from the repo and re-run (making sure to keep the --verbose flag set), the log should have the mean chi squared of the input populations after it goes through QC and harmonization of all the input summary statistics.

The log file should have something like "Harmonized AFR cyc mean chi squared: [VALUE]" in it.

samkleeman1 commented 2 years ago

Sure, have done

Harmonized AFR cyc mean chi squared: 0.785988856018892
Harmonized CSA cyc mean chi squared: 0.8874974162834628
Harmonized EUR cyc mean chi squared: 2.9613204740241974
JonJala commented 2 years ago

Those first two mean chi squared values are quite low; from what I understand, they should be above 1. If you include the complete log from the recent run, we could maybe see whether anything else looks funny, but otherwise double-check to make sure your standard errors and betas are ok?
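The check Jon suggests can be done directly on the input sumstats. A generic snippet (hypothetical values; column names are assumptions, not MAMA's code):

```python
import numpy as np

# Under the null, Z = BETA/SE ~ N(0, 1), so mean chi^2 sits near 1 and
# climbs above 1 with real polygenic signal; values well below 1 often
# mean the betas and SEs are on inconsistent scales.
beta = np.array([0.010, -0.020, 0.005, 0.030])
se = np.array([0.010, 0.015, 0.020, 0.010])
mean_chi2 = np.mean((beta / se) ** 2)
```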


samkleeman1 commented 2 years ago

I ran again using a different set of phenotypes. They are standard UK Biobank phenotypes (eGFR creatinine), with GWAS implemented in BOLT-LMM and summary statistics imported directly into MAMA. I am really struggling to understand why it isn't working and am grateful for your help with this.

python3 /mnt/grid/ukbiobank/data/Application58510/skleeman/was/mama/mama/mama.py --sumstats "/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas/AFR/creatinine_summary2.tsv,AFR,cyc" "/mnt/grid/ukbiobank/data/Application58510/skleeman/gwas/CSA/creatinine_summary2.tsv,CSA,cyc" "/mnt/grid/ukbiobank/data/Application58510/skleeman/EUR/creatinine_summary2.tsv,EUR,cyc" \
                   --ld-scores "/mnt/grid/ukbiobank/data/Application58510/skleeman/mama/out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr*.l2.ldscore.gz" \
                   --out "./mama_merged" \
                   --replace-a1-col-match "a1" --verbose \
                   --replace-bp-col-match "BP" --replace-chr-col-match "CHR" --replace-freq-col-match "VAF" \
                   --replace-info-col-match "info" \
                   --replace-a2-col-match "a2" --replace-snp-col-match "snpid" --replace-beta-col-match "beta" \
                   --replace-se-col-match "se" --replace-p-col-match "pval" --allow-palindromic-snps
Harmonized AFR cyc mean chi squared: 1.0315720691747015
Harmonized CSA cyc mean chi squared: 1.0396088044819032
Harmonized EUR cyc mean chi squared: 2.681982126791807

Dropped 6596642 total SNPs due to non-positive-(semi)-definiteness of omega / sigma.

JonJala commented 2 years ago

The ld score creation ran ok, right? What was the full command you ran there?

I'll ask some other folks about the latest results, though.


samkleeman1 commented 2 years ago

I ran across each chromosome separately. No error messages to my knowledge.

source ../mama/mama_env/bin/activate 

python3 ../mama/mama_ldscores.py --ances-path "../iid_ances_file" \
                            --snp-ances "../snp_ances_file" \
                            --ld-wind-kb 1000 \
                            --stream-stdout \
                            --bfile-merged-path "../ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_8" \
                            --out "../out/ukb_EUR_AFR_CSA_EAS_maf0.01_info0.8_ld_chr8"

paturley commented 2 years ago

Curious. It's possible that the LD score software has some problem in non-standardized units. Can you try generating LD scores based on standardized genotypes and run the MAMA code with those LD scores and the standardized model flag?


samkleeman1 commented 2 years ago

What do you mean by standardized genotypes? Like recoding to 0/1/2?

paturley commented 2 years ago

The MAMA method can be based on a model where effect sizes are assumed to be drawn from an iid distribution, either assuming that the genotypes are coded as 0/1/2 or assuming they are standardized to have mean zero and variance one. I think the software's default is the 0/1/2 model, but we recently started noticing a few funny things in the code that make me worry about that specification.

In practice, this would mean regenerating the LD scores and running MAMA with the standardized genotype flag. You would not have to modify any of your data. (Can you remind me what the flags are to do this, Jon or Grant?)
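For intuition, the relationship between the two parameterizations is a textbook identity (this is not the software's internal code, just the conversion the two models imply):

```python
import numpy as np

def per_allele_to_standardized(beta, p):
    """Convert a 0/1/2-coded per-allele effect to standardized-genotype
    units. Under Hardy-Weinberg, Var(G) = 2p(1-p) for allele frequency p,
    so multiplying by sqrt(2p(1-p)) rescales the genotype to variance one."""
    return beta * np.sqrt(2.0 * p * (1.0 - p))

b_std = per_allele_to_standardized(0.05, 0.30)
```

This is why frequency differences across ancestries matter: the same per-allele beta maps to different standardized effects in populations with different allele frequencies.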


ggoldman1 commented 2 years ago

When calling the ldsc script, you should specify --std-geno-ldsc. When calling MAMA, pass --use-standardized-units (we should probably harmonize these; I can open a PR for that today), passing in the LD scores you generated in the previous step.

samkleeman1 commented 2 years ago

Running now, will keep you posted

samkleeman1 commented 2 years ago

It worked! Thanks so much guys!

ggoldman1 commented 2 years ago

Interesting. This is good for us to know. Thanks Sam!
