How to use local outcome data?

krz commented 5 years ago

Hi,

Assume I want to do a hypothetical analysis for the effect of BMI on coffee consumption The first step, obtaining the exposure data, is easy. According to the documentation, I do

bmi_gwas <- subset(gwas_catalog, grepl("Speliotes", Author) & Phenotype == "Body mass index")

We find 27 SNPs associated with BMI according to this study. Assume they are called rs001 - rs027.

The second step, getting the outcome, is more difficult. There is no GWAS for coffee consumption. However, I have a local data set with 10,000 individuals, where coffee consumption and the 27 SNPs are available:

 id  |  coffee_consumption | rs001 | rs002 |...
----------------------------------------------
 1   |                 50  |    1  |   2   |
 2   |                  3  |    2  |   0   |

Using this data set, what is the easiest way of obtaining the relevant outcome data frame that I can use for two-sample MR? I can see obtaining effect.outcome, se.outcome, and even pval.outcome from univariate regressions of the SNPs against coffee_consumption but I don't know about the other variables like a1. Are they really necessary? Is there a method or package that inputs my data and automatically outputs a data frame like the one that TwoSampleMR expects?

Thanks

explodecomputer commented 5 years ago

Hey some suggestions below:

Step 1. Generate the summary data for the 27 SNPs' effects on coffee consumption. Standard GWAS practice involves running a regression of coffee_consumption ~ rs001 + pc1 + pc2 + ... + pc10 + other covariates for the first SNP and recording the beta for the SNP, standard error, p-value, effect allele*, other allele, sample size. The generated summary data will look like

SNP effect_allele other_allele beta se pval n
rs001 A G 0.1 0.001 1e-5 10000
rs002 ...

Normally this is done using software like plink, boltlmm, GenABEL, SNPTEST, depending on the data and the circumstances of the samples etc. You can run it manually in R as described above also. The point is to get valid estimates of the SNP effects on the phenotype

the effect allele is important because you want to ask if having more 'G' alleles increases BMI then do they also increase coffee consumption. In the case of your example data for rs001, if that SNP has 2 alleles A and G and your data is coded such that 1 = only one G allele and 2 = two G alleles, then G is the effect allele.

Step 2. Read your generated summary data into R using the TwoSampleMR read_outcome_data function and follow the steps from there https://mrcieu.github.io/TwoSampleMR/

krz commented 5 years ago

Thanks. So it is required that I perform my own GWAS, or at least have the 10 principal components for all SNPs. Only having the 27 dialletic SNPs is not enough?

From the mr.raps package description it seems that I only require the beta and se values for exposure and treatment, nothing else

This function calls mr.raps.shrinkage with overdispersion = TRUE, loss.function = "huber". The input data frame should contain the following variables:

beta.exposure beta.outcome se.exposure se.outcome

But I see that I have to include the principal components at least as you said.

explodecomputer commented 5 years ago

Yeah it's good practice to control for confounding due to population structure when running any genetic association analysis - if you don't have genome wide data or the actual derived genetic principle components then there's not much you can do I guess, but there is a risk that your effect estimates will be inflated.

You don't technically need the effect allele to do the mr.raps analysis because it assumes that the beta for the exposure and the beta for the outcome are in terms of the same effect allele. The TwoSampleMR package doesn't use the effect allele information to do the statistical analysis, just to harmonise the exposure and outcome datasets prior to the analysis. Assuming knowledge of the effect allele is typically a bit dangerous as every study uses a different way of coding the alleles. e.g. see https://academic.oup.com/ije/article/45/6/1717/3072174 for examples of where failing to correctly harmonise led to the wrong answers being published.

NamraAhmad777 commented 1 year ago

@explodecomputer Hello sir ! i am having problems with the steps prior to data harmonization can you help me please? l have two file from the merging step now what shall i do next? i have not performed clumping as well.

merging step code: sleep_durations_p_select_IL2bout_merge1 <- merge(IL2_allSNP,sleep_durations_p_select,by=c("effect_allele","other_allele" ,"CHR","BP"), suffixes = c (".sleep" , ".IL2b")) nrow(sleep_durations_p_select_IL2bout_merge1)

flip_sleep_durations_p_select <- sleep_durations_p_select setnames(flip_sleep_durations_p_select, c("effect_allele","other_allele"), c ("other_allele","effect_allele" ))

flip_sleep_durations_p_select <--flip_sleep_durations_p_select

sleep_durations_p_select_IL2bout_merge2 <- merge(IL2_allSNP,flip_sleep_durations_p_select, by=c("effect_allele","other_allele" ,"CHR","BP"), suffixes = c (".sleep" , ".IL2b")) nrow(sleep_durations_p_select_IL2bout_merge2)

MRCIEU / TwoSampleMR

How to use local outcome data? #97