chr1swallace / coloc

Repo for the R package coloc

MAF score from GWAS results using COLOC.susie as well as LD matrix question #101

Closed HKJ396 closed 1 year ago

HKJ396 commented 1 year ago

Hello,

Thank you so much for developing this package. I have my GWAS summary data filtered down to just the top hits, GTEx eQTL data for the relevant tissue, and an LD matrix. However, I have a couple of questions about the coloc.susie function.

First of all, when I supply MAF scores for the GWAS summary results, should this be the minor allele frequency, or just the frequency of the associated allele, which might not necessarily be the minor allele? When I look at my PLINK allele frequency output, some variants have a frequency above 0.5 and others a frequency below 0.5.

Here are the first lines of the file:

CHROM ID REF ALT ALT_FREQS OBS_CT
1 rs376342519:10616:CCGCCGTTGCAAAGGCGCGCCG:C CCGCCGTTGCAAAGGCGCGCCG C 0.995708 3862
1 1:54712:T:TTTTC T TTTTC 0.595903 3862
1 rs368808541:603010:C:A C A 0.00194132 3862

The frequency is given for C (0.995…), whereas on dbSNP the MAF is ~0.1 for CCGCCGTTGCAAAGGCGCGCCG. For the third line, rs368808541, the MAF matches what's on dbSNP. So I'm not sure whether these are the values I should use in the coloc function, or whether I should take 1-(any value above 0.5) to find the minor allele frequency?

I have also calculated an LD matrix for my top SNPs using the PLINK --r2 function, so that only SNPs within 1 Mb of my top SNPs and with r2 above 0.6 are present. Is this okay? I notice the manual says to use --r rather than --r2 if the datasets have different LD; however, I will only keep the SNPs that are common to both datasets, so I am assuming they will have the same LD?

Thank you in advance.

chr1swallace commented 1 year ago

Hi,

I'm afraid just having the top hits is not going to be enough. Please see the vignette "Coloc: data structures": http://chr1swallace.github.io/coloc/articles/a02_data.html

It sets out what ideal and acceptable data structures are for coloc.abf and coloc.susie, which require the same data format, and covers some points on how you should select the SNPs for study.

MAF can be either minor allele frequency or effect allele frequency. It is used as f*(1-f), which is the same for f and 1-f, but it should come from samples as close as possible to those in your study. LD does need to be r and not r squared.
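To make the f*(1-f) point concrete, here is a minimal sketch (not part of coloc, and not from the original thread) that folds the ALT_FREQS values quoted above to minor allele frequencies and checks that f*(1-f) is unchanged by the folding. The function names `fold_to_maf` and `allele_variance_term` are hypothetical helpers introduced only for this illustration.

```python
def fold_to_maf(freq):
    """Return the minor allele frequency for any allele frequency in [0, 1]."""
    if not 0.0 <= freq <= 1.0:
        raise ValueError(f"allele frequency out of range: {freq}")
    return min(freq, 1.0 - freq)


def allele_variance_term(freq):
    """The f*(1-f) term; it takes the same value for f and 1-f."""
    return freq * (1.0 - freq)


# ALT_FREQS values copied from the PLINK output quoted earlier in the thread.
alt_freqs = [0.995708, 0.595903, 0.00194132]
mafs = [fold_to_maf(f) for f in alt_freqs]

for f, maf in zip(alt_freqs, mafs):
    # Both columns of f*(1-f) agree, so either frequency convention works here.
    print(f"ALT_FREQ={f:<10} MAF={maf:.6f} "
          f"f*(1-f)={allele_variance_term(f):.6f} "
          f"maf*(1-maf)={allele_variance_term(maf):.6f}")
```

Folding is therefore harmless but not required for this term. What matters more is that the frequencies come from a population similar to the study samples.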



HKJ396 commented 1 year ago

Thanks for your quick response. So I will provide my full set of GWAS summary results. I will recreate my LD matrix using --r (raw inter-variant allele count correlations) rather than --r2 (which reports squared correlations). I will also recalculate MAF as 1 - frequency wherever the frequency is 0.5 or higher. Thanks!
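As an illustration of why the signed correlation carries information that the squared one loses, the sketch below computes a Pearson correlation between allele counts for a few made-up samples. It is plain Python written for this note, not PLINK's implementation, and the genotype vectors are invented. Counting the other allele at one variant flips the sign of r but leaves r squared unchanged.

```python
from math import sqrt


def pearson_r(x, y):
    """Signed Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a monomorphic variant")
    return cov / sqrt(var_x * var_y)


# Made-up allele counts (0, 1 or 2 copies of the counted allele) for 8 samples.
snp_a = [0, 0, 1, 1, 2, 2, 1, 0]
snp_b = [0, 0, 1, 1, 2, 2, 1, 0]   # perfectly correlated with snp_a
snp_c = [2 - g for g in snp_a]     # same genotypes, but the other allele is counted

r_ab = pearson_r(snp_a, snp_b)
r_ac = pearson_r(snp_a, snp_c)

print("r(a, b) =", r_ab, " r^2 =", r_ab ** 2)
print("r(a, c) =", r_ac, " r^2 =", r_ac ** 2)
```

The two pairs have opposite signs of r but identical r squared, so a matrix of squared values cannot tell them apart.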

HKJ396 commented 1 year ago

Sorry, but I have another question: when I recreate the LD matrix, shall I use the full set of GWAS summary results, or the set filtered by p value?

chr1swallace commented 1 year ago

always the full set, but make sure the alleles are the same way round in your two datasets.
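One way to check that alleles are "the same way round" is sketched below. This is an assumption-laden illustration rather than anything coloc provides: the helper `align_to_reference` and the example records are hypothetical, and strand flips (for example A/G reported as T/C) are deliberately not handled.

```python
def align_to_reference(ref_alleles, effect_allele, other_allele, beta):
    """Return beta expressed for ref_alleles[0], or None if the alleles do not match.

    ref_alleles is the (effect, other) pair used by the reference dataset.
    """
    if (effect_allele, other_allele) == ref_alleles:
        return beta
    if (other_allele, effect_allele) == ref_alleles:
        return -beta  # same variant, opposite allele coded: flip the sign
    return None       # different alleles: leave for manual inspection


# Made-up records: SNP id -> (effect allele, other allele, beta).
gwas = {"snp1": ("A", "G", 0.12), "snp2": ("C", "T", -0.05), "snp3": ("G", "T", 0.30)}
eqtl = {"snp1": ("A", "G", 0.40), "snp2": ("T", "C", 0.22), "snp3": ("G", "A", 0.10)}

aligned_eqtl_beta = {}
for snp, (ea, oa, _) in gwas.items():
    e_ea, e_oa, e_beta = eqtl[snp]
    aligned_eqtl_beta[snp] = align_to_reference((ea, oa), e_ea, e_oa, e_beta)

print(aligned_eqtl_beta)
```

The same allele coding would also need to apply to the LD matrix, since the sign of r depends on which allele is counted.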



HKJ396 commented 1 year ago

When I use the plink command I have set the window size as --ld-window-kb 1000. Is this okay?

Because I'm not using r2 now, should I leave out the r2 threshold? I set an r2 threshold of 0.6 before, so I am assuming I should not filter on r2 now?

chr1swallace commented 1 year ago

you have to pick a sensible window. I don't use plink, so can't advise on its options



HKJ396 commented 1 year ago

Thank you! So I have provided the full list of SNPs (roughly 7 million) that was used in the GWAS as an input file to calculate r (which gives me raw inter-variant allele count correlations). The only filter I used was the kb window. Is this okay?

chr1swallace commented 1 year ago

I don't know what your GWAS looks like, so I can't say. You need to look at the signal(s). Pick a window which contains the signal and its decay.



HKJ396 commented 1 year ago

So I supply my full set of GWAS results (all the SNPs used in the GWAS). When you say signal, do you mean the top hits, e.g. all the genome-wide significant hits, or SNPs with p value below 10^-6 if I haven't got genome-wide hits? Do I then provide a window either side of the top hits? Sorry, I am really confused.

chr1swallace commented 1 year ago

ok, then I'm going to take this as a chance to improve my docs. Did you read http://chr1swallace.github.io/coloc/articles/a02_data.html ?



HKJ396 commented 1 year ago

I really do apologise. I'm very new to all of this. I've realised that rather than supplying GWAS results filtered by p value, I have to provide my top hits (with the lowest p value), which I assume is what sentinel means, but also the SNPs around these peaks, e.g. 1 Mb around the peak SNPs? I then cross-check these SNPs against the eQTL data, so that only SNPs present in both datasets are included? And then I provide an LD matrix.

chr1swallace commented 1 year ago

Yes
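The workflow confirmed here (take every SNP in a window around the sentinel, then keep those also present in the eQTL data) can be sketched as follows. This is an illustrative outline with made-up SNP ids and positions. The helper names and the 1 Mb default are assumptions taken from the example in the question, not recommendations, and the window should still be chosen to contain the signal and its decay.

```python
def snps_in_window(positions, sentinel_pos, half_window_bp=1_000_000):
    """IDs of SNPs within half_window_bp of the sentinel (same chromosome assumed)."""
    lo = sentinel_pos - half_window_bp
    hi = sentinel_pos + half_window_bp
    return {snp for snp, pos in positions.items() if lo <= pos <= hi}


def shared_region_snps(gwas_positions, eqtl_snps, sentinel_pos, half_window_bp=1_000_000):
    """SNPs in the GWAS window that are also in the eQTL data, ordered by position."""
    in_window = snps_in_window(gwas_positions, sentinel_pos, half_window_bp)
    shared = in_window & set(eqtl_snps)
    return sorted(shared, key=lambda snp: gwas_positions[snp])


# Made-up GWAS SNP positions on one chromosome; "rsB" plays the sentinel.
gwas_positions = {
    "rsA": 4_200_000,
    "rsB": 5_000_000,
    "rsC": 5_600_000,
    "rsD": 5_950_000,
    "rsE": 7_500_000,   # outside the 1 Mb window
}
eqtl_snps = ["rsA", "rsB", "rsD", "rsE"]   # rsC is missing from the eQTL data

region = shared_region_snps(gwas_positions, eqtl_snps, gwas_positions["rsB"])
print(region)
```

The resulting ordered list is what both datasets, and the rows and columns of the signed LD matrix, would be subset to. Note that all SNPs in the window are kept, regardless of p value.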


