Biobank QC and GWAS - Githubissues

Nealelab / UK_Biobank_GWAS

Overview of the data QC, code, and GWAS summary output from the 2017 UK Biobank data release


Biobank QC and GWAS #21

Closed montenegrina closed 4 years ago

montenegrina commented 5 years ago

Hello,

I was wondering if I could ask whether, during QC, you removed all subjects in the category "Participant excluded from kinship inference process"?

Or did you just select "No kinship found" and "NA" from the genetic_kinship_to_other_participants_f22021_0_0 Biobank data field?

I am trying to do GWAS myself and I am getting around 17000 more subjects than you guys used in your GWAS.

Can you please advise? Thanks, Ana

howrigan commented 5 years ago

Hi Ana,

In our analysis, we didn't use information from the genetic_kinship_to_other_participants_f22021_0_0 UK Biobank data field, but used the information provided in the ukb_sqc_v2.txt file that accompanied the genotype and imputed data release. Access to download these files was provided for each UKB application.

This site provides the column descriptions for each file in the release: https://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/ukb_genetic_data_description.txt

In particular, we used the "used.in.pca.calculation" column under "Sample QC" descriptions to subset to unrelated individuals. This strategy means that we did not provide our own specific relatedness threshold, but instead adopted the threshold used by the UKB analysis team when they ran the PCA analysis.
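A minimal sketch of that subsetting step, assuming the sample QC table has been given a header row and that the flag is stored as 0/1 in a column named `used.in.pca.calculation`. The `sample_id` column and the inline toy table are hypothetical stand-ins, so the real file layout should be checked against the UKB description file before reusing this.

```python
import csv
import io

# Toy stand-in for the sample QC table (hypothetical layout and IDs).
SAMPLE_QC = """sample_id used.in.pca.calculation
1001 1
1002 0
1003 1
"""

def unrelated_ids(handle, flag_column="used.in.pca.calculation"):
    """Return the IDs of samples whose PCA flag equals 1."""
    reader = csv.DictReader(handle, delimiter=" ")
    return [row["sample_id"] for row in reader if row[flag_column] == "1"]

# Keep only samples flagged as used in the PCA calculation.
keep = unrelated_ids(io.StringIO(SAMPLE_QC))
print(keep)
```

With a real file, the same function could be given an open file handle instead of the in-memory string.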

There is a separate kinship table provided in the data release (ukb[application number]_rel_s[sample size].dat), which has the IBD values for any pair estimated with > 3rd degree relatedness.

Hope this helps! Dan

montenegrina commented 5 years ago

Hi Dan,

thank you so much for getting back to me.

Yes I did use this function https://www.rdocumentation.org/packages/ukbtools/versions/0.11.3/topics/ukb_gen_samples_to_remove and relatedness table from UK Biobank to deal with > 3rd degree relatedness.

But for some reason I did not get this ukb_sqc_v2.txt file with the rest of the files I requested and received. I guess I would need to write to them.

But just to confirm: in the ukb_sqc_v2.txt file there is a column used.in.pca.calculation?

Is this column any different from the used_in_genetic_principal_components_f22020_0_0 data field? (I used that one.)

Also I wanted to ask you about controlling for the Array type, there are those two:

Affymetrix UK BiLEVE Axiom array on an initial 50,000
Affymetrix UK Biobank Axiom array on remaining 450,000

So to determine which subject belongs to which array, did you use data field 22051? Did you control for array type?

Is there a description anywhere of which columns you used, and how, to get those 337,199 QC-positive individuals? I did all the QC except for the step with the ukb_sqc_v2.txt file (I dealt with relatedness my way) and I got 337147 subjects.

Cheers, Ana

howrigan commented 5 years ago

Hi Ana,

For the ukb_sqc_v2 file, they may have changed the file name in v3 to ukb[application number]_sample_qc.tsv, but the table is largely similar. They may have updated the files/column names again since we downloaded data, and "used_in_genetic_principal_components_f22020_0_0" sounds like it would be similar. The numbers you refer to, 22020 and 22051, are unfamiliar to me, so we didn't draw from those files/columns.

As for array, we did not control for it, although it isn't a bad idea to do that, as there are likely to be array-specific effects that are still present even after imputation. If your goal is to compare directly to our rapid-GWAS approach, don't include array, but otherwise I think it would have been a good thing to add in hindsight.

Finally, given that you are only 52 samples different after your QC, my hunch is that maybe there was some participant withdrawal between applications. Another guess could be using "greater than/equal" vs "greater than" in some filter, where a number of individuals were right on the cutoff. These are only guesses, but with only a small difference in sample sizes, I would say things are matching up pretty well and there are unlikely to be clear differences in the downstream association mapping.
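Both guesses can be checked directly. The sketch below is only illustrative: the function names, the toy IDs, and the example cutoff are hypothetical, and it assumes you have the two post-QC ID lists plus a list of withdrawn participants to compare against.

```python
def compare_sample_sets(mine, reference, withdrawn=()):
    """Report which IDs differ between two post-QC sample lists."""
    mine, reference, withdrawn = set(mine), set(reference), set(withdrawn)
    only_reference = reference - mine
    return {
        "only_mine": sorted(mine - reference),
        "only_reference": sorted(only_reference),
        # IDs missing from my list that appear on the withdrawal list
        "explained_by_withdrawal": sorted(only_reference & withdrawn),
    }

def n_passing(values, cutoff, inclusive):
    """Count values passing a filter with >= (inclusive) or > (strict)."""
    if inclusive:
        return sum(v >= cutoff for v in values)
    return sum(v > cutoff for v in values)

# Toy IDs: sample 4 is in the reference set only and has withdrawn.
diff = compare_sample_sets(mine=[1, 2, 3, 5], reference=[1, 2, 3, 4], withdrawn=[4])
print(diff)

# Toy QC metric with two samples sitting exactly on the cutoff.
scores = [0.97, 0.98, 0.98, 0.99]
print(n_passing(scores, 0.98, inclusive=True), n_passing(scores, 0.98, inclusive=False))
```

If the IDs unique to one list cluster at a filter boundary, the inclusive versus strict comparison is the more likely explanation; if they appear on the withdrawal list, it is the withdrawals.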

best, Dan

montenegrina commented 5 years ago

Hi Dan,

thank you so much for this elaborate reply. Just one more question. I am using your website to compare with my results. So for example for Diabetic Eye Disease, which is what I study, I use both HES and Questionnaire data to create my pheno file. I do get somewhat larger p-values than on your website. My question is: did you use only Questionnaire data (did not include HES) and did you use any covariates in regression? I do use: age, sex, type of diabetes and the first 10 PCs.

Regards, Ana

montenegrina commented 5 years ago

Hi Dan,

sorry to bother you again. I would just like to compare my results with yours.

In my GWAS this is what I get for these SNPs:

rs11867934 my pval is: 0.0928623
rs8065832 my pval is: 0.0391712
rs7218795 my pval is: 0.00186646
rs1708618 my pval is: 0.0815713

On your website I searched for 20002_1276: Non-cancer illness code, self-reported: diabetic eye disease, and then for these SNPs:

rs11867934 your pval is: 8.1e-1
rs8065832 your pval is: 1.5e-3
rs7218795 your pval is: 4.3e-2
rs1708618 your pval is: 6.8e-3

So this means that you used these data fields: 20002 and 1276? By the way, I could not find a 1276 data field on the UK Biobank website.

I used only questionnaire data to define cases and controls:
for CASES: data field 6148, who had answered: Diabetes related eye disease
for CONTROLS: data field 2443, who answered: Yes

I did QC for: Caucasian only, used_in_genetic_principal_components, heterozygosity and missing rate, sex_chromosome_aneuploidy and relatedness.

As I mentioned before, I got 337147 subjects. In the GWAS I had these covariates: genetic sex, age, type of diabetes and the first 10 PCs.

Can you please give some insight as to why we get these different results?

Thanks Ana

howrigan commented 5 years ago

Hi Ana,

Thanks for following up. Below are my responses:

Did you use only Questionnaire data (did not include HES)

We didn't include HES in the rapid-GWAS, as they were released afterwards. We will likely run these phenotypes in a subsequent rapid-GWAS run.

did you use any covariates in regression?

Yes the covariates we used are listed here: https://github.com/Nealelab/UK_Biobank_GWAS#imputed-v3-association-model

So this means that you used these data fields: 20002 and 1276?

I couldn't find anything on 1276 either, although I do find diabetic eye disease listed under 20002 data field (neurology/eye/psychiatry -> eye/eyelid problem -> diabetic eye disease -> 1395 cases). Not all codes used will correspond directly to a data field in UK Biobank, and may be a result of recoding by PHESANT (see https://github.com/astheeggeggs/PHESANT for details). Best to look into the phenotype summary files and the PHESANT notes to get more details on how the phenotype was derived.

For 20002_1276, the phenotype summary file lists 699 cases / 360442 controls, which may explain the discrepancy in p-values. Choice of covariates and samples will also change p-values, particularly when there is no strong association.

Finally, for diabetic eye disease, you can also take a look at this GWAS:

6148_1 - "Eye problems/disorders: Diabetes related eye disease"

This one has 2249 cases / 115641 controls

http://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=6148

best, Dan

montenegrina commented 5 years ago

Hi Dan,

thanks for getting back to me. Yes, I was wondering about your 6148_1. Why does 6148_1 have 108817 samples, while the others have around 377000?

Thanks Ana

howrigan commented 5 years ago

Hi Ana,

The quick answer is that some phenotypes use only non-missing values (e.g. only individuals who provided an answer to the question), whereas other phenotypes, such as the ICD10 codes, treated all individuals without the code as a control.
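As a rough illustration of why the sample sizes differ, the sketch below codes the same toy cohort both ways. It is not the PHESANT implementation; the function names, participant IDs, and answer strings are all hypothetical.

```python
def code_non_missing(answers, case_value):
    """Case = 1, any other answer = 0, unanswered (None) is dropped."""
    return {
        pid: int(answer == case_value)
        for pid, answer in answers.items()
        if answer is not None
    }

def code_all_as_control(all_ids, ids_with_code):
    """Case = 1 if the code is present, everyone else = 0."""
    return {pid: int(pid in ids_with_code) for pid in all_ids}

# Toy cohort of four people; two never answered the question.
answers = {"a": "diabetic eye disease", "b": "cataract", "c": None, "d": None}

questionnaire = code_non_missing(answers, "diabetic eye disease")
code_based = code_all_as_control(answers.keys(), {"a"})

print(len(questionnaire), len(code_based))
```

The case count is the same under both codings, but the second keeps every participant in the analysis, which is why its total N is so much larger.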

Dan

montenegrina commented 5 years ago

I extracted data field 6148, taking anyone with any value equal to “Diabetes related eye disease” as a CASE, and anyone in diabetes_diagnosed_by_doctor_f2443 who answered "Yes" as a CONTROL (in this scenario I used only questionnaire data).

After doing all the QC I got 14696 CONTROLS and 2334 CASES.

Does that seem right to you?

howrigan commented 5 years ago

If the category you are selecting on is "have you been diagnosed with eye disease conditional on having a diabetes diagnosis?", that looks correct.

In our rapid-GWAS, the automated PHESANT coding for the 6148_1 GWAS is selecting on "Have you been diagnosed with diabetes related eye disease?" among all participants who answered the eyesight touchscreen questionnaire.

The control selection is very different between those two, and will lead to different results (although strongly assoc. SNPs in cases are likely to pop up in both GWAS). This is a perfect scenario for why our rapid-GWAS approach has inherent limitations when asking more nuanced phenotypic questions, and is better seen as an "initial" lookup for genetic association signals in UKB.
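To make the difference concrete, here is a small sketch of the two designs on toy data. The records, field names, and functions are hypothetical, and the second function only approximates the selection described above rather than reproducing the PHESANT coding.

```python
CASE_ANSWER = "Diabetes related eye disease"

# Toy participants: eye_answer is None if the eyesight question was not answered.
PEOPLE = [
    {"id": 1, "eye_answer": CASE_ANSWER, "diabetes": True},
    {"id": 2, "eye_answer": "Cataract", "diabetes": True},
    {"id": 3, "eye_answer": "Cataract", "diabetes": False},
    {"id": 4, "eye_answer": None, "diabetes": True},
    {"id": 5, "eye_answer": None, "diabetes": False},
]

def conditional_design(people):
    """Controls are diabetics without the eye disease; non-diabetics are dropped."""
    cases = [p["id"] for p in people if p["eye_answer"] == CASE_ANSWER]
    controls = [
        p["id"] for p in people
        if p["diabetes"] and p["eye_answer"] != CASE_ANSWER
    ]
    return cases, controls

def all_respondents_design(people):
    """Controls are all who answered the eye question with another answer."""
    answered = [p for p in people if p["eye_answer"] is not None]
    cases = [p["id"] for p in answered if p["eye_answer"] == CASE_ANSWER]
    controls = [p["id"] for p in answered if p["eye_answer"] != CASE_ANSWER]
    return cases, controls

print(conditional_design(PEOPLE))
print(all_respondents_design(PEOPLE))
```

The cases agree, but the control groups contain different people: one compares against diabetics without eye disease, the other against everyone who answered the eyesight question.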

montenegrina commented 5 years ago

Thank you so much for all the insights, it is very valuable to me!
