broadinstitute / cmQTL

High-dimensional phenotyping to define the genetic basis of cellular morphology
BSD 3-Clause "New" or "Revised" License

Sep 15, 2020 Discussions #56

Closed shntnu closed 2 years ago

shntnu commented 4 years ago

Slides are here 2020-09-15-group-jatin.

jatinarora-upmc commented 4 years ago

In today's call, the following concerns were raised regarding the association between cell morphology traits and rare variants.

A cell bin is one of colony, isolate, or intermediate cells.

The following questions are addressed using colony cells.

1. Which features to use?

The sharp drop in the number of features at a correlation cut-off of 0.9 suggests that most features have a correlation > 0.9 with at least one other feature. There are 213 features at cut-off 0.9, and 84 at cut-off 0.8.

image
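As an illustration of what a correlation cut-off does, here is a minimal sketch of a greedy filter. The function name `prune_correlated_features` and the greedy keep-first strategy are assumptions made for this example; the thread does not say which tool or rule was actually used to reduce the features.

```python
import numpy as np

def prune_correlated_features(X, cutoff=0.9):
    """Greedily keep features whose absolute Pearson correlation with every
    already-kept feature is at or below `cutoff`.

    X: array of shape (n_samples, n_features).
    Returns the column indices of the kept features.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] <= cutoff for k in kept):
            kept.append(j)
    return kept

# Toy data (hypothetical): three independent features plus a near-copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=200)])

kept = prune_correlated_features(X, cutoff=0.9)
print(kept)
```

A lower cut-off keeps fewer features, which matches the drop from 213 to 84 features reported above when going from 0.9 to 0.8.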

2. Which variants and genes to use?

Distribution of variants across donors.

Variants with maf < 0.01, so a variant can be present in up to 3 donors.

image

On average, there are 1450 variants per donor.

image image

What is special about these donors with a lot of variants?

image image

I have not removed these outlier donors, since they are of mixed ancestry, which is known to carry more variants.

Distribution of the number of donors each gene has variants in (or how many donors each gene has variants in?).

Variants with maf < 0.1. The right plot is a zoomed-in view of the left one.

image

The median gene has variants in 8 donors, which means a gene has 3 different variants. For now, I selected genes having variants in 1% to 20% of donors (~3 to 60 donors), which gives ~12,000 genes.
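The gene selection rule (variants in 1% to 20% of donors) can be sketched as follows. The boolean donor-by-gene `carrier` matrix and the function name are assumptions for illustration, not the data structure actually used in the analysis.

```python
import numpy as np

def select_genes_by_carrier_fraction(carrier, min_frac=0.01, max_frac=0.20):
    """Return indices of genes whose fraction of carrier donors lies in
    [min_frac, max_frac].

    carrier: boolean array of shape (n_donors, n_genes); True where the donor
    carries at least one qualifying variant in the gene.
    """
    frac = carrier.mean(axis=0)
    return np.flatnonzero((frac >= min_frac) & (frac <= max_frac))

# Toy matrix (hypothetical): 100 donors, 4 genes with 0, 5, 15 and 40 carriers.
carrier = np.zeros((100, 4), dtype=bool)
carrier[:5, 1] = True
carrier[:15, 2] = True
carrier[:40, 3] = True

selected = select_genes_by_carrier_fraction(carrier)
print(selected)
```

Genes with no carriers and genes carried by 40% of donors fall outside the window, so only the two middle genes are kept.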

Associations with different mafs

Top associations with variants having maf < 0.01

Feature Gene Beta P Bon-P FDR-P
Cytoplasm_AreaShape_Zernike_3_1 ZNF436 1.7103445 0.00000002203261 0.05589621 0.05589621
Cells_AreaShape_Zernike_2_2 NMS 1.1757678 0.00000011349305 0.28792914 0.10622767
Cells_Intensity_MinIntensityEdge_AGP GKAP1 1.3279471 0.00000019958726 0.50634808 0.10622767
Cells_AreaShape_Zernike_4_4 URI1 0.8594038 0.00000022325116 0.56638284 0.10622767
Cytoplasm_Intensity_IntegratedIntensity_Brightfield CCDC77 0.7849052 0.00000024464779 0.62066558 0.10622767
Cytoplasm_Intensity_IntegratedIntensity_RNA CCDC77 0.8130849 0.00000025123060 0.63736600 0.10622767
Nuclei_Granularity_1_Brightfield ZNF436 -1.4254993 0.00000035578895 0.90262804 0.10778758
Cells_RadialDistribution_RadialCV_DNA_1of4 LRFN1 -1.4948284 0.00000036807610 0.93380024 0.10778758
Cytoplasm_Intensity_IntegratedIntensity_Brightfield OVOL2 1.1530303 0.00000038237974 0.97008822 0.10778758

image image image image image image

Top associations with variants having maf < 0.05

Feature Gene Beta P Bon-P FDR-P
Cytoplasm_Intensity_IntegratedIntensity_Brightfield C1orf64 1.3661365 0.000000005049320 0.01291923 0.009832509
Cells_Intensity_IntegratedIntensity_ER C1orf64 1.3674117 0.000000007685827 0.01966502 0.009832509
Cytoplasm_AreaShape_Zernike_3_1 ZNF436 1.7103445 0.000000022032614 0.05637282 0.018790941
Cells_AreaShape_Zernike_2_2 NMS 1.1757678 0.000000113493050 0.29038423 0.072596056
Cells_Intensity_MinIntensityEdge_AGP GKAP1 1.3279471 0.000000199587257 0.51066555 0.091828661
Cytoplasm_Intensity_IntegratedIntensity_Brightfield CCDC77 0.7849052 0.000000244647793 0.62595780 0.091828661
Cytoplasm_Intensity_IntegratedIntensity_RNA CCDC77 0.8130849 0.000000251230601 0.64280062 0.091828661
Nuclei_Granularity_1_Brightfield ZNF436 -1.4254993 0.000000355788955 0.91032447 0.097835986
Cells_RadialDistribution_RadialCV_DNA_1of4 LRFN1 -1.4948284 0.000000368076102 0.94176246 0.097835986
Cytoplasm_Intensity_IntegratedIntensity_Brightfield OVOL2 1.1530303 0.000000382379740 0.97835986 0.097835986

image image image image

Basically, there are two more associations with the C1orf64 gene when I take variants with maf < 0.05 instead of 0.01. The rest is the same.

An example from all cells

The first association uses variants with maf < 0.05, and the second uses variants with maf < 0.01. The difference arises from one variant that has maf 0.03 (present in 18 individuals) in our dataset and 0.048 in gnomAD.

image image

Variant maf in cmQTL vs gnomAD

image image

Checking a few variants that have a high maf in cmQTL and a low maf in gnomAD indicates that they might come from particular regions, e.g. very large indels or repetitive regions.

In how many donors should a gene have variants?

Are significant associations due to very few or a large number of donors with variants?

An example of an association driven by 7 donors with variants in the ZNF436 gene:

image

Variants with maf < 0.05:

image

The number of donors with variants in the top 10 associations is 6, 6, 7, 6, 4, 15, 15, 7, 4, 6.

Plot for full data:

image image

Associations from permuted data, using variants with maf < 0.05:

image

The number of donors with variants in the top 10 associations is 5, 3, 18, 4, 4, 4, 3, 23, 19, 3.

ALL features X ALL genes (variants maf < 0.01) image

235 permutations for the top 10 associated features X ALL genes (variants maf < 0.01) image

Do we have any ancestry-specific association?

Since variants can segregate across different populations (ancestries), we investigated whether the observed associations are driven by any specific ancestry. This should not be the case, since we control for ancestry via genotype PCs, but we can cross-check by redoing the rare variant association analysis without controlling for ancestry. NO
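To make the cross-check concrete, here is a hedged sketch that fits the same association with and without an ancestry covariate. The model form (ordinary least squares of a feature on gene carrier status plus genotype PCs), the function `burden_beta`, and the simulated data are all assumptions; the thread does not spell out the actual model.

```python
import numpy as np

def burden_beta(feature, burden, covariates=None):
    """OLS coefficient of `burden` in feature ~ intercept + burden (+ covariates)."""
    cols = [np.ones_like(feature), burden]
    if covariates is not None:
        cols.append(covariates)
    design = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(design, feature, rcond=None)
    return coef[1]

# Simulated donors (hypothetical): ancestry drives both carrier status and the
# feature, while the gene itself has no effect on the feature.
rng = np.random.default_rng(1)
n = 2000
pc1 = rng.normal(size=n)  # stand-in for one genotype PC
carrier_prob = 1.0 / (1.0 + np.exp(-2.0 * pc1))
burden = (rng.random(n) < carrier_prob).astype(float)
feature = pc1 + rng.normal(size=n)

beta_adjusted = burden_beta(feature, burden, pc1[:, None])
beta_unadjusted = burden_beta(feature, burden)
print(beta_adjusted, beta_unadjusted)
```

In this toy setting the unadjusted coefficient is inflated by ancestry and the adjusted one is close to zero. If real associations barely change between the two fits, that would be consistent with them not being ancestry-driven.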

4. How to decide p value threshold for associations?

Total tests = number of features x number of genes. There are multiple ways to set the threshold.

Results with maf < 0.05

Feature Gene Beta P Bon-P FDR-P
Cytoplasm_Intensity_IntegratedIntensity_Brightfield C1orf64 1.3661365 0.000000005049320 0.01291923 0.009832509
Cells_Intensity_IntegratedIntensity_ER C1orf64 1.3674117 0.000000007685827 0.01966502 0.009832509
Cytoplasm_AreaShape_Zernike_3_1 ZNF436 1.7103445 0.000000022032614 0.05637282 0.018790941
Cells_AreaShape_Zernike_2_2 NMS 1.1757678 0.000000113493050 0.29038423 0.072596056
Cells_Intensity_MinIntensityEdge_AGP GKAP1 1.3279471 0.000000199587257 0.51066555 0.091828661
Cytoplasm_Intensity_IntegratedIntensity_Brightfield CCDC77 0.7849052 0.000000244647793 0.62595780 0.091828661
Cytoplasm_Intensity_IntegratedIntensity_RNA CCDC77 0.8130849 0.000000251230601 0.64280062 0.091828661
Nuclei_Granularity_1_Brightfield ZNF436 -1.4254993 0.000000355788955 0.91032447 0.097835986
Cells_RadialDistribution_RadialCV_DNA_1of4 LRFN1 -1.4948284 0.000000368076102 0.94176246 0.097835986
Cytoplasm_Intensity_IntegratedIntensity_Brightfield OVOL2 1.1530303 0.000000382379740 0.97835986 0.097835986

(1) Use Bonferroni correction for multiple testing. This is the most uniform threshold but might be a bit too strict. I tested only features with a pairwise correlation of up to 0.9, but this still leaves many highly correlated features that associate with the same genes (the same may be true for genes). Bonferroni correction treats correlated features or genes as independent tests, so it might not be optimal here.
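To put numbers on option (1), the following sketch computes the Bonferroni threshold from the approximate counts quoted in this thread (213 features, ~12,000 genes). The exact gene count is not given, so the adjusted value only approximates the Bon-P column in the table.

```python
def bonferroni(p, n_tests):
    """Bonferroni-adjusted p value, capped at 1."""
    return min(1.0, p * n_tests)

# Approximate counts from this thread; the exact number of genes is an assumption.
n_features, n_genes = 213, 12000
n_tests = n_features * n_genes

threshold = 0.05 / n_tests  # per-test p value needed for significance at 0.05
p_top = 5.049320e-9         # smallest P in the maf < 0.05 table
adjusted = bonferroni(p_top, n_tests)

print(n_tests, threshold, adjusted)
```

With roughly 2.6 million tests, a raw p value has to be below about 2e-8 to pass, and the top association adjusts to roughly 0.013, in line with the table.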

(2) Use FDR-adjusted p values. This could be complemented by a permutation test to evaluate whether the models are behaving well. Moreover, this approach would correct for both the multiple-gene and multiple-feature testing burdens.
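For reference, here is a minimal sketch of one common FDR adjustment, the Benjamini-Hochberg step-up procedure. That the FDR-P column was computed this way is an assumption; the thread does not name the method.

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p value by m / i.
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Toy p values (hypothetical), not taken from the tables above.
pvals = [0.001, 0.008, 0.039, 0.041, 0.9]
q = benjamini_hochberg(pvals)
print(q)
```

The monotonicity step is why neighbouring rows can share the same adjusted value, a pattern that also appears in the FDR-P column of the tables.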

Implementation:

Step 0. Filter all feature:gene associations with FDR-adjusted p value < 0.1. From this step onwards, "features" means only these filtered features.
Step 1. Permute features together, keeping the inter-feature correlation intact.
Step 2. Calculate the association of features x all genes, and take the lowest p value.
Step 3. Repeat Step 1 and Step 2 at least 1000 times, take the lowest p value from each permutation round (so, 1000 p values in total), and use them to build the background (null) distribution.
Step 4. Keep those feature:gene associations from Step 0 that fall within the top 5% (right tail) of the background distribution.
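Steps 1 to 4 can be sketched as below, under simplifying assumptions: no covariates, no Step 0 pre-filter, and the largest absolute Pearson correlation used in place of the lowest p value (in a simple regression with fixed sample size the two give the same ordering). The function names and simulated data are hypothetical.

```python
import numpy as np

def abs_corr(features, burden):
    """Absolute Pearson correlation of every feature with every gene.
    features: (n_donors, n_features); burden: (n_donors, n_genes)."""
    f = (features - features.mean(0)) / features.std(0)
    g = (burden - burden.mean(0)) / burden.std(0)
    return np.abs(f.T @ g) / features.shape[0]

def max_stat_null(features, burden, n_perm=1000, seed=0):
    """Null distribution of the strongest association. Donor rows of
    `features` are shuffled jointly, so inter-feature correlation is kept."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(features.shape[0])
        null[i] = abs_corr(features[perm], burden).max()
    return null

# Simulated data (hypothetical) with one planted feature:gene association.
rng = np.random.default_rng(42)
n_donors, n_features, n_genes = 300, 5, 50
burden = (rng.random((n_donors, n_genes)) < 0.1).astype(float)
features = rng.normal(size=(n_donors, n_features))
features[:, 0] += 2.0 * burden[:, 0]

observed = abs_corr(features, burden)
null = max_stat_null(features, burden, n_perm=200)  # use >= 1000 in practice
threshold = np.quantile(null, 0.95)                 # top 5% of the background
hits = np.argwhere(observed > threshold)            # (feature, gene) index pairs
print(threshold, hits)
```

Because each permutation round contributes only its strongest association, the resulting threshold accounts for testing all features against all genes at once.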

Distribution of observed p values against the background from permutations.

image

For the lowest p value in the background (Cytoplasm_Intensity_IntegratedIntensity_RNA and GATA2): this association has 5 donors with a variant in the GATA2 gene.

image image image

617/12301 (5.1%) associations have p < 0.05, as expected for a uniform distribution.

Another example of a very low p value in the background distribution:

image image

598/12351 (4.9%) associations have p < 0.05.

Another example, in which the association is driven by 3 points:

image image

729/12341 (5.8%) associations have p < 0.05.

(3) Use a feature-specific background distribution. This might be over-sophisticated. It would correct for the multiple-gene testing burden but not for the multiple-feature burden. However, it could be used to complement FDR-adjusted p values, as in (2).

Supplementary

from original data image image

variants with maf < 0.01 image

image image

jatinarora-upmc commented 4 years ago

@shntnu could you also please have a look at this document, and flag me if you see anything wrong with data or models? These plots might also be useful.

jatinarora-upmc commented 4 years ago

image

image