Sep 15, 2020 Discussions

shntnu commented 4 years ago

Slides are here 2020-09-15-group-jatin.

jatinarora-upmc commented 4 years ago

In today's call, the following concerns were raised regarding the association between cell morphology traits and rare variants.

A cell bin means either colony or isolate or intermediate cells.

The following questions are addressed using colony cells.

1. Which features to use?

The major drop in the number of features at a correlation cut-off of 0.9 suggests that most features have a correlation > 0.9 with at least one other feature. There are 213 features at cut-off 0.9, and 84 at cut-off 0.8.
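A minimal sketch of one way to apply such a correlation cut-off, assuming a greedy rule (keep a feature only if its absolute Pearson correlation with every already-kept feature is at or below the cut-off). The exact filtering procedure used for the numbers above is not stated here, so the function names and the greedy rule are assumptions for illustration only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def prune_correlated(features, cutoff=0.9):
    """Greedy pruning (hypothetical helper).

    features: dict mapping feature name -> list of values, one per donor.
    A feature is kept only if |r| <= cutoff against all features kept so far.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept

# Toy data: "b" is almost a multiple of "a", "c" is only weakly related to "a".
toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
}
print(prune_correlated(toy, cutoff=0.9))  # ['a', 'c']
```

Note that a greedy rule depends on feature order, so a different ordering can keep a different (but similarly sized) subset.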

2. Which variants and genes to use?

Distribution of variants across donors.

Variants with maf < 0.01, so a variant can be in up to 3 donors. On average, there are 1450 variants per donor.

What is special about these donors with a lot of variants?

I have not removed these outlier donors, since they are of mixed ancestry, which is known to carry more variants.

Distribution of the number of donors each gene has variants in (or how many donors each gene has variants in?).

Variants with maf < 0.1. The right plot is a zoomed-in view of the left one. On average, a gene has variants in 8 donors (median), which means a gene has 3 different variants. For now, I selected genes having variants in 1% to 20% of donors (~3 to 60 donors), which gives ~12,000 genes.
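The gene selection rule above (genes with variants in 1% to 20% of donors) could look roughly like this. It is only a sketch: the `carriers` mapping, the gene names, and the cohort size of 300 are hypothetical and chosen so that 1% to 20% corresponds to 3 to 60 donors.

```python
def select_genes(carriers, n_donors, min_frac=0.01, max_frac=0.20):
    """Keep genes whose fraction of donors with variants lies in [min_frac, max_frac].

    carriers: dict mapping gene -> set of donor IDs with at least one
              qualifying variant in that gene (hypothetical input layout).
    """
    selected = []
    for gene, donors in carriers.items():
        frac = len(donors) / n_donors
        if min_frac <= frac <= max_frac:
            selected.append(gene)
    return selected

# Toy example with an assumed cohort of 300 donors.
toy_carriers = {
    "GENE_A": set(range(2)),    # 2 donors  -> below 1%, dropped
    "GENE_B": set(range(8)),    # 8 donors  -> kept
    "GENE_C": set(range(61)),   # 61 donors -> above 20%, dropped
}
print(select_genes(toy_carriers, n_donors=300))  # ['GENE_B']
```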

Associations with different mafs

Top associations with variants having maf < 0.01

Feature	Gene	Beta	P	Bon-P	FDR-P
Cytoplasm_AreaShape_Zernike_3_1	ZNF436	1.7103445	0.00000002203261	0.05589621	0.05589621
Cells_AreaShape_Zernike_2_2	NMS	1.1757678	0.00000011349305	0.28792914	0.10622767
Cells_Intensity_MinIntensityEdge_AGP	GKAP1	1.3279471	0.00000019958726	0.50634808	0.10622767
Cells_AreaShape_Zernike_4_4	URI1	0.8594038	0.00000022325116	0.56638284	0.10622767
Cytoplasm_Intensity_IntegratedIntensity_Brightfield	CCDC77	0.7849052	0.00000024464779	0.62066558	0.10622767
Cytoplasm_Intensity_IntegratedIntensity_RNA	CCDC77	0.8130849	0.00000025123060	0.63736600	0.10622767
Nuclei_Granularity_1_Brightfield	ZNF436	-1.4254993	0.00000035578895	0.90262804	0.10778758
Cells_RadialDistribution_RadialCV_DNA_1of4	LRFN1	-1.4948284	0.00000036807610	0.93380024	0.10778758
Cytoplasm_Intensity_IntegratedIntensity_Brightfield	OVOL2	1.1530303	0.00000038237974	0.97008822	0.10778758

Top associations with variants having maf < 0.05

Feature	Gene	Beta	P	Bon-P	FDR-P
Cytoplasm_Intensity_IntegratedIntensity_Brightfield	C1orf64	1.3661365	0.000000005049320	0.01291923	0.009832509
Cells_Intensity_IntegratedIntensity_ER	C1orf64	1.3674117	0.000000007685827	0.01966502	0.009832509
Cytoplasm_AreaShape_Zernike_3_1	ZNF436	1.7103445	0.000000022032614	0.05637282	0.018790941
Cells_AreaShape_Zernike_2_2	NMS	1.1757678	0.000000113493050	0.29038423	0.072596056
Cells_Intensity_MinIntensityEdge_AGP	GKAP1	1.3279471	0.000000199587257	0.51066555	0.091828661
Cytoplasm_Intensity_IntegratedIntensity_Brightfield	CCDC77	0.7849052	0.000000244647793	0.62595780	0.091828661
Cytoplasm_Intensity_IntegratedIntensity_RNA	CCDC77	0.8130849	0.000000251230601	0.64280062	0.091828661
Nuclei_Granularity_1_Brightfield	ZNF436	-1.4254993	0.000000355788955	0.91032447	0.097835986
Cells_RadialDistribution_RadialCV_DNA_1of4	LRFN1	-1.4948284	0.000000368076102	0.94176246	0.097835986
Cytoplasm_Intensity_IntegratedIntensity_Brightfield	OVOL2	1.1530303	0.000000382379740	0.97835986	0.097835986

Basically, there are two more associations, both with the C1orf64 gene, when I take variants with maf < 0.05 instead of maf < 0.01. The rest is the same.

an example from all cells

The first association uses variants with maf < 0.05, and the second uses variants with maf < 0.01. The difference arises from one variant that has maf 0.03 (present in 18 individuals) in our dataset and 0.048 in gnomAD.

variant maf in cmQTL vs gnomAD

Checking a few variants with high maf in cmQTL and low maf in gnomAD indicates that these might come from particular regions, e.g. very large indels or repetitive regions.

How many donors should a gene have variants in?

Are significant associations due to very few or a large number of donors with variants?

an example of association driven by 7 donors with variants in ZNF436 gene

Variants with maf < 0.05: the number of donors with variants in the top 10 associations is 6, 6, 7, 6, 4, 15, 15, 7, 4, 6.

plot for full data

Associations from permuted data, variants with maf < 0.05: the number of donors with variants in the top 10 associations is 5, 3, 18, 4, 4, 4, 3, 23, 19, 3.

ALL features X ALL genes (variants maf < 0.01)

235 permutations for the top 10 associated features X ALL genes (variants maf < 0.01)

Do we have any ancestry-specific association?

Since variants could segregate across different populations (ancestries), we investigate whether the observed associations are due to any specific ancestry. This should not be the case, since we control for ancestry via genotype PCs, but we can cross-check by re-doing the rare variant association analysis without controlling for ancestry. NO

4. How to decide p value threshold for associations?

Total tests = number of features x number of genes. There are multiple ways to set the threshold.

Results with maf < 0.05

Feature	Gene	Beta	P	Bon-P	FDR-P
Cytoplasm_Intensity_IntegratedIntensity_Brightfield	C1orf64	1.3661365	0.000000005049320	0.01291923	0.009832509
Cells_Intensity_IntegratedIntensity_ER	C1orf64	1.3674117	0.000000007685827	0.01966502	0.009832509
Cytoplasm_AreaShape_Zernike_3_1	ZNF436	1.7103445	0.000000022032614	0.05637282	0.018790941
Cells_AreaShape_Zernike_2_2	NMS	1.1757678	0.000000113493050	0.29038423	0.072596056
Cells_Intensity_MinIntensityEdge_AGP	GKAP1	1.3279471	0.000000199587257	0.51066555	0.091828661
Cytoplasm_Intensity_IntegratedIntensity_Brightfield	CCDC77	0.7849052	0.000000244647793	0.62595780	0.091828661
Cytoplasm_Intensity_IntegratedIntensity_RNA	CCDC77	0.8130849	0.000000251230601	0.64280062	0.091828661
Nuclei_Granularity_1_Brightfield	ZNF436	-1.4254993	0.000000355788955	0.91032447	0.097835986
Cells_RadialDistribution_RadialCV_DNA_1of4	LRFN1	-1.4948284	0.000000368076102	0.94176246	0.097835986
Cytoplasm_Intensity_IntegratedIntensity_Brightfield	OVOL2	1.1530303	0.000000382379740	0.97835986	0.097835986

(1) Use Bonferroni correction for multiple testing. This is the most uniform threshold but might be a bit too strict. I selected features with pairwise correlation of up to 0.9 to test for associations, but this still leaves many features that are highly correlated and associated with the same genes (the same might be true for genes). Bonferroni correction treats every test as independent, so it might not be optimal to also correct for correlated features or genes with it.

(2) Use FDR-adjusted p values. This could be complemented by a permutation test to evaluate whether the models are behaving well. Moreover, this approach would correct for both the multiple-gene and multiple-feature testing burdens. (3)
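For reference, here is a small sketch of the two adjustments in options (1) and (2). The FDR-P column in the table above is consistent with the Benjamini-Hochberg procedure (for example, the second row's Bon-P divided by its rank 2 gives 0.00983251), but the thread does not name the method, so treating it as Benjamini-Hochberg is an assumption. The function names are illustrative.

```python
def bonferroni(pvalues, n_tests=None):
    """Bonferroni-adjusted p values: p * number of tests, capped at 1."""
    n = len(pvalues) if n_tests is None else n_tests
    return [min(p * n, 1.0) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR-adjusted p values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p values (not from the analysis above).
p = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(p))
print(benjamini_hochberg(p))
```

Both functions need the full vector of p values (features x genes), not only the top hits, because the adjustment depends on the total number of tests and, for FDR, on the rank of each p value.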

Implementation:

Step 0. Filter all feature:gene associations with FDR-adjusted p value < 0.1. From this step onwards, "features" means only these filtered features.

Step 1. Permute features together, keeping the inter-feature correlation intact.

Step 2. Calculate the associations of features x all genes, and take the lowest p-value.

Step 3. Repeat Step 1 and Step 2 at least 1000 times, taking the lowest p-value from each permutation round (so 1000 p-values in total), and use them to build the background (null) distribution.

Step 4. Keep those feature:gene associations from Step 0 which are within 5% of the right tail of the background distribution.
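The permutation steps above can be sketched as follows. This is not the model used for the results in this thread: as a stand-in, the sketch uses a plain Pearson correlation between a feature and a per-donor gene burden (no genotype PCs or other covariates), and it tracks the largest |r| per round instead of the lowest p value, which gives the same ranking for a simple correlation test at a fixed sample size. All names and input layouts are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def max_abs_corr(features, burdens):
    """Strongest association over all feature x gene pairs (Step 2 stand-in)."""
    return max(abs(pearson(f, g)) for f in features.values() for g in burdens.values())

def permutation_null(features, burdens, n_perm=1000, seed=0):
    """Steps 1 to 3: shuffle donors once per round and apply the SAME shuffle
    to every feature, so inter-feature correlation is preserved."""
    rng = random.Random(seed)
    n_donors = len(next(iter(features.values())))
    null = []
    for _ in range(n_perm):
        perm = list(range(n_donors))
        rng.shuffle(perm)
        shuffled = {name: [vals[i] for i in perm] for name, vals in features.items()}
        null.append(max_abs_corr(shuffled, burdens))
    return null

def empirical_p(observed, null):
    """Step 4: share of null rounds at least as extreme as the observed statistic."""
    return (1 + sum(s >= observed for s in null)) / (1 + len(null))
```

An observed feature:gene association from Step 0 would then be kept when `empirical_p(abs(r_observed), null) < 0.05`, which corresponds to falling in the top 5% of the background distribution.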

distribution of observed p values against background from permutations.

for lowest p value in background (Cytoplasm_Intensity_IntegratedIntensity_RNA and GATA2). This association has 5 donors with variant in GATA2 gene. 617/12301 (5.1%) associations have p < 0.05 -- as expected for a uniform distribution.

another example of very low p value in background distribution 598/12351 (4.9%) associations have p < 0.05

another example, in which the association is driven by 3 points. 729/12341 (5.8%) associations have p < 0.05

(3) Use a feature-specific background distribution. This might be over-sophisticated. It would correct for the multiple-gene testing burden, but not for the multiple-feature burden. However, it could be used to complement FDR-adjusted p values, as in (2).

Supplementary

from original data

variants with maf < 0.01

jatinarora-upmc commented 4 years ago

@shntnu could you also please have a look at this document, and flag me if you see anything wrong with data or models? These plots might also be useful.

jatinarora-upmc commented 4 years ago
