In today's call, the following concerns were raised regarding the association between cell morphology traits and rare variants.
A cell bin is one of three categories: colony, isolate, or intermediate cells.
The following questions are addressed using colony cells.
1. Which features to use?
The major drop in the number of features at a correlation cut-off of 0.9 suggests that most of the features have a correlation > 0.9 with at least one other feature. There are 213 features at cut-off 0.9, and 84 at cut-off 0.8.
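The issue does not say how the correlation cut-off was applied, so the following is only a minimal sketch of one common approach: a greedy filter that keeps a feature only if its absolute Pearson correlation with every already-kept feature is at or below the cut-off. The function names and the toy feature values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def filter_correlated(features, cutoff=0.9):
    """Greedy filter: keep a feature only if |r| <= cutoff against all kept features."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: feat_b is a near-duplicate of feat_a.
toy = {
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feat_b": [1.1, 2.0, 2.9, 4.2, 5.1],
    "feat_c": [2.0, -1.0, 0.5, 3.0, -2.0],
}
kept = filter_correlated(toy, cutoff=0.9)
print(kept)
```

A greedy filter depends on feature order, so the exact counts at each cut-off (such as 213 and 84) would vary with the ordering and with the filtering method actually used.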
Variants with maf < 0.01, so a variant can be present in up to 3 donors. On average, there are 1450 variants per donor.
What is special about the donors with a lot of variants?
I have not removed these outlier donors, since they are of mixed ancestry, which is known to carry more variants.
Variants with maf < 0.1. The right plot is a zoomed-in view of the left one. A gene has variants in 8 donors (median), which means a gene has 3 different variants. For now, I selected genes having variants in 1% to 20% of donors (~3 to 60 donors), which gives ~12,000 genes.
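The gene selection rule above (genes with variants in 1% to 20% of donors) can be sketched as follows. This is an illustration only: the input layout (a mapping from gene to the donors carrying a qualifying variant), the gene and donor names, and the cohort size of 300 are assumptions. The cohort size is inferred from "~3 to 60 donors" corresponding to 1% to 20%.

```python
def select_genes(carriers_by_gene, n_donors, min_frac=0.01, max_frac=0.20):
    """Return {gene: carrier fraction} for genes whose carrier fraction is in range."""
    selected = {}
    for gene, donors in carriers_by_gene.items():
        frac = len(set(donors)) / n_donors
        if min_frac <= frac <= max_frac:
            selected[gene] = frac
    return selected

# Hypothetical toy input: gene -> donors carrying at least one qualifying variant.
n_donors = 300  # assumed cohort size
carriers_by_gene = {
    "GENE_A": [f"donor_{i}" for i in range(2)],   # 0.7% of donors, too rare
    "GENE_B": [f"donor_{i}" for i in range(8)],   # 2.7% of donors, kept
    "GENE_C": [f"donor_{i}" for i in range(90)],  # 30% of donors, too common
}
selected = select_genes(carriers_by_gene, n_donors)
print(selected)
```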
Top associations with variants having maf < 0.01

Feature | Gene | Beta | P | Bon-P | FDR-P |
---|---|---|---|---|---|
Cytoplasm_AreaShape_Zernike_3_1 | ZNF436 | 1.7103445 | 0.00000002203261 | 0.05589621 | 0.05589621 |
Cells_AreaShape_Zernike_2_2 | NMS | 1.1757678 | 0.00000011349305 | 0.28792914 | 0.10622767 |
Cells_Intensity_MinIntensityEdge_AGP | GKAP1 | 1.3279471 | 0.00000019958726 | 0.50634808 | 0.10622767 |
Cells_AreaShape_Zernike_4_4 | URI1 | 0.8594038 | 0.00000022325116 | 0.56638284 | 0.10622767 |
Cytoplasm_Intensity_IntegratedIntensity_Brightfield | CCDC77 | 0.7849052 | 0.00000024464779 | 0.62066558 | 0.10622767 |
Cytoplasm_Intensity_IntegratedIntensity_RNA | CCDC77 | 0.8130849 | 0.00000025123060 | 0.63736600 | 0.10622767 |
Nuclei_Granularity_1_Brightfield | ZNF436 | -1.4254993 | 0.00000035578895 | 0.90262804 | 0.10778758 |
Cells_RadialDistribution_RadialCV_DNA_1of4 | LRFN1 | -1.4948284 | 0.00000036807610 | 0.93380024 | 0.10778758 |
Cytoplasm_Intensity_IntegratedIntensity_Brightfield | OVOL2 | 1.1530303 | 0.00000038237974 | 0.97008822 | 0.10778758 |

Top associations with variants having maf < 0.05

Feature | Gene | Beta | P | Bon-P | FDR-P |
---|---|---|---|---|---|
Cytoplasm_Intensity_IntegratedIntensity_Brightfield | C1orf64 | 1.3661365 | 0.000000005049320 | 0.01291923 | 0.009832509 |
Cells_Intensity_IntegratedIntensity_ER | C1orf64 | 1.3674117 | 0.000000007685827 | 0.01966502 | 0.009832509 |
Cytoplasm_AreaShape_Zernike_3_1 | ZNF436 | 1.7103445 | 0.000000022032614 | 0.05637282 | 0.018790941 |
Cells_AreaShape_Zernike_2_2 | NMS | 1.1757678 | 0.000000113493050 | 0.29038423 | 0.072596056 |
Cells_Intensity_MinIntensityEdge_AGP | GKAP1 | 1.3279471 | 0.000000199587257 | 0.51066555 | 0.091828661 |
Cytoplasm_Intensity_IntegratedIntensity_Brightfield | CCDC77 | 0.7849052 | 0.000000244647793 | 0.62595780 | 0.091828661 |
Cytoplasm_Intensity_IntegratedIntensity_RNA | CCDC77 | 0.8130849 | 0.000000251230601 | 0.64280062 | 0.091828661 |
Nuclei_Granularity_1_Brightfield | ZNF436 | -1.4254993 | 0.000000355788955 | 0.91032447 | 0.097835986 |
Cells_RadialDistribution_RadialCV_DNA_1of4 | LRFN1 | -1.4948284 | 0.000000368076102 | 0.94176246 | 0.097835986 |
Cytoplasm_Intensity_IntegratedIntensity_Brightfield | OVOL2 | 1.1530303 | 0.000000382379740 | 0.97835986 | 0.097835986 |

Basically, there are two more associations, both with the C1orf64 gene, when I take variants with maf < 0.05 instead of 0.01. The rest are the same.
an example from all cells
The first association uses variants with maf < 0.05, and the second uses variants with maf < 0.01. The difference arises from one variant that has maf 0.03 (present in 18 individuals) in our dataset and 0.048 in gnomAD.
variant maf in cmQTL vs gnomAD
Checking a few variants with high maf in cmQTL and low maf in gnomAD indicates that these might come from particular regions, e.g. very large indels or repetitive regions.
Are significant associations due to very few or a large number of donors with variants?
an example of association driven by 7 donors with variants in ZNF436 gene
Variants with maf < 0.05: the number of donors with variants in the top 10 associations is 6, 6, 7, 6, 4, 15, 15, 7, 4, 6.
plot for full data
Associations from permuted data, variants with maf < 0.05: the number of donors with variants in the top 10 associations is 5, 3, 18, 4, 4, 4, 3, 23, 19, 3.
ALL features X ALL genes (variants maf < 0.01)
235 permutations for top 10 associated features X ALL genes (variants maf < 0.01)
Since variants could segregate across different populations (ancestries), we investigate whether the observed associations are due to any specific ancestry. This should not be the case, since we control for ancestry via genotype PCs, but we can cross-check by redoing the rare variant association analysis without controlling for ancestry. NO
Total tests = number of features x number of genes. There are multiple ways to correct for multiple testing.
Results with maf < 0.05
Feature | Gene | Beta | P | Bon-P | FDR-P |
---|---|---|---|---|---|
Cytoplasm_Intensity_IntegratedIntensity_Brightfield | C1orf64 | 1.3661365 | 0.000000005049320 | 0.01291923 | 0.009832509 |
Cells_Intensity_IntegratedIntensity_ER | C1orf64 | 1.3674117 | 0.000000007685827 | 0.01966502 | 0.009832509 |
Cytoplasm_AreaShape_Zernike_3_1 | ZNF436 | 1.7103445 | 0.000000022032614 | 0.05637282 | 0.018790941 |
Cells_AreaShape_Zernike_2_2 | NMS | 1.1757678 | 0.000000113493050 | 0.29038423 | 0.072596056 |
Cells_Intensity_MinIntensityEdge_AGP | GKAP1 | 1.3279471 | 0.000000199587257 | 0.51066555 | 0.091828661 |
Cytoplasm_Intensity_IntegratedIntensity_Brightfield | CCDC77 | 0.7849052 | 0.000000244647793 | 0.62595780 | 0.091828661 |
Cytoplasm_Intensity_IntegratedIntensity_RNA | CCDC77 | 0.8130849 | 0.000000251230601 | 0.64280062 | 0.091828661 |
Nuclei_Granularity_1_Brightfield | ZNF436 | -1.4254993 | 0.000000355788955 | 0.91032447 | 0.097835986 |
Cells_RadialDistribution_RadialCV_DNA_1of4 | LRFN1 | -1.4948284 | 0.000000368076102 | 0.94176246 | 0.097835986 |
Cytoplasm_Intensity_IntegratedIntensity_Brightfield | OVOL2 | 1.1530303 | 0.000000382379740 | 0.97835986 | 0.097835986 |
(1) Use Bonferroni correction for multiple testing. This is the most uniform threshold, but it might be a bit too strict. I selected features for association testing with a correlation of up to 0.9, but this still leaves many features that are highly correlated and associated with the same genes (the same might be true for genes). Bonferroni correction treats every test as independent, so it might not be optimal when features or genes are correlated.
(2) Use FDR-adjusted p values. This could be complemented by a permutation test to evaluate whether the models are behaving well. Moreover, this approach would correct for both the multiple-gene and the multiple-feature testing burdens.
Implementation:
Step 0. Filter all feature:gene associations with FDR-adjusted p value < 0.1. From this step onwards, "features" means only these filtered features.
Step 1. Permute features together, while keeping inter-feature correlation intact.
Step 2. Calculate the association of features x all genes, and take the lowest p-value.
Step 3. Repeat step 1 and step 2 at least 1000 times, and take the p-value from each permutation round (so, 1000 p-values in total) and use them to make the background (null) distribution.
Step 4. Take those feature:gene associations from Step 0 which are within 5% of the right tail of the background distribution.
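The permutation steps above can be sketched as follows. This is a simplified illustration, not the model used in the analysis: it swaps the lowest p-value for the largest absolute Pearson correlation between a feature and a per-gene carrier indicator. In a covariate-free simple regression with fixed sample size the two give the same ranking, but that would not hold exactly once genotype PCs or other covariates are included. All names and the toy data are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)

def max_abs_r(features, burdens):
    """Strongest feature x gene association (stand-in for the lowest p-value)."""
    return max(abs(pearson(f, g)) for f in features.values() for g in burdens.values())

def permutation_null(features, burdens, n_perm=1000, seed=0):
    """Steps 1 to 3: shuffle donors identically across all features, record the max statistic."""
    rng = random.Random(seed)
    n = len(next(iter(features.values())))
    null = []
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        # One shared donor permutation keeps inter-feature correlation intact.
        permuted = {name: [vals[i] for i in order] for name, vals in features.items()}
        null.append(max_abs_r(permuted, burdens))
    return null

def empirical_p(observed, null):
    """Step 4: fraction of null statistics at least as extreme as the observed one."""
    return (1 + sum(s >= observed for s in null)) / (1 + len(null))

# Hypothetical toy data: 40 donors, 3 features, 5 genes with 0/1 carrier status.
rng = random.Random(1)
n_donors = 40
features = {f"feature_{j}": [rng.gauss(0, 1) for _ in range(n_donors)] for j in range(3)}
burdens = {f"gene_{k}": [1 if rng.random() < 0.15 else 0 for _ in range(n_donors)]
           for k in range(5)}

observed = max_abs_r(features, burdens)
null = permutation_null(features, burdens, n_perm=200, seed=0)
p = empirical_p(observed, null)
print(round(observed, 3), round(p, 3))
```

Taking the maximum over all features and genes in each round is what lets the resulting threshold account for both testing burdens at once.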
distribution of observed p values against background from permutations.
For the lowest p value in the background (Cytoplasm_Intensity_IntegratedIntensity_RNA and GATA2): this association has 5 donors with a variant in the GATA2 gene. 617/12301 (5.1%) associations have p < 0.05, as expected for a uniform distribution.
Another example of a very low p value in the background distribution: 598/12351 (4.9%) associations have p < 0.05.
Another example, in which the association is driven by 3 points: 729/12341 (5.8%) associations have p < 0.05.
(3) Use a feature-specific background distribution. This might be over-sophisticated. It would correct for the multiple-gene testing burden, but not for the multiple-feature burden. However, it could be used to complement FDR-adjusted p values, as in (2).
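As a point of reference for options (1) and (2), here is a minimal sketch of the two adjustments. It assumes the Bon-P column is a Bonferroni adjustment and the FDR-P column is a Benjamini-Hochberg adjustment; the issue does not name the FDR method, so the latter is an assumption. The toy p-values are hypothetical. In the maf < 0.05 table, dividing Bon-P by P in the first row gives roughly 2.56 million tests, which is in line with 213 features x ~12,000 genes.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical toy p-values; a real run would pass all feature x gene p-values.
pvals = [0.01, 0.04, 0.03, 0.005]
bon = bonferroni(pvals)
fdr = benjamini_hochberg(pvals)
print(bon)
print(fdr)
```

Both functions need the full set of p-values, not only the top hits, because the adjustment depends on the total number of tests and, for Benjamini-Hochberg, on the ranks.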
from original data
variants with maf < 0.01
@shntnu could you also please have a look at this document, and flag me if you see anything wrong with data or models? These plots might also be useful.
Slides are here 2020-09-15-group-jatin.