Open hsun3163 opened 2 years ago
@hsun3163 what happens if this is violated?
As discussed, we may want to correct for multiple testing and get gene level results using other methods. So this may not be a problem after all -- we only analyze each CpG and later annotate them at data integration analysis.
You can also try posting at tensorQTL repo and ask?
@hsun3163 what happens if this is violated?
As discussed, we may want to correct for multiple testing and get gene level results using other methods. So this may not be a problem after all -- we only analyze each CpG and later annotate them at data integration analysis.
You can also try posting at tensorQTL repo and ask?
If this is violates, then the phenotype group mechanism cannot be used (An error will be thrown), and hence remove the need to annotate the methylation matrix with genes beforehand (we could still do it but don't have to do it before association testing) .
As it turn out, SESAME annotation seems to have a good annotation that we can rely on, as each cpg site mapped to only 1 gene`
> all(pb_anno_tb%>%group_by( `pb_anno %>% names`)%>%count()%>%pull(n) == 1)
[1] TRUE
pb_anno_tb
is the annotation of the first 20000 cp probes in the HM450 platform,
After a more systematic review of the the first 20000 probes, following are the distribution of number of genes that each probe was mapped to:
>sesameData_annoProbes(PB,column = c("gene_id")) -> ngene
> map((pb_anno_tb$gene_id),~read.table(text = .x, sep = ",")%>%ncol) -> ngene
> tibble(ngene = ngene%>%unlist)%>%dplyr::group_by(ngene)%>%dplyr::count(ngene)
# A tibble: 16 x 2
# Groups: ngene [16]
ngene n
<int> <int>
1 1 17389
2 2 2372
3 3 191
4 4 23
5 5 5
6 6 2
7 8 4
8 10 1
9 11 1
10 12 1
11 13 2
12 14 2
13 19 2
14 20 2
15 21 1
16 22 2
ngene is the number of gene that are in the gene_id column of pb_anno_tb
So 2 probes mapped to 22 genes? I wonder what the maximum number is if you consider all the probes
So 2 probes mapped to 22 genes? I wonder what the maximum number is if you consider all the probes
Yes, both of them mapped to the same Protocadherin Gamma Subfamily of genes:
> pb_anno_tb["cg00104845"]$gene_name
[1] "PCDHGA1,PCDHGA2,PCDHGA3,PCDHGB1,PCDHGA4,PCDHGB2,PCDHGA5,PCDHGB3,PCDHGA6,PCDHGA7,PCDHGB4,PCDHGA8,PCDHGB5,PCDHGA9,PCDHGB6,PCDHGA10,PCDHGB7,PCDHGA11,PCDHGA12,PCDHGC3,PCDHGC4,PCDHGC5"
> cg00283283
> pb_anno_tb["cg00283283"]$gene_name
[1] "PCDHGA1,PCDHGA2,PCDHGA3,PCDHGB1,PCDHGA4,PCDHGB2,PCDHGA5,PCDHGB3,PCDHGA6,PCDHGA7,PCDHGB4,PCDHGA8,PCDHGB5,PCDHGA9,PCDHGB6,PCDHGA10,PCDHGB7,PCDHGA11,PCDHGA12,PCDHGC3,PCDHGC4,PCDHGC5"
Until we find a better way to solve this from the missmethyl package, let's just output the cpg beta and M bed.gz files by their original position and make a companion file documenting the annotation with gene_id
As per our discussion, we will define the molecular phenotype object of cpg site by a cis window around the TSS of each gene. However, one of the problems of this approach is how to handle the cpg site that is presented at the intersection of two cis windows.
Previously in the TWAS project, we handle the similar problem with SNP by assigning them to different genes simultaneously and creating two SNP-gene pairs.
However, in the tensorQTL setting, the phenotype vs phenotype-group relationship is required to be unique. As indicates below, each ID can only be included into 1 phenotype group, and assigned with 1 TSS (and hence 1 cis window)