How to define cpg site at overlapping cis windows?

cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium

https://cumc.github.io/xqtl-protocol/

MIT License

41 stars 43 forks source link

How to define cpg site at overlapping cis windows? #315

Open hsun3163 opened 2 years ago

hsun3163 commented 2 years ago

As per our discussion, we will define the molecular phenotype object of cpg site by a cis window around the TSS of each gene. However, one of the problems of this approach is how to handle the cpg site that is presented at the intersection of two cis windows.

Previously in the TWAS project, we handle the similar problem with SNP by assigning them to different genes simultaneously and creating two SNP-gene pairs.

However, in the tensorQTL setting, the phenotype vs phenotype-group relationship is required to be unique. As indicates below, each ID can only be included into 1 phenotype group, and assigned with 1 TSS (and hence 1 cis window)

chr	start	end	ID	Group
1	1000	1001	A	G1
1	1000	1001	B	G1
1	2000	2001	C	G2
1	2000	2001	D	G2

gaow commented 2 years ago

@hsun3163 what happens if this is violated?

As discussed, we may want to correct for multiple testing and get gene level results using other methods. So this may not be a problem after all -- we only analyze each CpG and later annotate them at data integration analysis.

You can also try posting at tensorQTL repo and ask?

hsun3163 commented 2 years ago

@hsun3163 what happens if this is violated?

As discussed, we may want to correct for multiple testing and get gene level results using other methods. So this may not be a problem after all -- we only analyze each CpG and later annotate them at data integration analysis.

You can also try posting at tensorQTL repo and ask?

If this is violates, then the phenotype group mechanism cannot be used (An error will be thrown), and hence remove the need to annotate the methylation matrix with genes beforehand (we could still do it but don't have to do it before association testing) .

hsun3163 commented 2 years ago

As it turn out, SESAME annotation seems to have a good annotation that we can rely on, as each cpg site mapped to only 1 gene`


> all(pb_anno_tb%>%group_by( `pb_anno %>% names`)%>%count()%>%pull(n)  == 1)
[1] TRUE

pb_anno_tb is the annotation of the first 20000 cp probes in the HM450 platform,

hsun3163 commented 2 years ago

After a more systematic review of the the first 20000 probes, following are the distribution of number of genes that each probe was mapped to:

>sesameData_annoProbes(PB,column = c("gene_id")) -> ngene
> map((pb_anno_tb$gene_id),~read.table(text = .x, sep =  ",")%>%ncol) -> ngene
> tibble(ngene = ngene%>%unlist)%>%dplyr::group_by(ngene)%>%dplyr::count(ngene)
# A tibble: 16 x 2
# Groups:   ngene [16]
   ngene     n
   <int> <int>
 1     1 17389
 2     2  2372
 3     3   191
 4     4    23
 5     5     5
 6     6     2
 7     8     4
 8    10     1
 9    11     1
10    12     1
11    13     2
12    14     2
13    19     2
14    20     2
15    21     1
16    22     2

ngene is the number of gene that are in the gene_id column of pb_anno_tb

gaow commented 2 years ago

So 2 probes mapped to 22 genes? I wonder what the maximum number is if you consider all the probes

hsun3163 commented 2 years ago

So 2 probes mapped to 22 genes? I wonder what the maximum number is if you consider all the probes

Yes, both of them mapped to the same Protocadherin Gamma Subfamily of genes:

> pb_anno_tb["cg00104845"]$gene_name
[1] "PCDHGA1,PCDHGA2,PCDHGA3,PCDHGB1,PCDHGA4,PCDHGB2,PCDHGA5,PCDHGB3,PCDHGA6,PCDHGA7,PCDHGB4,PCDHGA8,PCDHGB5,PCDHGA9,PCDHGB6,PCDHGA10,PCDHGB7,PCDHGA11,PCDHGA12,PCDHGC3,PCDHGC4,PCDHGC5"
> cg00283283
> pb_anno_tb["cg00283283"]$gene_name
[1] "PCDHGA1,PCDHGA2,PCDHGA3,PCDHGB1,PCDHGA4,PCDHGB2,PCDHGA5,PCDHGB3,PCDHGA6,PCDHGA7,PCDHGB4,PCDHGA8,PCDHGB5,PCDHGA9,PCDHGB6,PCDHGA10,PCDHGB7,PCDHGA11,PCDHGA12,PCDHGC3,PCDHGC4,PCDHGC5"

hsun3163 commented 2 years ago

Until we find a better way to solve this from the missmethyl package, let's just output the cpg beta and M bed.gz files by their original position and make a companion file documenting the annotation with gene_id