AllenWLynch / lisa

MIT License
16 stars 9 forks source link

what can I do when I have several bulk ATAC samples #2

Closed shangguandong1996 closed 3 years ago

shangguandong1996 commented 3 years ago

Hi, I am trying use lisa to my own ATAC samples. Now I have a Nm matrix and a input gene list, which N represent N ATAC peak region and M reprsent M samples. I am wordeing whether I can apply lisa in more one samples. I noticed that lisa offer the FromRegions* test, but I find it seems that it only works for one sample according to your user_guide. And I find in origin paper, Lisa use the L1 regression to find related samples instead of dealing with my own matched samples. So I am wondering whether you can give me some advices about how to use lisa to serveral samples.

And thanks for this wonderful tool.

Best wishes Guandong Shang

AllenWLynch commented 3 years ago

Hi Guandong,

Thanks for reaching out! The FromRegions test is new since the paper, actually. So the paper version describes the FromGenes test, which assumes the user only has an input genelist, and uses our CistromeDB to try to find DNAse samples to guess what accessibility looks like in that system. We added the FromRegions test for exactly your situation, when the user can supply their own accessibility, so LISA skips over the L1 regression step.

The FromRegions test can only find influential TFs one sample at a time, but if you run it separately on each ATAC sample with the same genelist, you will see the ranking of every TF. You can then compare TF p-values between samples. I’ve found it is very informative to plot the -log10(p-values) of TFs from one sample vs another to see if there are regulators that are specific to one condition, like so:

@.***

This example compares the -log10(p-values) of factors on two different genelists (T-cell unregulated genes vs down-regulated), but if you keep the same genelist, you can compare two different ATAC-seq conditions instead.

Let me know if I answered your question or if you have any more. Thanks for trying out LISA!

Allen

On May 11, 2021, at 3:52 AM, Shawn_Shang @.**@.>> wrote:

Hi, I am trying use lisa to my own ATAC samples. Now I have a N*m matrix and a input gene list, which N represent N ATAC peak region and M reprsent M samples. I am wordeing whether I can apply lisa in more one samples. I noticed that lisa offer the FromRegions test, but I find it seems that it only works for one sample according to your user_guide. And I find in origin paper, Lisa use the L1 regression to find related samples instead of dealing with my own matched samples. So I am wondering whether you can give me some advices about how to use lisa to serveral samples.

And thanks for this wonderful tool.

Best wishes Guandong Shang

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenWLynch%2Flisa%2Fissues%2F2&data=04%7C01%7C%7Cbbd63bac37274f6d4ebe08d91451a92b%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637563163221500847%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=cm6fD%2BknyrpTG47d9edWdSK5XLjKvyx7HvPyM63MpQk%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAE43JPEM2ZMY7R257U464ODTNDO2BANCNFSM44UUQMRQ&data=04%7C01%7C%7Cbbd63bac37274f6d4ebe08d91451a92b%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637563163221510799%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=GUHquWil3SMtH5L9D9yGO7F4UblSGS%2BYVRINsh9nAbw%3D&reserved=0.

shangguandong1996 commented 3 years ago

Thanks for your reply, Allen. That's a good idea for run separately on each ATAC sample. I am wondering whether I can use the L1 or ridge logistics regression for my own samples, which can help find the weight of each my mathced ATAC samples on my gene list. Then I can use the weight to identify which influential TFs results is more suitable for my gene list ? Or I just merge the all result according to the regression weight. Actually, my ATAC are 7 time point samples. And my gene list is a kmeans result from a matched 7 time point RNA-Seq samples. I am not sure whether above ideas is suitable.

Guandong Shang

shangguandong1996 commented 3 years ago

Hi, Allen. I just have another question. According to your FromRegion model picure The percent reprents the influence of TF. But I am wodering how to deal with the below condition, in which the whole RP is down. Then the percent will be the same between two condition.

图片

Or it just a ideal situation. After all, this is not the case for all the input gene sets.

AllenWLynch commented 3 years ago

Hi Guandong,

Currently, there isn’t a way to use L1 on your own samples, but usually the L1 just finds some useful samples with high weight and sets the other samples’ effects to zero. In your situation, since all the ATAC-seq samples are very relevant and from the same system, you are probably safe to just merge the samples with uniform weights to do the test.

That “combined” LISA test can identify important TFs in your experiment, and perhaps comparing the TF p-values on the first and last timepoints could identify if some TFs are more important to the start or end of your process.

As for your second question with the diagram, you’re right, the percent would be the same despite there being a lower overall accessibility around the gene. This situation could happen, but usually, we see accessibility around a gene leads to the rise of new peaks so new influential TFs.

-Allen

On May 11, 2021, at 10:55 PM, Shawn_Shang @.**@.>> wrote:

Hi, Allen. I just have another question. According to your FromRegion model [picure]https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenWLynch%2Flisa%2Fblob%2Fmaster%2Fdocs%2Fmodel_diagram.png&data=04%7C01%7C%7Ca3c32737354846cc4bb408d914f15701%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637563849037931926%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=mmJtKZ8Vhg6Ol9rz48b%2FqE0aZNOCH%2B2nGYNlXinFiUo%3D&reserved=0 The percent reprents the influence of TF. But I am wodering how to deal with the below condition, in which the whole RP is down. Then the percent will be the same between two condition.

[图片]https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fuser-images.githubusercontent.com%2F22555126%2F117911146-d5eea400-b30f-11eb-942f-3c7024cd2ebb.png&data=04%7C01%7C%7Ca3c32737354846cc4bb408d914f15701%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637563849037941887%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Gr0L%2Fcx6ODlu9wOruchLvPOXBkQeRgC3kttgCfaFgz0%3D&reserved=0

Or it just a ideal situation. After all, this is not the case for all the input gene sets.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenWLynch%2Flisa%2Fissues%2F2%23issuecomment-839393893&data=04%7C01%7C%7Ca3c32737354846cc4bb408d914f15701%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637563849037941887%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=59WPRNnT3wohWj7%2Fd9FC2INd30rn%2FtmRJOn2VdkrQhE%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAE43JPBX3HNZ7I66FADARJ3TNHUYNANCNFSM44UUQMRQ&data=04%7C01%7C%7Ca3c32737354846cc4bb408d914f15701%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637563849037951839%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=kQHmKPCskXdsuZWvWL8UYEZhWtPhcGCbo%2Bn%2FqsQklPw%3D&reserved=0.

shangguandong1996 commented 3 years ago

Hi, Allen. your mention about new peaks reminds me of a question. Shoud I use the merge peak set or just separate peak for separate samples. Actually, my N*m peak count matrix is normalized by DESeq2, which I believe may be more suitable compared with just CPM or some others normalized ways according to sequence depth. But the merge peak set will not find the the rise of new peaks, after all, all samples have same peak number. While separate peaks can not be normalized by DEseq2 or others.

Please forgive me if I do not clearly express my question :).

Best wishes Guandong Shang

AllenWLynch commented 3 years ago

No worries,

I haven't used DESeq for ATAC-seq normalization so I don't really know the properties. With the merged peakset, you could still provide the scores for each sample in terms of its normalized accessibility for each peak (the LISA regions test can take a "region_scores" parameter, which, for the same set of peaks, can reweight each peak for how accessible it is in the currently-tested sample). Sometimes that score may be zero if a peak wasn't open in one sample, while it was in another, which would lead to that "emergence" I talked about. Otherwise, if you don't use the region_scores parameter, you would have to provide only the peaks specific to each sample in order to observe differences between them. Does this answer you question?

Allen


From: Shawn_Shang @.> Sent: Wednesday, May 12, 2021 8:21 PM To: AllenWLynch/lisa @.> Cc: AllenWLynch @.>; Comment @.> Subject: Re: [AllenWLynch/lisa] what can I do when I have several bulk ATAC samples (#2)

Hi, Allen. your mention about new peaks reminds me of a question. Shoud I use the merge peak set or just separate peak for separate samples. Actually, my N*m peak count matrix is normalized by DESeq2, which I believe may be more suitable compared with just CPM or some others normalized ways according to sequence depth. But the merge peak set will not find the the rise of new peaks, after all, all samples have same peak number. While separate peaks can not be normalized by DEseq2 or others.

Please forgive me if I do not clearly express my question :).

Best wishes Guandong Shang

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenWLynch%2Flisa%2Fissues%2F2%23issuecomment-840215360&data=04%7C01%7C%7C862e735f64f747e1319708d915ad7674%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637564657021564693%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=jZ0BLieplbr%2F2YL3PGZ37M8QJMUTfXD69WF4mdF6fOk%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAE43JPAWXDC3GS54WDUNEJLTNMSSJANCNFSM44UUQMRQ&data=04%7C01%7C%7C862e735f64f747e1319708d915ad7674%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637564657021574651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=3F2iDiM6taBQuuHEShIeE5zHAsE49U6VG2kafCle%2BQM%3D&reserved=0.

shangguandong1996 commented 3 years ago

Thanks for your reply, Allen. It is pleasure to talk with you :) But I just thought a case. For example, most of input genes may only have one peak in their proximal, which take up most TF percent . And even peak A's score is 1e5 in sample 1 while peak A's score is 1 in sample 2. Then TF percent in this peak A associated gene is same in sample1 and sample2. Because these genes only have one influential peak. And no matter how this peak changes, the final TF percent is same, always near 100%. Then the TF p-value will be similar in sample1 and sample2, even two sample is different.

Please forgive me if I misunderstand something :)

Guandong Shang

AllenWLynch commented 3 years ago

So sorry for the late response! I flagged your message then forgot to respond after the weekend. You are correct, in that situation, the TF would not change in influence. LISA assumes we don't know anything about what the chromatin looks like in other situations, since we our goal was to say, given a situation, what are the top TFs.

Luckily, these types of problems are why we constrain LISA to use sets of genes, since pooling the TF influence across multiple genes vs the background should mitigate most of these gene-by-gene issues.

I haven't seen ATAC seq with most genes only having one proximal peak, is this really your case? If you have only a few thousand peaks across the genome, LISA may not work well for the reason you provided. In this situation, I would decrease the stringency of your peak calling, or increase the "rp_decay" paramter in LISA to like 20000 (so peaks 20Kb away will have 1/2 the influence of peaks on the promoter), which will expand the area-of-influence around each gene to include more peaks!

You are asking great questions!

Allen


From: Shawn_Shang @.> Sent: Saturday, May 15, 2021 3:25 AM To: AllenWLynch/lisa @.> Cc: AllenWLynch @.>; Comment @.> Subject: Re: [AllenWLynch/lisa] what can I do when I have several bulk ATAC samples (#2)

Thanks for your reply, Allen. It is pleasure to talk with you :) But I just thought a case. For example, most of input genes may only have one peak in their proximal, which take up most TF percent . And even peak A's score is 1e5 in sample 1 while peak A's score is 1 in sample 2. Then TF percent in this peak A associated gene is same in sample1 and sample2. Because these genes only have one influential peak. And no matter how this peak changes, the final TF percent is same, always near 100%. Then the TF p-value will be similar in sample1 and sample2, even two sample is different.

Please forgive me if I misunderstand something :)

Guandong Shang

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenWLynch%2Flisa%2Fissues%2F2%23issuecomment-841621990&data=04%7C01%7C%7Ce4bc42d1ac6b4b82535608d9177b080f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637566639447190035%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=G%2F20%2B0PUhSN74aA9CP1oxLMwSblPCB50j%2B%2FOY5HdB1k%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAE43JPEFED6UQ32LONW7G2TTNYVYNANCNFSM44UUQMRQ&data=04%7C01%7C%7Ce4bc42d1ac6b4b82535608d9177b080f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637566639447199988%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=jEgErDgyL9UXIbz7oY0BQ9ocw4MrUGA33Ftd6DUWZJo%3D&reserved=0.

shangguandong1996 commented 3 years ago

Thanks for your reply, Allen :). Actually, the "one proximal peak" are just my thinking :). The rp_decay may be a good option for me. And I am wondering if the TF rank result is extremely similar between two different condition, dose it means that my input gene set is not related with the differences between two condition. For example, these two condition have 1000+ diff peak while no gene in my ipnut gene sets are located near this diff peak.

By the way, I have another question :) (hhh, please forgive me). I noticed that your background gene select is random. So is it worth me to run Lisa several times, then combind p-value to get more robust result ? or I just trust one time's result ?

Best wishes Guandong Shang

AllenWLynch commented 3 years ago

Hi Guandong,

In that case, RP decay sounds like it should be what you're looking for. As far as very similar rankings, it's possible that there isn't really a significant change in accessibility with respect to the TFs across all of your genes, but some may be more relevant to "early" accessibility and some to "late" accessibility. If there is a way to split your genelist into smaller interesting groups you may see more specific TF modules.

For the random background question, if you are using the "regions" test, I think I have the default set to all genes in the "background_strategy" parameter. Sorry that wasn't written in the docs, so you wouldn't have to worry about running the test multiple times!

Are your TF rankings essentially on a line? Like no change in influence?

Allen


From: Shawn_Shang @.> Sent: Wednesday, May 19, 2021 8:52 PM To: AllenWLynch/lisa @.> Cc: AllenWLynch @.>; Comment @.> Subject: Re: [AllenWLynch/lisa] what can I do when I have several bulk ATAC samples (#2)

Thanks for your reply, Allen :). Actually, the "one proximal peak" are just my thinking :). The rp_decay may be a good option for me. And I am wondering if the TF rank result is extremely similar between two different condition, dose it means that my input gene set is not related with the differences between two condition. For example, these two condition have 1000+ diff peak while no gene in my ipnut gene sets are located near this diff peak.

By the way, I have another question :) (hhh, please forgive me). I noticed that your background gene select is random. So is it worth me to run Lisa several times, then combind p-value to get more robust result ? or I just trust one time's result ?

Best wishes Guandong Shang

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenWLynch%2Flisa%2Fissues%2F2%23issuecomment-844620526&data=04%7C01%7C%7C409aa89ae0b54383997108d91b31f261%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637570723592552427%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=erDbfh6wPlDqlo6SGNtDeS%2FaY1V4I%2BNYe4n0INDRVlA%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAE43JPBFAZEMBTWBVP3YIWTTORTOLANCNFSM44UUQMRQ&data=04%7C01%7C%7C409aa89ae0b54383997108d91b31f261%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637570723592562383%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=IYPi55G80Yw2mljL16J5vESLwJAlyheZIZPcIEFfvqE%3D&reserved=0.

shangguandong1996 commented 3 years ago

Hi, Allen. Actually, my species is plants :). I have tried to modify lisa so that I can use in my data. But I find it is hard for me to modify python code. I am a Python rookie╮(╯_╰)╭, so I am rewriting lisa's idea into R code. The extreme similar TF rankings between conditions and above thinking appears while writing R code. I am not sure whether my code have some probelm. I will check my code again.

Thanks again for your reply, it is a pleasure to talks with you. And thanks for this powerful tool, I have learnd a lot from it.

Best wishes Guandong Shang

shangguandong1996 commented 3 years ago

Hi, Allen. sorry I closed this isusse…… I have checked my code again, I think maybe there is no problem about my code. But I am not fully total sure about it, because I am not family with python code, and I have no ATAC about human in hands. so I can not validate my data result using lisa. But for the extreme similar TF rankings, I have a idea.

  1. we have to admit when the numbers of input genes and background genes are large, the final p-value will be so significant.
  2. when we check TF A percent in input genes and background genes, maybe 50 % input genes were hit by TF A while the percent of background is 10%. But when doing wilcox test, even 50% input genes has no TF hit, which means their TF A percent if 0, we still inclued these 0. Because we actually do within-samples comparison. This operation will be like compared input genes count with background genes count in a RNA-Seq sample.
  3. But the final p-value is so significant that the ture diff between conditions was hidden. therefore, the TF p-value is similar between conditions.

This is a example from my result

> library(dplyr)
> allSample_percent_df <- TF_pvalue@metadata$percent_df

# I select percent TF:ARR10 and sample:Hypocotyl_callus_rep1
# I split the percent result into input_genes(N=116) and background_genes(N=3000)

> allSample_percent_df %>% 
+   filter(TF_id == "ARR10",
+          sample == "Hypocotyl_callus_rep1") %>% 
+   group_split(gene_type) -> a
> a
<list_of<
  tbl_df<
    gene_id  : character
    percent  : double
    gene_type: character
    TF_id    : character
    sample   : character
  >
>[2]>
[[1]]
# A tibble: 3,000 x 5
   gene_id   percent gene_type        TF_id sample               
   <chr>       <dbl> <chr>            <chr> <chr>                
 1 AT1G01110       0 background_genes ARR10 Hypocotyl_callus_rep1
 2 AT1G01250       0 background_genes ARR10 Hypocotyl_callus_rep1
 3 AT1G01320       0 background_genes ARR10 Hypocotyl_callus_rep1
 4 AT1G01510       0 background_genes ARR10 Hypocotyl_callus_rep1
 5 AT1G01720       0 background_genes ARR10 Hypocotyl_callus_rep1
 6 AT1G01820       0 background_genes ARR10 Hypocotyl_callus_rep1
 7 AT1G01840       0 background_genes ARR10 Hypocotyl_callus_rep1
 8 AT1G01880       0 background_genes ARR10 Hypocotyl_callus_rep1
 9 AT1G01930       0 background_genes ARR10 Hypocotyl_callus_rep1
10 AT1G02130       0 background_genes ARR10 Hypocotyl_callus_rep1
# ... with 2,990 more rows

[[2]]
# A tibble: 116 x 5
   gene_id   percent gene_type   TF_id sample               
   <chr>       <dbl> <chr>       <chr> <chr>                
 1 AT1G06080  0      input_genes ARR10 Hypocotyl_callus_rep1
 2 AT1G06350  0      input_genes ARR10 Hypocotyl_callus_rep1
 3 AT1G08890  0      input_genes ARR10 Hypocotyl_callus_rep1
 4 AT1G12100  0      input_genes ARR10 Hypocotyl_callus_rep1
 5 AT1G12110  0      input_genes ARR10 Hypocotyl_callus_rep1
 6 AT1G12740  0      input_genes ARR10 Hypocotyl_callus_rep1
 7 AT1G19220  0      input_genes ARR10 Hypocotyl_callus_rep1
 8 AT1G19790  0.611  input_genes ARR10 Hypocotyl_callus_rep1
 9 AT1G19800  0.0437 input_genes ARR10 Hypocotyl_callus_rep1
10 AT1G22250  0      input_genes ARR10 Hypocotyl_callus_rep1
# ... with 106 more rows

> wilcox.test(a[[2]]$percent, a[[1]]$percent)$p.value
[1] 1.338277e-29
> summary(a[[1]]$percent)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000000 0.000000 0.000000 0.006596 0.000000 0.996826 
> summary(a[[2]]$percent)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.00000 0.00000 0.11784 0.02504 0.99058 

# I select percent TF:ARR10 and sample:Hypocotyl_organ_rep1
# I split the percent result into input_genes(N=116) and background_genes(N=3000)

> allSample_percent_df %>% 
+   filter(TF_id == "ARR10",
+          sample == "Hypocotyl_organ_rep1") %>% 
+   group_split(gene_type) -> b
> b
<list_of<
  tbl_df<
    gene_id  : character
    percent  : double
    gene_type: character
    TF_id    : character
    sample   : character
  >
>[2]>
[[1]]
# A tibble: 3,000 x 5
   gene_id   percent gene_type        TF_id sample              
   <chr>       <dbl> <chr>            <chr> <chr>               
 1 AT1G01110       0 background_genes ARR10 Hypocotyl_organ_rep1
 2 AT1G01250       0 background_genes ARR10 Hypocotyl_organ_rep1
 3 AT1G01320       0 background_genes ARR10 Hypocotyl_organ_rep1
 4 AT1G01510       0 background_genes ARR10 Hypocotyl_organ_rep1
 5 AT1G01720       0 background_genes ARR10 Hypocotyl_organ_rep1
 6 AT1G01820       0 background_genes ARR10 Hypocotyl_organ_rep1
 7 AT1G01840       0 background_genes ARR10 Hypocotyl_organ_rep1
 8 AT1G01880       0 background_genes ARR10 Hypocotyl_organ_rep1
 9 AT1G01930       0 background_genes ARR10 Hypocotyl_organ_rep1
10 AT1G02130       0 background_genes ARR10 Hypocotyl_organ_rep1
# ... with 2,990 more rows

[[2]]
# A tibble: 116 x 5
   gene_id   percent gene_type   TF_id sample              
   <chr>       <dbl> <chr>       <chr> <chr>               
 1 AT1G06080 0       input_genes ARR10 Hypocotyl_organ_rep1
 2 AT1G06350 0       input_genes ARR10 Hypocotyl_organ_rep1
 3 AT1G08890 0       input_genes ARR10 Hypocotyl_organ_rep1
 4 AT1G12100 0       input_genes ARR10 Hypocotyl_organ_rep1
 5 AT1G12110 0       input_genes ARR10 Hypocotyl_organ_rep1
 6 AT1G12740 0       input_genes ARR10 Hypocotyl_organ_rep1
 7 AT1G19220 0       input_genes ARR10 Hypocotyl_organ_rep1
 8 AT1G19790 0.440   input_genes ARR10 Hypocotyl_organ_rep1
 9 AT1G19800 0.00611 input_genes ARR10 Hypocotyl_organ_rep1
10 AT1G22250 0       input_genes ARR10 Hypocotyl_organ_rep1
# ... with 106 more rows

# This p-value is extremely same as above sample
# even distributions of percent in two condition are different
> wilcox.test(b[[2]]$percent, b[[1]]$percent)$p.value
[1] 1.016593e-29
> summary(b[[1]]$percent)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000000 0.000000 0.000000 0.003643 0.000000 0.933759 
> summary(b[[2]]$percent)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000000 0.000000 0.000000 0.075520 0.006122 0.874863 

My input genes is from diffPeak annotated gene between Hypocotyl_organ_rep1 and Hypocotyl_callus_rep1. In other words, my input genes is related with differncen between samples. you can see sumRP of these input gene are different

> TF_result@ExperimentList$sumRP[input_genes, ]
          Hypocotyl_organ_rep1 Hypocotyl_callus_rep1
AT5G52890           101.395172            1279.54008
AT1G12740            65.606304            1810.14359
AT1G34120           203.357678             733.48946
AT5G37732           125.337949            1932.86183
AT4G37740           196.505332             597.83896
AT5G35770           250.072106             747.90960
AT3G19200            91.024816            1302.22580
AT2G04480           154.202592            2060.45948
AT2G12646            59.636535            2006.71960
AT1G77765           403.269684            1033.17154
AT3G20840            64.491156             798.64804
AT5G66700           138.845602            1322.56864
AT2G04580            37.006194             361.34312
AT3G18550            70.475219             173.04537
AT2G18120            55.255489            1470.96453
AT1G46264           216.580896            1747.93401
AT1G36940           200.729102            1026.23640
AT5G51451           766.157332            1580.79444
AT1G19800           331.937375             965.48163
AT3G47600           160.706277             798.01884
AT2G23067           170.783332            1873.77644
AT4G36260            73.170546            1654.18899
AT2G22430          1149.473905            1605.54243
AT5G53950            84.860790             884.13810
AT4G12870            55.085374             407.35588
AT5G37730            93.360562            1228.36873
AT2G19970           220.463882             902.16751
AT2G30130           210.605813             700.04511
AT3G45610            84.139922             426.08257
AT2G03090           371.458152            2058.43024
AT1G26773           359.771998             853.50694
AT5G54510           818.285377            1162.14222
AT4G34170           193.343421             224.55508
AT2G22426          1723.689926            2851.63900
AT5G28495             6.153947              82.86117
AT5G28640           215.628894             976.30948
AT3G27400           180.378939             777.77699
AT3G50870           107.845431             512.52363
AT5G42635           121.719141             712.73841
AT1G06080            53.646855             739.31661
AT3G27503           154.111367            1058.35174
AT3G06019           147.308195             381.94281
AT5G17430            49.325227            1566.92487
AT2G30120           351.861100             564.63707
AT1G26770           171.859465            1508.73037
AT1G75530           421.126297            1489.79347
AT2G18360           107.573063            1029.01661
AT1G51190            54.781158            1501.32891
AT2G46870            43.293888             232.18096
AT3G56360          1714.201893            1541.05663
AT5G25000           142.559436             483.09200
AT2G37210            85.342369             323.84704
AT1G69530           370.235634            1957.73562
AT4G29680           820.638873            1472.55516
AT4G27950           366.843554            1623.35903
AT2G25980           162.614290             404.64441
AT5G53290           951.265221            1532.84002
AT5G24320           279.088580            1013.61733
AT1G19790            58.773442             882.08374
AT5G66690           120.463772             334.81094
AT5G54400           128.879373             572.67766
AT5G56970            79.250733             507.78302
AT4G34990           474.935672            1066.73594
AT3G06010           430.163554             546.57569
AT1G65930          1260.303655            1106.56030
AT1G51175            21.145120             424.34071
AT3G61980          1001.490229             713.02154
AT3G51060            62.557936            1417.48312
AT5G22890           335.104572            1631.43850
AT2G34655           503.514844            1102.44136
AT3G62610           782.516252            1397.33820
AT5G20820           191.297901             998.91166
AT4G27960           842.437791             818.28451
AT4G08770           181.998528            1523.30662
AT4G20320           442.227176            1034.25871
AT4G27260          1324.788649            1604.03150
AT1G68360           910.856780            1292.79945
AT1G70900           508.264107             521.00466
AT3G44710            20.955488             182.01842
AT5G28960            12.805698             182.41394
AT5G28490            87.748928             664.30420
AT4G19980           161.614165            2197.52127
AT3G27480            40.991548             284.90439
AT5G57350           303.882994             497.16172
AT1G73600           251.789429             506.97451
AT5G53500           190.889014             699.05178
AT2G01990           310.712576             807.11166
AT5G50300           103.845801             762.44324
AT1G70370           599.142900             421.12218
AT5G04690           108.197990             544.91675
AT5G41315            68.564621             269.06803
AT1G69820            48.138959             151.84389
AT2G42835            16.751711             126.98859
AT3G22540           226.218095             679.63165
AT1G06350            22.100291             122.92428
AT5G20240           483.595468             537.69011
AT2G43590           259.223670             254.65521
AT1G75520           105.536454             911.40449
AT2G40070           210.501033             231.07452
AT1G19220           931.272455            1937.11997
AT1G12100           141.638170             103.76606
AT1G22250             7.397493              18.39594
AT1G08890          1007.841916             991.38129
AT5G02550           147.682265             522.35266
AT3G02242            94.518155             478.15495
AT3G05165           286.305368             238.12595
AT5G23260            55.835849             129.33527
AT3G26747           221.217016             405.30635
AT4G02910            52.271277              51.73700
AT3G20460           235.253169             422.25544
AT1G12110           182.314172            1072.27956

I believe when comparing TF A percents of input genes between condition, we should drop 0 percent genes. after all, these genes are not hit by TF A. Including these 0 percent genes will bias the final result. I think this is part of the reason why p-value of TF A is similar between two condition.

if we just compare percent of input genes between samples, we will get no significant result

> ARR_value_a <- a[[2]]
> ARR_value_b <- b[[2]]
> wilcox.test(ARR_value_a$percent, ARR_value_b$percent)

    Wilcoxon rank sum test with continuity correction

data:  ARR_value_a$percent and ARR_value_b$percent
W = 6889.5, p-value = 0.7081
alternative hypothesis: true location shift is not equal to 0

if we can throw percent 0 genes, the result will be better.

> wilcox.test(subset(ARR_value_a, percent > 0)$percent,
+             subset(ARR_value_b, percent > 0)$percent)

    Wilcoxon rank sum exact test

data:  subset(ARR_value_a, percent > 0)$percent and subset(ARR_value_b, percent > 0)$percent
W = 922, p-value = 0.1079
alternative hypothesis: true location shift is not equal to 0

Please forgive me if I confuse you or I misunderstand something :)

Best wishes

Guandong Shang

shangguandong1996 commented 3 years ago

Just another example about TF B

> TF_fullInfo_list
<list_of<
  tbl_df<
    gene_id  : character
    percent  : double
    gene_type: character
    TF_id    : character
    sample   : character
  >
>[4]>
[[1]]
# A tibble: 3,000 x 5
   gene_id   percent gene_type        TF_id sample               
   <chr>       <dbl> <chr>            <chr> <chr>                
 1 AT1G01110       0 background_genes LEC1  Hypocotyl_callus_rep1
 2 AT1G01250       0 background_genes LEC1  Hypocotyl_callus_rep1
 3 AT1G01320       0 background_genes LEC1  Hypocotyl_callus_rep1
 4 AT1G01510       0 background_genes LEC1  Hypocotyl_callus_rep1
 5 AT1G01720       0 background_genes LEC1  Hypocotyl_callus_rep1
 6 AT1G01820       0 background_genes LEC1  Hypocotyl_callus_rep1
 7 AT1G01840       0 background_genes LEC1  Hypocotyl_callus_rep1
 8 AT1G01880       0 background_genes LEC1  Hypocotyl_callus_rep1
 9 AT1G01930       0 background_genes LEC1  Hypocotyl_callus_rep1
10 AT1G02130       0 background_genes LEC1  Hypocotyl_callus_rep1
# ... with 2,990 more rows

[[2]]
# A tibble: 3,000 x 5
   gene_id   percent gene_type        TF_id sample              
   <chr>       <dbl> <chr>            <chr> <chr>               
 1 AT1G01110       0 background_genes LEC1  Hypocotyl_organ_rep1
 2 AT1G01250       0 background_genes LEC1  Hypocotyl_organ_rep1
 3 AT1G01320       0 background_genes LEC1  Hypocotyl_organ_rep1
 4 AT1G01510       0 background_genes LEC1  Hypocotyl_organ_rep1
 5 AT1G01720       0 background_genes LEC1  Hypocotyl_organ_rep1
 6 AT1G01820       0 background_genes LEC1  Hypocotyl_organ_rep1
 7 AT1G01840       0 background_genes LEC1  Hypocotyl_organ_rep1
 8 AT1G01880       0 background_genes LEC1  Hypocotyl_organ_rep1
 9 AT1G01930       0 background_genes LEC1  Hypocotyl_organ_rep1
10 AT1G02130       0 background_genes LEC1  Hypocotyl_organ_rep1
# ... with 2,990 more rows

[[3]]
# A tibble: 235 x 5
   gene_id   percent gene_type   TF_id sample               
   <chr>       <dbl> <chr>       <chr> <chr>                
 1 AT1G04443   0.562 input_genes LEC1  Hypocotyl_callus_rep1
 2 AT1G04610   0     input_genes LEC1  Hypocotyl_callus_rep1
 3 AT1G04983   0.958 input_genes LEC1  Hypocotyl_callus_rep1
 4 AT1G05153   0     input_genes LEC1  Hypocotyl_callus_rep1
 5 AT1G06080   0.923 input_genes LEC1  Hypocotyl_callus_rep1
 6 AT1G06090   0.337 input_genes LEC1  Hypocotyl_callus_rep1
 7 AT1G06350   0     input_genes LEC1  Hypocotyl_callus_rep1
 8 AT1G06990   0.685 input_genes LEC1  Hypocotyl_callus_rep1
 9 AT1G07937   0.677 input_genes LEC1  Hypocotyl_callus_rep1
10 AT1G08133   0.328 input_genes LEC1  Hypocotyl_callus_rep1
# ... with 225 more rows

[[4]]
# A tibble: 235 x 5
   gene_id   percent gene_type   TF_id sample              
   <chr>       <dbl> <chr>       <chr> <chr>               
 1 AT1G04443  0.0848 input_genes LEC1  Hypocotyl_organ_rep1
 2 AT1G04610  0      input_genes LEC1  Hypocotyl_organ_rep1
 3 AT1G04983  0.608  input_genes LEC1  Hypocotyl_organ_rep1
 4 AT1G05153  0      input_genes LEC1  Hypocotyl_organ_rep1
 5 AT1G06080  0.400  input_genes LEC1  Hypocotyl_organ_rep1
 6 AT1G06090  0.0978 input_genes LEC1  Hypocotyl_organ_rep1
 7 AT1G06350  0      input_genes LEC1  Hypocotyl_organ_rep1
 8 AT1G06990  0.181  input_genes LEC1  Hypocotyl_organ_rep1
 9 AT1G07937  0.255  input_genes LEC1  Hypocotyl_organ_rep1
10 AT1G08133  0.0998 input_genes LEC1  Hypocotyl_organ_rep1
# ... with 225 more rows

# Hypocotyl_callus_rep1
# input genes vs background genes
> wilcox.test(TF_fullInfo_list[[3]]$percent, TF_fullInfo_list[[1]]$percent, alternative = "greater")$p.value
[1] 4.148777e-156

# Hypocotyl_organ_rep1 
# input genes vs backaground genes
> wilcox.test(TF_fullInfo_list[[4]]$percent, TF_fullInfo_list[[2]]$percent, alternative = "greater")$p.value
[1] 1.176709e-154

you can see two result looks similar, but the distribution of percent is very different

> summary(TF_fullInfo_list[[3]]$percent)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.07185 0.41670 0.44498 0.82030 1.00000 
> summary(TF_fullInfo_list[[4]]$percent)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.01147 0.11733 0.26979 0.49835 1.00000 

# it seems that TF B has more influnec on input genes in # Hypocotyl_callus_rep1
> median(TF_fullInfo_list[[3]]$percent) - median(TF_fullInfo_list[[1]]$percent)
[1] 0.4167038

> median(TF_fullInfo_list[[4]]$percent) - median(TF_fullInfo_list[[2]]$percent)
[1] 0.1173348

Best wishes Guandong Shang

shangguandong1996 commented 3 years ago

Sorry I misused test data in above example, but the final conclusion are same. The TF p-value or rank is robust in different condition when input genes are same. But I did find some thing interesting about wilcox.test in https://github.com/AllenWLynch/lisa/issues/3

AllenWLynch commented 3 years ago

Hi Guandong,

I see the issue with the wilcoxon test in this example, and it may be a result of the sparseness of your proximal regions compared to human/mouse samples, where largely ISD scores are > 0, atleast for most query genes. Keeping zeros between query and background genes and removing them ask two different questions. Keeping zeros asks are query genes more likely to be influenced by this TF's binding than background genes (the core LISA question), while removing the zeros asks "given genes that may be influenced by this TF, are query genes more influenced than background genes". I actually think both questions are distinct, but meaningful, and will produce different results for TF p-values. In your case, it is your discretion which question you answer, but it appears that the later may be more appropriate.

This is a really interesting find. Also, I admire your resolved to implement the code in R! Do you have your own ChIP-seq examples as well?

Allen

shangguandong1996 commented 3 years ago

Hi, Allen Thanks, I believe sparseness of the proximal regions is the the answer of similar p-value !!! My species is Arabidopsis thaliana, which have 3w+ genes while only 2-3w peak. Its genome is only 100+ MB. And the decay dist I set is 1000bp, which I believe is more suitable for Arabidopsis thaliana. After all, there is no significant reason shows that a general long-distance interaction appears on At's genome. Indeed, if I set 1000bp as the decay dist, condiser the small genome and peak numebr, most gene may only have "one or two proximal peak". So it will makes most query genes have score > 0, and the percent will not change even gene's RP have changed a lot. As for the ChIP-Seq data, I am very "jealous" that humans/mouse have good databases like Cistrome DB :). Even Arabidopsis thaliana is the model system for plant, its database for TF ChIP-Seq is poor. So most time the data I used for test is the result of motif-scan for my ATAC-peak.

Best wishes Guandong Shang