Closed shangguandong1996 closed 3 years ago
Hi Guandong,
Thanks for reaching out! The FromRegions test is new since the paper, actually. So the paper version describes the FromGenes test, which assumes the user only has an input genelist, and uses our CistromeDB to try to find DNAse samples to guess what accessibility looks like in that system. We added the FromRegions test for exactly your situation, when the user can supply their own accessibility, so LISA skips over the L1 regression step.
The FromRegions test can only find influential TFs one sample at a time, but if you run it separately on each ATAC sample with the same genelist, you will see the ranking of every TF. You can then compare TF p-values between samples. I’ve found it is very informative to plot the -log10(p-values) of TFs from one sample vs another to see if there are regulators that are specific to one condition, like so:
@.***
This example compares the -log10(p-values) of factors on two different genelists (T-cell unregulated genes vs down-regulated), but if you keep the same genelist, you can compare two different ATAC-seq conditions instead.
Let me know if I answered your question or if you have any more. Thanks for trying out LISA!
Allen
On May 11, 2021, at 3:52 AM, Shawn_Shang @.**@.>> wrote:
Hi, I am trying use lisa to my own ATAC samples. Now I have a N*m matrix and a input gene list, which N represent N ATAC peak region and M reprsent M samples. I am wordeing whether I can apply lisa in more one samples. I noticed that lisa offer the FromRegions test, but I find it seems that it only works for one sample according to your user_guide. And I find in origin paper, Lisa use the L1 regression to find related samples instead of dealing with my own matched samples. So I am wondering whether you can give me some advices about how to use lisa to serveral samples.
And thanks for this wonderful tool.
Best wishes Guandong Shang
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenWLynch%2Flisa%2Fissues%2F2&data=04%7C01%7C%7Cbbd63bac37274f6d4ebe08d91451a92b%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637563163221500847%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=cm6fD%2BknyrpTG47d9edWdSK5XLjKvyx7HvPyM63MpQk%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAE43JPEM2ZMY7R257U464ODTNDO2BANCNFSM44UUQMRQ&data=04%7C01%7C%7Cbbd63bac37274f6d4ebe08d91451a92b%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637563163221510799%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=GUHquWil3SMtH5L9D9yGO7F4UblSGS%2BYVRINsh9nAbw%3D&reserved=0.
Thanks for your reply, Allen. That's a good idea for run separately on each ATAC sample. I am wondering whether I can use the L1 or ridge logistics regression for my own samples, which can help find the weight of each my mathced ATAC samples on my gene list. Then I can use the weight to identify which influential TFs results is more suitable for my gene list ? Or I just merge the all result according to the regression weight. Actually, my ATAC are 7 time point samples. And my gene list is a kmeans result from a matched 7 time point RNA-Seq samples. I am not sure whether above ideas is suitable.
Guandong Shang
Hi, Allen. I just have another question. According to your FromRegion model The percent reprents the influence of TF. But I am wodering how to deal with the below condition, in which the whole RP is down. Then the percent will be the same between two condition.
Or it just a ideal situation. After all, this is not the case for all the input gene sets.
Hi Guandong,
Currently, there isn’t a way to use L1 on your own samples, but usually the L1 just finds some useful samples with high weight and sets the other samples’ effects to zero. In your situation, since all the ATAC-seq samples are very relevant and from the same system, you are probably safe to just merge the samples with uniform weights to do the test.
That “combined” LISA test can identify important TFs in your experiment, and perhaps comparing the TF p-values on the first and last timepoints could identify if some TFs are more important to the start or end of your process.
As for your second question with the diagram, you’re right, the percent would be the same despite there being a lower overall accessibility around the gene. This situation could happen, but usually, we see accessibility around a gene leads to the rise of new peaks so new influential TFs.
-Allen
On May 11, 2021, at 10:55 PM, Shawn_Shang @.**@.>> wrote:
Hi, Allen. I just have another question. According to your FromRegion model [picure]https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenWLynch%2Flisa%2Fblob%2Fmaster%2Fdocs%2Fmodel_diagram.png&data=04%7C01%7C%7Ca3c32737354846cc4bb408d914f15701%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637563849037931926%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=mmJtKZ8Vhg6Ol9rz48b%2FqE0aZNOCH%2B2nGYNlXinFiUo%3D&reserved=0 The percent reprents the influence of TF. But I am wodering how to deal with the below condition, in which the whole RP is down. Then the percent will be the same between two condition.
Or it just a ideal situation. After all, this is not the case for all the input gene sets.
— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenWLynch%2Flisa%2Fissues%2F2%23issuecomment-839393893&data=04%7C01%7C%7Ca3c32737354846cc4bb408d914f15701%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637563849037941887%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=59WPRNnT3wohWj7%2Fd9FC2INd30rn%2FtmRJOn2VdkrQhE%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAE43JPBX3HNZ7I66FADARJ3TNHUYNANCNFSM44UUQMRQ&data=04%7C01%7C%7Ca3c32737354846cc4bb408d914f15701%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637563849037951839%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=kQHmKPCskXdsuZWvWL8UYEZhWtPhcGCbo%2Bn%2FqsQklPw%3D&reserved=0.
Hi, Allen. your mention about new peaks reminds me of a question. Shoud I use the merge peak set or just separate peak for separate samples. Actually, my N*m peak count matrix is normalized by DESeq2, which I believe may be more suitable compared with just CPM or some others normalized ways according to sequence depth. But the merge peak set will not find the the rise of new peaks, after all, all samples have same peak number. While separate peaks can not be normalized by DEseq2 or others.
Please forgive me if I do not clearly express my question :).
Best wishes Guandong Shang
No worries,
I haven't used DESeq for ATAC-seq normalization so I don't really know the properties. With the merged peakset, you could still provide the scores for each sample in terms of its normalized accessibility for each peak (the LISA regions test can take a "region_scores" parameter, which, for the same set of peaks, can reweight each peak for how accessible it is in the currently-tested sample). Sometimes that score may be zero if a peak wasn't open in one sample, while it was in another, which would lead to that "emergence" I talked about. Otherwise, if you don't use the region_scores parameter, you would have to provide only the peaks specific to each sample in order to observe differences between them. Does this answer you question?
Allen
From: Shawn_Shang @.> Sent: Wednesday, May 12, 2021 8:21 PM To: AllenWLynch/lisa @.> Cc: AllenWLynch @.>; Comment @.> Subject: Re: [AllenWLynch/lisa] what can I do when I have several bulk ATAC samples (#2)
Hi, Allen. your mention about new peaks reminds me of a question. Shoud I use the merge peak set or just separate peak for separate samples. Actually, my N*m peak count matrix is normalized by DESeq2, which I believe may be more suitable compared with just CPM or some others normalized ways according to sequence depth. But the merge peak set will not find the the rise of new peaks, after all, all samples have same peak number. While separate peaks can not be normalized by DEseq2 or others.
Please forgive me if I do not clearly express my question :).
Best wishes Guandong Shang
— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenWLynch%2Flisa%2Fissues%2F2%23issuecomment-840215360&data=04%7C01%7C%7C862e735f64f747e1319708d915ad7674%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637564657021564693%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=jZ0BLieplbr%2F2YL3PGZ37M8QJMUTfXD69WF4mdF6fOk%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAE43JPAWXDC3GS54WDUNEJLTNMSSJANCNFSM44UUQMRQ&data=04%7C01%7C%7C862e735f64f747e1319708d915ad7674%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637564657021574651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=3F2iDiM6taBQuuHEShIeE5zHAsE49U6VG2kafCle%2BQM%3D&reserved=0.
Thanks for your reply, Allen. It is pleasure to talk with you :) But I just thought a case. For example, most of input genes may only have one peak in their proximal, which take up most TF percent . And even peak A's score is 1e5 in sample 1 while peak A's score is 1 in sample 2. Then TF percent in this peak A associated gene is same in sample1 and sample2. Because these genes only have one influential peak. And no matter how this peak changes, the final TF percent is same, always near 100%. Then the TF p-value will be similar in sample1 and sample2, even two sample is different.
Please forgive me if I misunderstand something :)
Guandong Shang
So sorry for the late response! I flagged your message then forgot to respond after the weekend. You are correct, in that situation, the TF would not change in influence. LISA assumes we don't know anything about what the chromatin looks like in other situations, since we our goal was to say, given a situation, what are the top TFs.
Luckily, these types of problems are why we constrain LISA to use sets of genes, since pooling the TF influence across multiple genes vs the background should mitigate most of these gene-by-gene issues.
I haven't seen ATAC seq with most genes only having one proximal peak, is this really your case? If you have only a few thousand peaks across the genome, LISA may not work well for the reason you provided. In this situation, I would decrease the stringency of your peak calling, or increase the "rp_decay" paramter in LISA to like 20000 (so peaks 20Kb away will have 1/2 the influence of peaks on the promoter), which will expand the area-of-influence around each gene to include more peaks!
You are asking great questions!
Allen
From: Shawn_Shang @.> Sent: Saturday, May 15, 2021 3:25 AM To: AllenWLynch/lisa @.> Cc: AllenWLynch @.>; Comment @.> Subject: Re: [AllenWLynch/lisa] what can I do when I have several bulk ATAC samples (#2)
Thanks for your reply, Allen. It is pleasure to talk with you :) But I just thought a case. For example, most of input genes may only have one peak in their proximal, which take up most TF percent . And even peak A's score is 1e5 in sample 1 while peak A's score is 1 in sample 2. Then TF percent in this peak A associated gene is same in sample1 and sample2. Because these genes only have one influential peak. And no matter how this peak changes, the final TF percent is same, always near 100%. Then the TF p-value will be similar in sample1 and sample2, even two sample is different.
Please forgive me if I misunderstand something :)
Guandong Shang
— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenWLynch%2Flisa%2Fissues%2F2%23issuecomment-841621990&data=04%7C01%7C%7Ce4bc42d1ac6b4b82535608d9177b080f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637566639447190035%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=G%2F20%2B0PUhSN74aA9CP1oxLMwSblPCB50j%2B%2FOY5HdB1k%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAE43JPEFED6UQ32LONW7G2TTNYVYNANCNFSM44UUQMRQ&data=04%7C01%7C%7Ce4bc42d1ac6b4b82535608d9177b080f%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637566639447199988%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=jEgErDgyL9UXIbz7oY0BQ9ocw4MrUGA33Ftd6DUWZJo%3D&reserved=0.
Thanks for your reply, Allen :). Actually, the "one proximal peak" are just my thinking :). The rp_decay may be a good option for me. And I am wondering if the TF rank result is extremely similar between two different condition, dose it means that my input gene set is not related with the differences between two condition. For example, these two condition have 1000+ diff peak while no gene in my ipnut gene sets are located near this diff peak.
By the way, I have another question :) (hhh, please forgive me). I noticed that your background gene select is random. So is it worth me to run Lisa several times, then combind p-value to get more robust result ? or I just trust one time's result ?
Best wishes Guandong Shang
Hi Guandong,
In that case, RP decay sounds like it should be what you're looking for. As far as very similar rankings, it's possible that there isn't really a significant change in accessibility with respect to the TFs across all of your genes, but some may be more relevant to "early" accessibility and some to "late" accessibility. If there is a way to split your genelist into smaller interesting groups you may see more specific TF modules.
For the random background question, if you are using the "regions" test, I think I have the default set to all genes in the "background_strategy" parameter. Sorry that wasn't written in the docs, so you wouldn't have to worry about running the test multiple times!
Are your TF rankings essentially on a line? Like no change in influence?
Allen
From: Shawn_Shang @.> Sent: Wednesday, May 19, 2021 8:52 PM To: AllenWLynch/lisa @.> Cc: AllenWLynch @.>; Comment @.> Subject: Re: [AllenWLynch/lisa] what can I do when I have several bulk ATAC samples (#2)
Thanks for your reply, Allen :). Actually, the "one proximal peak" are just my thinking :). The rp_decay may be a good option for me. And I am wondering if the TF rank result is extremely similar between two different condition, dose it means that my input gene set is not related with the differences between two condition. For example, these two condition have 1000+ diff peak while no gene in my ipnut gene sets are located near this diff peak.
By the way, I have another question :) (hhh, please forgive me). I noticed that your background gene select is random. So is it worth me to run Lisa several times, then combind p-value to get more robust result ? or I just trust one time's result ?
Best wishes Guandong Shang
— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAllenWLynch%2Flisa%2Fissues%2F2%23issuecomment-844620526&data=04%7C01%7C%7C409aa89ae0b54383997108d91b31f261%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637570723592552427%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=erDbfh6wPlDqlo6SGNtDeS%2FaY1V4I%2BNYe4n0INDRVlA%3D&reserved=0, or unsubscribehttps://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAE43JPBFAZEMBTWBVP3YIWTTORTOLANCNFSM44UUQMRQ&data=04%7C01%7C%7C409aa89ae0b54383997108d91b31f261%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637570723592562383%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=IYPi55G80Yw2mljL16J5vESLwJAlyheZIZPcIEFfvqE%3D&reserved=0.
Hi, Allen. Actually, my species is plants :). I have tried to modify lisa so that I can use in my data. But I find it is hard for me to modify python code. I am a Python rookie╮(╯_╰)╭, so I am rewriting lisa's idea into R code. The extreme similar TF rankings between conditions and above thinking appears while writing R code. I am not sure whether my code have some probelm. I will check my code again.
Thanks again for your reply, it is a pleasure to talks with you. And thanks for this powerful tool, I have learnd a lot from it.
Best wishes Guandong Shang
Hi, Allen. sorry I closed this isusse…… I have checked my code again, I think maybe there is no problem about my code. But I am not fully total sure about it, because I am not family with python code, and I have no ATAC about human in hands. so I can not validate my data result using lisa. But for the extreme similar TF rankings, I have a idea.
This is a example from my result
> library(dplyr)
> allSample_percent_df <- TF_pvalue@metadata$percent_df
# I select percent TF:ARR10 and sample:Hypocotyl_callus_rep1
# I split the percent result into input_genes(N=116) and background_genes(N=3000)
> allSample_percent_df %>%
+ filter(TF_id == "ARR10",
+ sample == "Hypocotyl_callus_rep1") %>%
+ group_split(gene_type) -> a
> a
<list_of<
tbl_df<
gene_id : character
percent : double
gene_type: character
TF_id : character
sample : character
>
>[2]>
[[1]]
# A tibble: 3,000 x 5
gene_id percent gene_type TF_id sample
<chr> <dbl> <chr> <chr> <chr>
1 AT1G01110 0 background_genes ARR10 Hypocotyl_callus_rep1
2 AT1G01250 0 background_genes ARR10 Hypocotyl_callus_rep1
3 AT1G01320 0 background_genes ARR10 Hypocotyl_callus_rep1
4 AT1G01510 0 background_genes ARR10 Hypocotyl_callus_rep1
5 AT1G01720 0 background_genes ARR10 Hypocotyl_callus_rep1
6 AT1G01820 0 background_genes ARR10 Hypocotyl_callus_rep1
7 AT1G01840 0 background_genes ARR10 Hypocotyl_callus_rep1
8 AT1G01880 0 background_genes ARR10 Hypocotyl_callus_rep1
9 AT1G01930 0 background_genes ARR10 Hypocotyl_callus_rep1
10 AT1G02130 0 background_genes ARR10 Hypocotyl_callus_rep1
# ... with 2,990 more rows
[[2]]
# A tibble: 116 x 5
gene_id percent gene_type TF_id sample
<chr> <dbl> <chr> <chr> <chr>
1 AT1G06080 0 input_genes ARR10 Hypocotyl_callus_rep1
2 AT1G06350 0 input_genes ARR10 Hypocotyl_callus_rep1
3 AT1G08890 0 input_genes ARR10 Hypocotyl_callus_rep1
4 AT1G12100 0 input_genes ARR10 Hypocotyl_callus_rep1
5 AT1G12110 0 input_genes ARR10 Hypocotyl_callus_rep1
6 AT1G12740 0 input_genes ARR10 Hypocotyl_callus_rep1
7 AT1G19220 0 input_genes ARR10 Hypocotyl_callus_rep1
8 AT1G19790 0.611 input_genes ARR10 Hypocotyl_callus_rep1
9 AT1G19800 0.0437 input_genes ARR10 Hypocotyl_callus_rep1
10 AT1G22250 0 input_genes ARR10 Hypocotyl_callus_rep1
# ... with 106 more rows
> wilcox.test(a[[2]]$percent, a[[1]]$percent)$p.value
[1] 1.338277e-29
> summary(a[[1]]$percent)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000000 0.000000 0.000000 0.006596 0.000000 0.996826
> summary(a[[2]]$percent)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.00000 0.00000 0.11784 0.02504 0.99058
# I select percent TF:ARR10 and sample:Hypocotyl_organ_rep1
# I split the percent result into input_genes(N=116) and background_genes(N=3000)
> allSample_percent_df %>%
+ filter(TF_id == "ARR10",
+ sample == "Hypocotyl_organ_rep1") %>%
+ group_split(gene_type) -> b
> b
<list_of<
tbl_df<
gene_id : character
percent : double
gene_type: character
TF_id : character
sample : character
>
>[2]>
[[1]]
# A tibble: 3,000 x 5
gene_id percent gene_type TF_id sample
<chr> <dbl> <chr> <chr> <chr>
1 AT1G01110 0 background_genes ARR10 Hypocotyl_organ_rep1
2 AT1G01250 0 background_genes ARR10 Hypocotyl_organ_rep1
3 AT1G01320 0 background_genes ARR10 Hypocotyl_organ_rep1
4 AT1G01510 0 background_genes ARR10 Hypocotyl_organ_rep1
5 AT1G01720 0 background_genes ARR10 Hypocotyl_organ_rep1
6 AT1G01820 0 background_genes ARR10 Hypocotyl_organ_rep1
7 AT1G01840 0 background_genes ARR10 Hypocotyl_organ_rep1
8 AT1G01880 0 background_genes ARR10 Hypocotyl_organ_rep1
9 AT1G01930 0 background_genes ARR10 Hypocotyl_organ_rep1
10 AT1G02130 0 background_genes ARR10 Hypocotyl_organ_rep1
# ... with 2,990 more rows
[[2]]
# A tibble: 116 x 5
gene_id percent gene_type TF_id sample
<chr> <dbl> <chr> <chr> <chr>
1 AT1G06080 0 input_genes ARR10 Hypocotyl_organ_rep1
2 AT1G06350 0 input_genes ARR10 Hypocotyl_organ_rep1
3 AT1G08890 0 input_genes ARR10 Hypocotyl_organ_rep1
4 AT1G12100 0 input_genes ARR10 Hypocotyl_organ_rep1
5 AT1G12110 0 input_genes ARR10 Hypocotyl_organ_rep1
6 AT1G12740 0 input_genes ARR10 Hypocotyl_organ_rep1
7 AT1G19220 0 input_genes ARR10 Hypocotyl_organ_rep1
8 AT1G19790 0.440 input_genes ARR10 Hypocotyl_organ_rep1
9 AT1G19800 0.00611 input_genes ARR10 Hypocotyl_organ_rep1
10 AT1G22250 0 input_genes ARR10 Hypocotyl_organ_rep1
# ... with 106 more rows
# This p-value is extremely same as above sample
# even distributions of percent in two condition are different
> wilcox.test(b[[2]]$percent, b[[1]]$percent)$p.value
[1] 1.016593e-29
> summary(b[[1]]$percent)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000000 0.000000 0.000000 0.003643 0.000000 0.933759
> summary(b[[2]]$percent)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000000 0.000000 0.000000 0.075520 0.006122 0.874863
My input genes is from diffPeak annotated gene between Hypocotyl_organ_rep1 and Hypocotyl_callus_rep1. In other words, my input genes is related with differncen between samples. you can see sumRP of these input gene are different
> TF_result@ExperimentList$sumRP[input_genes, ]
Hypocotyl_organ_rep1 Hypocotyl_callus_rep1
AT5G52890 101.395172 1279.54008
AT1G12740 65.606304 1810.14359
AT1G34120 203.357678 733.48946
AT5G37732 125.337949 1932.86183
AT4G37740 196.505332 597.83896
AT5G35770 250.072106 747.90960
AT3G19200 91.024816 1302.22580
AT2G04480 154.202592 2060.45948
AT2G12646 59.636535 2006.71960
AT1G77765 403.269684 1033.17154
AT3G20840 64.491156 798.64804
AT5G66700 138.845602 1322.56864
AT2G04580 37.006194 361.34312
AT3G18550 70.475219 173.04537
AT2G18120 55.255489 1470.96453
AT1G46264 216.580896 1747.93401
AT1G36940 200.729102 1026.23640
AT5G51451 766.157332 1580.79444
AT1G19800 331.937375 965.48163
AT3G47600 160.706277 798.01884
AT2G23067 170.783332 1873.77644
AT4G36260 73.170546 1654.18899
AT2G22430 1149.473905 1605.54243
AT5G53950 84.860790 884.13810
AT4G12870 55.085374 407.35588
AT5G37730 93.360562 1228.36873
AT2G19970 220.463882 902.16751
AT2G30130 210.605813 700.04511
AT3G45610 84.139922 426.08257
AT2G03090 371.458152 2058.43024
AT1G26773 359.771998 853.50694
AT5G54510 818.285377 1162.14222
AT4G34170 193.343421 224.55508
AT2G22426 1723.689926 2851.63900
AT5G28495 6.153947 82.86117
AT5G28640 215.628894 976.30948
AT3G27400 180.378939 777.77699
AT3G50870 107.845431 512.52363
AT5G42635 121.719141 712.73841
AT1G06080 53.646855 739.31661
AT3G27503 154.111367 1058.35174
AT3G06019 147.308195 381.94281
AT5G17430 49.325227 1566.92487
AT2G30120 351.861100 564.63707
AT1G26770 171.859465 1508.73037
AT1G75530 421.126297 1489.79347
AT2G18360 107.573063 1029.01661
AT1G51190 54.781158 1501.32891
AT2G46870 43.293888 232.18096
AT3G56360 1714.201893 1541.05663
AT5G25000 142.559436 483.09200
AT2G37210 85.342369 323.84704
AT1G69530 370.235634 1957.73562
AT4G29680 820.638873 1472.55516
AT4G27950 366.843554 1623.35903
AT2G25980 162.614290 404.64441
AT5G53290 951.265221 1532.84002
AT5G24320 279.088580 1013.61733
AT1G19790 58.773442 882.08374
AT5G66690 120.463772 334.81094
AT5G54400 128.879373 572.67766
AT5G56970 79.250733 507.78302
AT4G34990 474.935672 1066.73594
AT3G06010 430.163554 546.57569
AT1G65930 1260.303655 1106.56030
AT1G51175 21.145120 424.34071
AT3G61980 1001.490229 713.02154
AT3G51060 62.557936 1417.48312
AT5G22890 335.104572 1631.43850
AT2G34655 503.514844 1102.44136
AT3G62610 782.516252 1397.33820
AT5G20820 191.297901 998.91166
AT4G27960 842.437791 818.28451
AT4G08770 181.998528 1523.30662
AT4G20320 442.227176 1034.25871
AT4G27260 1324.788649 1604.03150
AT1G68360 910.856780 1292.79945
AT1G70900 508.264107 521.00466
AT3G44710 20.955488 182.01842
AT5G28960 12.805698 182.41394
AT5G28490 87.748928 664.30420
AT4G19980 161.614165 2197.52127
AT3G27480 40.991548 284.90439
AT5G57350 303.882994 497.16172
AT1G73600 251.789429 506.97451
AT5G53500 190.889014 699.05178
AT2G01990 310.712576 807.11166
AT5G50300 103.845801 762.44324
AT1G70370 599.142900 421.12218
AT5G04690 108.197990 544.91675
AT5G41315 68.564621 269.06803
AT1G69820 48.138959 151.84389
AT2G42835 16.751711 126.98859
AT3G22540 226.218095 679.63165
AT1G06350 22.100291 122.92428
AT5G20240 483.595468 537.69011
AT2G43590 259.223670 254.65521
AT1G75520 105.536454 911.40449
AT2G40070 210.501033 231.07452
AT1G19220 931.272455 1937.11997
AT1G12100 141.638170 103.76606
AT1G22250 7.397493 18.39594
AT1G08890 1007.841916 991.38129
AT5G02550 147.682265 522.35266
AT3G02242 94.518155 478.15495
AT3G05165 286.305368 238.12595
AT5G23260 55.835849 129.33527
AT3G26747 221.217016 405.30635
AT4G02910 52.271277 51.73700
AT3G20460 235.253169 422.25544
AT1G12110 182.314172 1072.27956
I believe when comparing TF A percents of input genes between condition, we should drop 0 percent genes. after all, these genes are not hit by TF A. Including these 0 percent genes will bias the final result. I think this is part of the reason why p-value of TF A is similar between two condition.
if we just compare percent of input genes between samples, we will get no significant result
> ARR_value_a <- a[[2]]
> ARR_value_b <- b[[2]]
> wilcox.test(ARR_value_a$percent, ARR_value_b$percent)
Wilcoxon rank sum test with continuity correction
data: ARR_value_a$percent and ARR_value_b$percent
W = 6889.5, p-value = 0.7081
alternative hypothesis: true location shift is not equal to 0
if we can throw percent 0 genes, the result will be better.
> wilcox.test(subset(ARR_value_a, percent > 0)$percent,
+ subset(ARR_value_b, percent > 0)$percent)
Wilcoxon rank sum exact test
data: subset(ARR_value_a, percent > 0)$percent and subset(ARR_value_b, percent > 0)$percent
W = 922, p-value = 0.1079
alternative hypothesis: true location shift is not equal to 0
Please forgive me if I confuse you or I misunderstand something :)
Best wishes
Guandong Shang
Just another example about TF B
> TF_fullInfo_list
<list_of<
tbl_df<
gene_id : character
percent : double
gene_type: character
TF_id : character
sample : character
>
>[4]>
[[1]]
# A tibble: 3,000 x 5
gene_id percent gene_type TF_id sample
<chr> <dbl> <chr> <chr> <chr>
1 AT1G01110 0 background_genes LEC1 Hypocotyl_callus_rep1
2 AT1G01250 0 background_genes LEC1 Hypocotyl_callus_rep1
3 AT1G01320 0 background_genes LEC1 Hypocotyl_callus_rep1
4 AT1G01510 0 background_genes LEC1 Hypocotyl_callus_rep1
5 AT1G01720 0 background_genes LEC1 Hypocotyl_callus_rep1
6 AT1G01820 0 background_genes LEC1 Hypocotyl_callus_rep1
7 AT1G01840 0 background_genes LEC1 Hypocotyl_callus_rep1
8 AT1G01880 0 background_genes LEC1 Hypocotyl_callus_rep1
9 AT1G01930 0 background_genes LEC1 Hypocotyl_callus_rep1
10 AT1G02130 0 background_genes LEC1 Hypocotyl_callus_rep1
# ... with 2,990 more rows
[[2]]
# A tibble: 3,000 x 5
gene_id percent gene_type TF_id sample
<chr> <dbl> <chr> <chr> <chr>
1 AT1G01110 0 background_genes LEC1 Hypocotyl_organ_rep1
2 AT1G01250 0 background_genes LEC1 Hypocotyl_organ_rep1
3 AT1G01320 0 background_genes LEC1 Hypocotyl_organ_rep1
4 AT1G01510 0 background_genes LEC1 Hypocotyl_organ_rep1
5 AT1G01720 0 background_genes LEC1 Hypocotyl_organ_rep1
6 AT1G01820 0 background_genes LEC1 Hypocotyl_organ_rep1
7 AT1G01840 0 background_genes LEC1 Hypocotyl_organ_rep1
8 AT1G01880 0 background_genes LEC1 Hypocotyl_organ_rep1
9 AT1G01930 0 background_genes LEC1 Hypocotyl_organ_rep1
10 AT1G02130 0 background_genes LEC1 Hypocotyl_organ_rep1
# ... with 2,990 more rows
[[3]]
# A tibble: 235 x 5
gene_id percent gene_type TF_id sample
<chr> <dbl> <chr> <chr> <chr>
1 AT1G04443 0.562 input_genes LEC1 Hypocotyl_callus_rep1
2 AT1G04610 0 input_genes LEC1 Hypocotyl_callus_rep1
3 AT1G04983 0.958 input_genes LEC1 Hypocotyl_callus_rep1
4 AT1G05153 0 input_genes LEC1 Hypocotyl_callus_rep1
5 AT1G06080 0.923 input_genes LEC1 Hypocotyl_callus_rep1
6 AT1G06090 0.337 input_genes LEC1 Hypocotyl_callus_rep1
7 AT1G06350 0 input_genes LEC1 Hypocotyl_callus_rep1
8 AT1G06990 0.685 input_genes LEC1 Hypocotyl_callus_rep1
9 AT1G07937 0.677 input_genes LEC1 Hypocotyl_callus_rep1
10 AT1G08133 0.328 input_genes LEC1 Hypocotyl_callus_rep1
# ... with 225 more rows
[[4]]
# A tibble: 235 x 5
gene_id percent gene_type TF_id sample
<chr> <dbl> <chr> <chr> <chr>
1 AT1G04443 0.0848 input_genes LEC1 Hypocotyl_organ_rep1
2 AT1G04610 0 input_genes LEC1 Hypocotyl_organ_rep1
3 AT1G04983 0.608 input_genes LEC1 Hypocotyl_organ_rep1
4 AT1G05153 0 input_genes LEC1 Hypocotyl_organ_rep1
5 AT1G06080 0.400 input_genes LEC1 Hypocotyl_organ_rep1
6 AT1G06090 0.0978 input_genes LEC1 Hypocotyl_organ_rep1
7 AT1G06350 0 input_genes LEC1 Hypocotyl_organ_rep1
8 AT1G06990 0.181 input_genes LEC1 Hypocotyl_organ_rep1
9 AT1G07937 0.255 input_genes LEC1 Hypocotyl_organ_rep1
10 AT1G08133 0.0998 input_genes LEC1 Hypocotyl_organ_rep1
# ... with 225 more rows
# Hypocotyl_callus_rep1
# input genes vs background genes
> wilcox.test(TF_fullInfo_list[[3]]$percent, TF_fullInfo_list[[1]]$percent, alternative = "greater")$p.value
[1] 4.148777e-156
# Hypocotyl_organ_rep1
# input genes vs backaground genes
> wilcox.test(TF_fullInfo_list[[4]]$percent, TF_fullInfo_list[[2]]$percent, alternative = "greater")$p.value
[1] 1.176709e-154
you can see two result looks similar, but the distribution of percent is very different
> summary(TF_fullInfo_list[[3]]$percent)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.07185 0.41670 0.44498 0.82030 1.00000
> summary(TF_fullInfo_list[[4]]$percent)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.01147 0.11733 0.26979 0.49835 1.00000
# it seems that TF B has more influnec on input genes in # Hypocotyl_callus_rep1
> median(TF_fullInfo_list[[3]]$percent) - median(TF_fullInfo_list[[1]]$percent)
[1] 0.4167038
> median(TF_fullInfo_list[[4]]$percent) - median(TF_fullInfo_list[[2]]$percent)
[1] 0.1173348
Best wishes Guandong Shang
Sorry I misused test data in above example, but the final conclusion are same. The TF p-value or rank is robust in different condition when input genes are same. But I did find some thing interesting about wilcox.test in https://github.com/AllenWLynch/lisa/issues/3
Hi Guandong,
I see the issue with the wilcoxon test in this example, and it may be a result of the sparseness of your proximal regions compared to human/mouse samples, where largely ISD scores are > 0, atleast for most query genes. Keeping zeros between query and background genes and removing them ask two different questions. Keeping zeros asks are query genes more likely to be influenced by this TF's binding than background genes (the core LISA question), while removing the zeros asks "given genes that may be influenced by this TF, are query genes more influenced than background genes". I actually think both questions are distinct, but meaningful, and will produce different results for TF p-values. In your case, it is your discretion which question you answer, but it appears that the later may be more appropriate.
This is a really interesting find. Also, I admire your resolved to implement the code in R! Do you have your own ChIP-seq examples as well?
Allen
Hi, Allen Thanks, I believe sparseness of the proximal regions is the the answer of similar p-value !!! My species is Arabidopsis thaliana, which have 3w+ genes while only 2-3w peak. Its genome is only 100+ MB. And the decay dist I set is 1000bp, which I believe is more suitable for Arabidopsis thaliana. After all, there is no significant reason shows that a general long-distance interaction appears on At's genome. Indeed, if I set 1000bp as the decay dist, condiser the small genome and peak numebr, most gene may only have "one or two proximal peak". So it will makes most query genes have score > 0, and the percent will not change even gene's RP have changed a lot. As for the ChIP-Seq data, I am very "jealous" that humans/mouse have good databases like Cistrome DB :). Even Arabidopsis thaliana is the model system for plant, its database for TF ChIP-Seq is poor. So most time the data I used for test is the result of motif-scan for my ATAC-peak.
Best wishes Guandong Shang
Hi, I am trying use lisa to my own ATAC samples. Now I have a Nm matrix and a input gene list, which N represent N ATAC peak region and M reprsent M samples. I am wordeing whether I can apply lisa in more one samples. I noticed that lisa offer the FromRegions* test, but I find it seems that it only works for one sample according to your user_guide. And I find in origin paper, Lisa use the L1 regression to find related samples instead of dealing with my own matched samples. So I am wondering whether you can give me some advices about how to use lisa to serveral samples.
And thanks for this wonderful tool.
Best wishes Guandong Shang