aidenlab / Juicebox

Visualization and analysis software for Hi-C data -
https://aidenlab.org/juicebox
MIT License

Looplist Stats #348

Open theaidenlab opened 8 years ago

theaidenlab commented 8 years ago

Erez would like stats to pop up when looplists are loaded (or some good way of viewing looplist stats), including:

theaidenlab commented 8 years ago

For Reference

On Fri, Apr 22, 2016 at 3:22 AM, Erez Lieberman Aiden erez@erez.com wrote:

I was curious so I made a simple model of loop call accuracy as a function of convergent %. It assumes that, for any given loop, there is a unique anchor at each end, and each anchor is identified independently with some fixed probability. If the anchor is called correctly, it always points inward wrt the loop; otherwise it points inward with probability 50%. To call a loop correctly you need to call both anchors.

The result is:

https://www.wolframalpha.com/input/?i=A+%3D+4*C-4*sqrt(C)%2B1+plot+from+C%3D1%2F4+to+C%3D1

(if above plot timing out / not showing) https://www.wolframalpha.com/input/?i=plot+4*c-4*sqrt(c)%2B1,+c%3D0.25+to+1
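One way to arrive at this closed form from the stated assumptions is sketched below. The symbols p (probability that a single anchor is called correctly) and q (probability that a called anchor points inward) are introduced here only for the derivation; they do not appear in the original email.

```latex
\begin{align*}
q &= p + \tfrac{1}{2}(1 - p) = \tfrac{1 + p}{2}
  && \text{(a correct anchor always points inward, a wrong one half the time)} \\
C &= q^2
  && \text{(a convergent call needs both anchors pointing inward)} \\
A &= p^2 = (2q - 1)^2 = \left(2\sqrt{C} - 1\right)^2 = 4C - 4\sqrt{C} + 1
  && \text{(a correct loop call needs both anchors correct)}
\end{align*}
```

With p = 0 this gives C = 1/4 and A = 0, and with p = 1 it gives C = A = 1, which matches the plotted range from C = 1/4 to C = 1.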

It's a bit surprising to see how slowly accuracy changes despite large deviations in convergent % above 25%.

For instance, at 70% convergent, accuracy is only 46% - i.e., 54% of the loops correspond to pairs of loci that do not loop to one another. For Hnisz & Weintraub, et al., the convergence was ~80%, yielding a not-stellar accuracy of 62%. At 90% convergent (what we report), accuracy is 80%. It's pretty interesting that going from c% of 70 to c% of 90 would get you such stark differences - nearly 2-fold - in accuracy.
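The figures quoted above can be checked by evaluating the closed form A = 4C - 4*sqrt(C) + 1 directly. The following is a minimal Python sketch; the function name `accuracy_from_convergence` is illustrative and is not part of Juicebox.

```python
import math


def accuracy_from_convergence(c):
    """Loop call accuracy A = 4C - 4*sqrt(C) + 1 under the simple anchor model.

    The model is only meaningful for 0.25 <= c <= 1, where c = 0.25 is what
    purely random anchor orientations would give.
    """
    if not 0.25 <= c <= 1.0:
        raise ValueError("convergent fraction must lie in [0.25, 1]")
    return 4.0 * c - 4.0 * math.sqrt(c) + 1.0


# Evaluate the model at the convergence levels discussed in the email.
for c in (0.70, 0.80, 0.90):
    a = accuracy_from_convergence(c)
    print(f"C = {c:.2f}: accuracy = {a:.3f}, error rate = {1.0 - a:.3f}")
```

The error rate printed on each line is simply 1 minus the accuracy, i.e. the fraction of called loops whose anchor pair is wrong under this model.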

On Fri, Apr 22, 2016 at 3:37 AM, Erez Lieberman Aiden erez@erez.com wrote:

Tang et al., the Dec paper from the Ruan lab, get 64% convergent, or roughly 37% accuracy.

Mango claims 94% convergence using RAD21 ChIA-PET, or 88% accuracy! For H3K4me3, on which it does worst at getting convergent CTCFs, Mango claims 74%, which is only 52% accuracy - a notably high level of disparity given that it is the same pipeline. Pol2 seems much closer to fine.

On Mon, Apr 25, 2016 at 5:21 PM, Erez Lieberman Aiden erez@erez.com wrote:

Summary and a few more notes.

From Earlier Email: "I was curious so I made a simple model of loop call accuracy as a function of convergent %. It assumes that, for any given loop, there is a unique anchor at each end, and each anchor is identified independently with some fixed probability. If the anchor is called correctly, it always points inward wrt the loop; otherwise it points inward with probability 50%. To call a loop correctly you need to call both anchors.

The result is:

https://www.wolframalpha.com/input/?i=A+%3D+4*C-4*sqrt(C)%2B1+plot+from+C%3D1%2F4+to+C%3D1

(if above plot timing out / not showing) https://www.wolframalpha.com/input/?i=plot+4*c-4*sqrt(c)%2B1,+c%3D0.25+to+1

A simple rule of thumb/approximation is that you get 1% of accuracy for every % of convergence above 25% up to 75%; and then 2% accuracy for every % of convergence above 75%.
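To see how close this rule of thumb stays to the exact curve, the sketch below scans C from 25% to 100% and reports the largest disagreement. The function names are illustrative, and the 0.50 offset used to join the two linear segments at C = 0.75 is an assumption about how the rule is meant to be read.

```python
import math


def exact_accuracy(c):
    """Closed form from the model: A = 4C - 4*sqrt(C) + 1."""
    return 4.0 * c - 4.0 * math.sqrt(c) + 1.0


def rule_of_thumb_accuracy(c):
    """Piecewise-linear approximation described in the text.

    One point of accuracy per point of convergence between 25% and 75%,
    then two points of accuracy per point of convergence above 75%.
    The 0.50 offset at c = 0.75 is an assumed joining value.
    """
    if c <= 0.75:
        return c - 0.25
    return 0.50 + 2.0 * (c - 0.75)


def gap(c):
    """Absolute disagreement between the exact curve and the rule of thumb."""
    return abs(exact_accuracy(c) - rule_of_thumb_accuracy(c))


# Scan C from 25% to 100% in 1% steps and find the largest disagreement.
grid = [0.25 + 0.01 * i for i in range(76)]
worst_c = max(grid, key=gap)
worst_gap = gap(worst_c)
print(f"largest gap: {worst_gap:.3f} at C = {worst_c:.2f}")
```

On this grid the approximation stays within roughly eight percentage points of the exact value, with the largest gap in the lower half of the range and near-exact agreement above about 90% convergence.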

Here is the reverse, convergence rate as a function of Loop Call Accuracy:

https://www.wolframalpha.com/input/?i=C%3D(1%2F4)*(A%2B2*Sqrt(A)%2B1)+plot+from+A%3D0+to+A%3D1

(if above plot timing out / not showing) https://www.wolframalpha.com/input/?i=plot+(1%2F4)*(A%2B2*Sqrt(A)%2B1),+A%3D0+to+1
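The forward and reverse formulas can be checked against each other with a round trip. This is a small Python sketch with illustrative function names; it only restates the two expressions linked above.

```python
import math


def accuracy_from_convergence(c):
    """Forward direction: A = 4C - 4*sqrt(C) + 1."""
    return 4.0 * c - 4.0 * math.sqrt(c) + 1.0


def convergence_from_accuracy(a):
    """Reverse direction: C = (1/4) * (A + 2*sqrt(A) + 1).

    Equivalent to ((1 + sqrt(A)) / 2) ** 2, which makes the range
    C in [0.25, 1] for A in [0, 1] easy to see.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    return 0.25 * (a + 2.0 * math.sqrt(a) + 1.0)


# Round trip: accuracy -> expected convergence -> accuracy.
for a in (0.0, 0.25, 0.50, 0.90, 1.0):
    c = convergence_from_accuracy(a)
    back = accuracy_from_convergence(c)
    print(f"A = {a:.2f} -> C = {c:.4f} -> A = {back:.4f}")
```

Read in this direction, the model answers the question "what convergent fraction should a loop list show if a given share of its calls is correct?".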

Other stuff:

Define Convergence Rate := C. Then:
Tandem Forward = sqrt(C) - C
Tandem Reverse = sqrt(C) - C
Divergent = (1 - sqrt(C))^2

(This is in some sense a "standard distribution" that we should be seeing all over the place as people do these convergence tests. Again, this model basically assumes that you call loops in roughly the right places and then tend to screw up the anchors.)
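The distribution above can be sanity-checked by simulating the model directly. The sketch below is a Monte Carlo illustration, not code from Juicebox: the function names, the anchor-correctness probability of 0.9, and the convention that "left anchor inward, right anchor outward" counts as tandem forward are all assumptions made for the example.

```python
import math
import random


def simulate_loops(p_anchor_correct, n_loops, seed=0):
    """Simulate the anchor model and return observed class fractions.

    Each anchor is called correctly with probability p_anchor_correct.
    A correct anchor always points inward; a wrong one points inward
    half of the time. A loop is accurate only if both anchors are correct.
    """
    rng = random.Random(seed)
    counts = {"convergent": 0, "tandem_forward": 0,
              "tandem_reverse": 0, "divergent": 0, "accurate": 0}

    def call_anchor():
        correct = rng.random() < p_anchor_correct
        inward = correct or rng.random() < 0.5
        return correct, inward

    for _ in range(n_loops):
        left_ok, left_in = call_anchor()
        right_ok, right_in = call_anchor()
        if left_ok and right_ok:
            counts["accurate"] += 1
        if left_in and right_in:
            counts["convergent"] += 1
        elif left_in:
            counts["tandem_forward"] += 1   # assumed labeling convention
        elif right_in:
            counts["tandem_reverse"] += 1   # assumed labeling convention
        else:
            counts["divergent"] += 1
    return {name: n / n_loops for name, n in counts.items()}


def predict_from_convergence(c):
    """Model predictions that use only the observed convergent fraction."""
    root = math.sqrt(c)
    return {
        "tandem_forward": root - c,
        "tandem_reverse": root - c,
        "divergent": (1.0 - root) ** 2,
        "accurate": 4.0 * c - 4.0 * root + 1.0,
    }


observed = simulate_loops(p_anchor_correct=0.9, n_loops=200_000, seed=1)
predicted = predict_from_convergence(observed["convergent"])
for name, value in predicted.items():
    print(f"{name:15s} predicted {value:.4f}  simulated {observed[name]:.4f}")
```

The predictions use only the simulated convergent fraction, mirroring how the formulas would be applied to a real loop list where the true anchor accuracy is unknown.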

And still more stuff:

Total Tandem = TT = 2 * (sqrt(C) - C)
Loop Call Accuracy = A = 4 * C - 4 * sqrt(C) + 1 = 1 - 2 * TT
Error Rate = E = 2 * TT
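These quantities are tied together by a few identities: the orientation classes sum to one, the two expressions for A agree, and A + E = 1. A short Python sketch (with an illustrative function name) that computes them from C and checks the identities numerically:

```python
import math


def model_summary(c):
    """Return the model quantities derived from the convergent fraction c."""
    root = math.sqrt(c)
    total_tandem = 2.0 * (root - c)          # TT = 2 * (sqrt(C) - C)
    divergent = (1.0 - root) ** 2            # D = (1 - sqrt(C))^2
    accuracy = 1.0 - 2.0 * total_tandem      # A = 1 - 2 * TT
    error_rate = 2.0 * total_tandem          # E = 2 * TT
    return {
        "convergent": c,
        "total_tandem": total_tandem,
        "divergent": divergent,
        "accuracy": accuracy,
        "error_rate": error_rate,
    }


summary = model_summary(0.90)
for name, value in summary.items():
    print(f"{name:13s} {value:.4f}")

# The three orientation classes should account for every loop.
class_total = summary["convergent"] + summary["total_tandem"] + summary["divergent"]
print(f"C + TT + D = {class_total:.4f}")
```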

Examples:

Rao and Huntley et al., Cell, 2014 (GM12878 in situ Hi-C)
C=0.90
TT(Theoretical)=0.096, D(Th)=0.0025
TT(Observed)=0.096, D(Obs)=0.0035
Accuracy: 81% (based on observed C)
Accuracy: 81% (based on observed TT)

Tang et al., Cell, 2015 (CTCF ChIA-PET)
C=0.645
TT(Th)=0.316, D(Th)=0.039
TT(Obs)=0.331, D(Obs)=0.024
Accuracy: 37% (based on observed C)
Accuracy: 34% (based on observed TT)

(All of this cogently underlines the point that the Ruan tandem rates, reported as real, are in fact consistent with what you would expect due to error when calling convergent loops)

Hnisz & Weintraub, et al., Science, 2016 (CTCF ChIA-PET)
C~=0.82
TT(Theoretical)~=0.17, D(Th)~=0.01
TT(Observed)~=0.17, D(Obs)~=0.01
Accuracy: ~66% (based on observed C)
Accuracy: ~66% (based on observed TT)

Phanstiel et al., Bioinformatics, 2015 (RAD21 ChIA-PET)
C=0.94
TT(Th)=0.059, D(Th)=0.001
D(Obs) negligible
Accuracy: 88% (based on observed C)
Accuracy: 88% (based on observed TT)

So the model seems to work ok. I think it implies convincingly that the tandem loops, like many classes of loops before them, are all statistical artifacts; even the rates of tandem loops observed in these experiments can be predicted more-or-less using the convergent rule.
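The comparison in the examples above can be re-run with a few lines of Python. This is a sketch under stated assumptions: the observed C and TT values are copied from the examples as quoted (the Hnisz & Weintraub values are approximate there, and no observed TT is quoted for Phanstiel et al.), and the function and variable names are illustrative rather than anything from the Juicebox code base.

```python
import math

# (study, observed C, observed TT or None), as quoted in the examples.
REPORTED = [
    ("Rao and Huntley et al. 2014", 0.90, 0.096),
    ("Tang et al. 2015", 0.645, 0.331),
    ("Hnisz & Weintraub et al. 2016", 0.82, 0.17),   # approximate values
    ("Phanstiel et al. 2015", 0.94, None),           # observed TT not quoted
]


def theoretical_tandem(c):
    """TT predicted from the convergent fraction: 2 * (sqrt(C) - C)."""
    return 2.0 * (math.sqrt(c) - c)


def accuracy_from_c(c):
    """Accuracy estimated from the convergent fraction."""
    return 4.0 * c - 4.0 * math.sqrt(c) + 1.0


def accuracy_from_tt(tt):
    """Accuracy estimated from the total tandem fraction: 1 - 2 * TT."""
    return 1.0 - 2.0 * tt


def summarize(reported):
    """Compute theoretical TT and both accuracy estimates for each study."""
    rows = []
    for study, c, tt_obs in reported:
        rows.append({
            "study": study,
            "tt_theory": theoretical_tandem(c),
            "tt_observed": tt_obs,
            "acc_from_c": accuracy_from_c(c),
            "acc_from_tt": None if tt_obs is None else accuracy_from_tt(tt_obs),
        })
    return rows


for row in summarize(REPORTED):
    acc_tt = "n/a" if row["acc_from_tt"] is None else f"{row['acc_from_tt']:.0%}"
    print(f"{row['study']:32s} TT(th) = {row['tt_theory']:.3f}  "
          f"A(from C) = {row['acc_from_c']:.0%}  A(from TT) = {acc_tt}")
```

When both estimates are available, their agreement is a quick consistency check on whether a loop list follows the model: a large gap between them would suggest the tandem rate is not explained by anchor-calling error alone.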

sa501428 commented 7 years ago

This would be a good project to complete. Note that some of the stats are implemented in Java under the comparelists command line tool.

sa501428 commented 5 years ago

@zozo123 this may be relevant