cerebis closed this issue 7 years ago
hmm. I'm a big fan of simple, but I'm not sure this simplification captures all of the important mechanics of either 3C or Hi-C. It seems pretty close for 3C; the only change I would suggest is to use a normal (as in mechanical shearing) or lognormal (as in tagmentation) distribution instead of the geometric for the condition. There is also a second concern: whether a cut site finds another cut site for proximity ligation. This happens less frequently when cut sites are sparse, so you get a free end. If the subsequent library prep happens by adapter ligation, those free ends will show up at the start of reads, but if Nextera is used they will remain unseen in the data.
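To make the distinction concrete, here is a minimal sketch of sampling fragment lengths under the two distributions suggested above. The function name, parameter values, and the mean/sd parameterisation of the lognormal are all illustrative assumptions, not taken from sim3C itself.

```python
import numpy as np

rng = np.random.default_rng(42)

def draw_fragment_lengths(n, method="shearing", mean=500.0, sd=150.0):
    """Sample n insert lengths (a sketch; parameters are illustrative).

    Mechanical shearing is often approximated by a normal distribution,
    tagmentation (Nextera) by a lognormal.
    """
    if method == "shearing":
        lengths = rng.normal(mean, sd, size=n)
    elif method == "tagmentation":
        # convert the desired mean/sd into lognormal mu/sigma parameters
        sigma2 = np.log(1.0 + (sd / mean) ** 2)
        mu = np.log(mean) - 0.5 * sigma2
        lengths = rng.lognormal(mu, np.sqrt(sigma2), size=n)
    else:
        raise ValueError(f"unknown method: {method}")
    # clip to a positive minimum and return integer lengths
    return np.clip(lengths, 1, None).astype(int)
```

Either distribution drops in where the geometric currently sits; only the sampler changes.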
It is easy to insert any relationship into the model, so I can change it.
That said, I would really like to have some experimental observations to guide the work. I know there is data available, but I have not yet had the opportunity to analyze it to answer my questions here. This is sort of a new tack in the simulation.
The way I have refactored the code, we could perhaps make a decision on the ensemble rather than jumping to an early termination/logical path.
Two parts, A and B. These, under normal conditions, would form a ligation product. Each part has the following details.
Part:
    chr,         # chromosome source
    loc,         # a randomly selected genomic coordinate on chr
    site,        # the nearest restriction site to loc
    delta,       # abs(loc - site)
    frag_length  # the length of the partial ligation product
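The structure above can be written down directly; a minimal sketch as a Python dataclass, with `delta` derived from `loc` and `site` rather than stored (the class name and derived-property choice are my assumptions):

```python
from dataclasses import dataclass

@dataclass
class Part:
    """One half of a (potential) proximity-ligation product."""
    chr: str          # chromosome source
    loc: int          # a randomly selected genomic coordinate on chr
    site: int         # the nearest restriction site to loc
    frag_length: int  # the length of the partial ligation product

    @property
    def delta(self) -> int:
        # absolute separation between the random coordinate and its site
        return abs(self.loc - self.site)
```

Computing `delta` on demand keeps the record free of redundant state that could drift out of sync.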
Now, suppose we wish to allow other outcomes besides a ligation product.
If this is modelled well enough, would the difference between 3C and Hi-C be just that the biotin enrichment results in a simple additional, imperfect pass filter?
The rewrite of sim3C has cleared up an issue with sparsity, in that site density now biases replicon selection probability.
However, the nearest-neighbour search still never fails when invoked: it returns a site regardless of how far the randomly chosen coordinate x = rand() is from the nearest site to x.
Instead, for meta3C we test the condition of sparsity prior to going to this depth in the call hierarchy.
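One way to make the nearest-neighbour lookup capable of failing is to bound it by a maximum allowed separation. A hypothetical sketch using a sorted site list and binary search; the function name and `max_delta` parameter are my inventions, not sim3C's API:

```python
from bisect import bisect_left

def nearest_site(sites, x, max_delta=None):
    """Return the restriction site nearest to x, or None if the nearest
    site lies further than max_delta away.

    sites must be a sorted list of site coordinates. With max_delta=None
    this reproduces the current always-succeeds behaviour.
    """
    i = bisect_left(sites, x)
    # the nearest site is one of the (at most two) flanking candidates
    candidates = sites[max(0, i - 1):i + 1]
    if not candidates:
        return None
    best = min(candidates, key=lambda s: abs(s - x))
    if max_delta is not None and abs(best - x) > max_delta:
        return None
    return best
```

Returning `None` gives the caller a natural hook for the fall-back behaviours discussed below (emit a WGS read, or no read at all).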
commit 1777b2f
These are really separate issues.
We presently ignore the distance between a randomly chosen genomic coordinate and the nearest restriction site. The algorithm will always return a site, no matter the distance. This does not reflect experimental observation, where for a given mock community and enzyme there may be few ligation products when the enzyme's recognition site is a poor match for the community's GC%.
One approach is as follows:
We can impose a condition when finding a site during simulation, so that the algorithm may only stray so far from the initial point. On failure, we can fall back to producing a WGS read, or no read at all. The condition could follow a geometric distribution,

    1 - Q(d)

and be decided against a unif() toss, i.e.

    unif() < 1 - Q(d)

where d is the separation between the nearest site and the random position.
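A minimal sketch of that acceptance test, interpreting Q(d) as the geometric CDF so that 1 - Q(d) = (1 - p)**d decays with distance (nearby sites are almost always kept, distant ones usually rejected). The success parameter p = 0.01 is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(7)

def accept_site(d, p=0.01):
    """Accept a site at distance d with probability (1 - p)**d.

    This is the geometric survival probability: the condition
    unif() < 1 - Q(d) from the text, with Q taken as the CDF.
    p is an assumed, tunable parameter, not a value from sim3C.
    """
    return rng.uniform() < (1.0 - p) ** d
```

On rejection, the caller would produce a WGS read or drop the event entirely, as described above.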