Ability to simulate simple 3D structure in Hi-C/3C reads

koadman commented 8 years ago

We have discussed various ways of doing this. Some options we covered were:

drawing read pairs with frequencies given by a user-provided contact map
drawing read pairs from a parameterized function which specifies, e.g. circularity or some other chromosome conformation

Challenges involved were/are:

[x] simulating sequencing error profiles with read pairs at arbitrary distance, which we thought could be addressed by using artsim or some other tool to simulate an excess of coverage and then drawing out read pairs at appropriate distances (that weren't originally simulated as a pair)
[ ] A user-provided contact map may not have the same genomic range as the evolved genomes. e.g. if a user provides a 5000x5000 contact map in 1kbp bins, this would have to be scaled to the size of each simulated genome.
[ ] It is not clear what other parametric functions might be appropriate, other than the one for circularity which specifies additional contact between chromosome arms.

cerebis commented 7 years ago

First port of call, the easiest structural detail to approximate is the single primary fold of a circular chromosome, leading to a second anti-diagonal intensity profile in a heatmap. I would suggest this can be represented as a reflection of the primary distribution with a lower intensity.

cerebis commented 7 years ago

Are we treating 3D structure in a secondary or tertiary sense? That is: low observational frequency driven by inaccessible regions due to tight folding, or lets say increased obs freq because of physical proximity due to consistent 3D structural elements (folds)?

koadman commented 7 years ago

First port of call sounds great. Wouldn't 2ndary & tertiary become mixed in the observed Hi-C read pair frequencies? I think it's probably enough to consider just the tertiary at first.

cerebis commented 7 years ago

Prototyping generators.

Initially, I made the mistake of thinking that the anti-diagonal was product of interacting flattened circular chromosomes. Visually, pull a rubber band tight, and opposing strands come into close proximity.

In pursuing that, I cobbled together a multi-modal variation of our empirical distribution, where the second smaller maxima was at the halfway point and followed the same geometrical relationship we've imposed. Well, that turns out to make parallel lines of density in the heatmap (parallel to the primary y=x), with a simple phase shifts of +L/2, -L/2. Sorta, duh! Clearly that won't creating orthogonal traces in the density, where a rotation is necessary, a simple transformation (y=L-y).

So, along comes the correct treatment. If you combine this with the above (because well, I had it coded) you get something that looks like this. There is no need to keep the multi-modal distribution, but its a nice patch-work quilt.

figure_1

or with the regular distribution

figure_1

One thing that is bothering me is the apparent aliasing within these heatmaps. I am hopeful that this is an artifact on the plotter.

koadman commented 7 years ago

Nice one, looks warm & cozy. looking at this i notice now that in some of the real datasets we've looked at there are bigger blobs of heat in the top right & bottom left than are showing up in your simulated data. i guess that suggests some additional contact in the origin region in the population, beyond what can be explained by the flattened circle model.

cerebis commented 7 years ago

Possible confirmation that streaking is related to Matplotlib. The same matrix as above in R heatmap.

rfig

cerebis commented 7 years ago

Now fully derived from simulated reads => BWA => contact map. This was from a community of 8 genomes, derived in two clades of 5 and 3 taxa.

Doesn't look bad. Only 50k pairs, so intensity is weak. (Lowered CM resolution to gain signal.)

outd raw

koadman commented 7 years ago

nice. so the off-diagonal blocks with heat are presumably the result of closely related strains that have had mismapped read pairs? i see a cluster of 3 there, and the other 5 must have one pair that are close as well... anyhow, looks like it's working. would you agree?

cerebis commented 7 years ago

First prototype of random intervals representing CIDs. Each interval has its own empirical distribution, so can be refined ad nauseum. The probability of being CID related is too high here (starving the full range), but was aimed and seeing their result.

Starting to remind me of 1D mappings.

intervals

cerebis commented 7 years ago

Taking the terminology from Tung et al, I have implemented a model of replicon CID regions. These chromosomal interaction domains are what attributed in producing the block density in the heatmap. Stay tuned for a pic via read-mapping.

cerebis commented 7 years ago

cid_1gen raw

My python script only does the upper triangle. The CIDs are randomly generated and in this case turned out a little sparse. (single 3Mbp genome, 250k HiC pairs)

cerebis commented 7 years ago

cid_1gen raw

new seed, 3Mbp genome, 1M HiC pairs

cerebis commented 7 years ago

comparison

Critique against the figure in the paper

The detail is hard to see due to insufficient signal at the given resolution (10 sites per bin).
Some smaller sites are possibly too small and the minimum could be raised.
I need to confirm that the depletion of the background (whole chromosome) distribution is happening along both axes. This may only be happening along the x-axis, suggesting a small oversight in the generation of coordinates.
real-data: are the CIDs visible on the anti-diagonal? Maybe?
real-data: are the upper and lower halves reflections?
intensity of CIDs is possibly too high or the distribution needs more fall-off over its extent. (Geometric shape parameter in this case)

koadman commented 7 years ago

this looks really neat. i don't have a good sense in how much value there is in perfecting it beyond this point -- maybe need to see some use cases for guidance.

cerebis commented 7 years ago

31925f52636c83b1aedcb969d29c3180410665e0

cerebis commented 7 years ago

I think I will close this issue for now.

cerebis commented 7 years ago

A new method for generating CID regions has been implemented. Rather than random intervals, this approach assumes folding regions (interacting domains) are better modelled as a nested (hierarchical) set of intervals.

I am making a note here and posting an example contact map. This map is by generated coordinates only, not a full simulation with actual read-mapping.

In implementing this method, there is a new issue #70 to fine-tune it. This is regarding the sizes of intervals, the probabilities assigned to them.

This figure depicts E.coli chr1, fake chr2 and S.aures chr1. Therefore there are also inter-replicon contacts between chr1 and chr2 of E.coli.

nested

cerebis / meta-sweeper

Ability to simulate simple 3D structure in Hi-C/3C reads #39