cerebis / meta-sweeper

Parametric sweep of simulated microbial communities and metagenomic sequencing.
GNU General Public License v3.0
10 stars 0 forks source link

Treatment of sparse sites and wgs read-pair noise #69

Closed cerebis closed 7 years ago

cerebis commented 7 years ago

These are really separate issues.

  1. Sparse restriction sites.
  2. Noise from (leakage of) conventional WGS reads

We presently ignore the distance between randomly chosen genomic coordinates and the distance to the nearest restriction site. The algorithm will always return a site, no matter the distance. This does not reflect experimental observation, where for a given mock community and enzyme, there may be few ligation products when the GC% and site are a poor match.

One approach is as follows:

We can impose a condition when finding a site during simulation, so that the algorithm may only stray so far from the initial point. When failing, we can fall back to producing a WGS read or none at all. The condition could follow a geometric distribution (1-Q(d)) and be decided against an unif() toss.

i.e. unif() < 1 - Q(d)

where d is the separation between the nearest site and the random position.

koadman commented 7 years ago

hmm. i'm a big fan of simple but i'm not sure this simplification captures all of the important mechanics of either 3C or Hi-C. It seems pretty close for 3C, the only change I would suggest is to use normal (as in mechanical shearing) or lognormal (as in tagmentation) instead of the geometric for the condition. There is also the 2nd concern of whether a cut site finds another cut site for proximity ligation -- this happens less frequently when cut sites are sparse, so you get a free end. If the subsequent lib prep happens by adapter ligation those free ends will show up at the start of reads, but if nextera is used they will remain unseen in the data.

cerebis commented 7 years ago

It easy to insert any relationship into the model, so I can change it.

That said, I would really like to have some experimental observations to guide the work. I know there is data available, but I have not yet had the opportunity analyze it to answer my questions here. This is sort of a new tack in the simulation.

The way I have refactored the code, we perhaps could make a decision on the ensemble rather than jump to an early termination/logical path.

Two parts A and B. These, under normal conditions, would form a ligation product. Each part has the following detail.

Part: 
  chr,         # chromsome source
  loc,         # a random selected genomic coordinate on chr
  site,        # the nearest restriction site to loc
  delta,       # abs(loc - site)
  frag_length  # the length of the partial ligation product

Now, if we wish to have other outcomes besides a ligation product.

  1. a ligation product of the type produced from the given library protocol (3C/HiC/xxxC)
  2. a spurious ligation product
  3. a total failure, no sequence data produced
  4. a wgs read
    • for inter-chrom, which gets the read? i. chr_A or chr_B ii. chr_A and chr_B

If this is modelled well enough, would the difference between 3C and HiC be just that the biotin enrichment results in a simple additional non-optimal pass filter?

cerebis commented 7 years ago

The rewrite of sim3C has cleared up an issue with sparsity, in that site density now biases replicon selection probability.

However, the nearest-neighbour method still does not fail when invoked regardless of the required shift between x=rand() and the nearest site to x.

Instead, for meta3C we test the condition of sparsity prior to going to this depth in the call hierarchy.

cerebis commented 7 years ago

commit 1777b2f