cnobles / iGUIDE

Bioinformatic pipeline for identifying dsDNA breaks by marker based incorporation, such as breaks induced by designer nucleases like Cas9.
https://iguide.readthedocs.io/en/latest/
GNU General Public License v3.0

What does "pooling" mean? #85

Closed iamjli-arsenal closed 10 months ago

iamjli-arsenal commented 10 months ago

Hi there. The docs mention that "Replicates will be pooled during the final analysis". Do you have additional details on what's happening during this step? (i.e. are reads from replicates merged? is each sample run individually and the union of the results taken?)

Thanks!

cnobles commented 10 months ago

Yo! Pooling here just means the samples will be combined. There is a section in the docs (https://iguide.readthedocs.io/en/latest/usage.html#the-three-s-s) that covers the terminology used here for specimen and sample. A specimen is a single source of data, like a DNA collection. A sample is made from that specimen and can be replicated, such that each replicate is considered a different sample when doing the iGUIDE protocol. Each sample has its own identifier (iGSP0015-neg-1), and during the analysis the samples are combined as independent assessments of a single specimen (iGSP0015).

When pooling samples for a single specimen, identification of unique molecular identifiers (UMIs) is independent per sample. This means something a little different for each quantitative approach. For read-based quantification, reads are simply added together for identified sites from samples of the same specimen. For UMI-based quantification, UMIs are collapsed within a sample and treated as independent counts, even if the same UMI sequence was found in a different sample, and then added together for a given site. Fragment / length-based quantification, which counts the number of unique lengths of DNA observed, works the same way: unique lengths are collapsed within each sample, and the per-sample counts are then added when pooling the information for a specimen.
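To make the read-based and UMI-based cases concrete, here is a minimal Python sketch. It is not taken from the iGUIDE code base; the record layout, the site name `chr1:1000`, the UMI sequences, and the second replicate `iGSP0015-neg-2` are all made up for illustration.

```python
from collections import Counter

# Hypothetical per-read records for one specimen: (sample_id, site, umi).
reads = [
    ("iGSP0015-neg-1", "chr1:1000", "AAAC"),
    ("iGSP0015-neg-1", "chr1:1000", "AAAC"),  # same UMI, same sample: collapses
    ("iGSP0015-neg-1", "chr1:1000", "GGTT"),
    ("iGSP0015-neg-2", "chr1:1000", "AAAC"),  # same UMI, other sample: kept
    ("iGSP0015-neg-2", "chr1:1000", "CCGA"),
]

def pool_read_counts(records):
    """Read-based: every read counts, summed over samples, per site."""
    return Counter(site for _sample, site, _umi in records)

def pool_umi_counts(records):
    """UMI-based: collapse UMIs within each sample, then sum per site."""
    unique = {(sample, site, umi) for sample, site, umi in records}
    return Counter(site for _sample, site, _umi in unique)

read_counts = pool_read_counts(reads)
umi_counts = pool_umi_counts(reads)
print(read_counts["chr1:1000"])  # 5 reads in total
print(umi_counts["chr1:1000"])   # 4 UMIs: 2 from each sample
```

The UMI `AAAC` is counted once per sample, so it contributes 2 to the pooled total even though the sequence is identical.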

This last quantification method is the simplest to walk through as an example. Quantification is done by counting the unique lengths of DNA fragments observed. Let's say that sample 1 and sample 2 are replicate samples of specimen A. For one of our on-target sites, we observe aligned DNA lengths of 25, 30, 30, and 45 for sample 1, and 30, 45, 50, and 55 for sample 2. (This is a toy example; typically you have many more.) For sample 1, there are 3 unique lengths (25, 30, and 45), so it is counted as 3 fragments. For sample 2, there are 4 unique lengths (30, 45, 50, and 55), so it is counted as 4 fragments. When pooling, we treat the two samples as independent observations of the original specimen, 3 + 4 = 7, so we would say the site has 7 fragment counts. Yes, the length 30 was identified in both samples (as was 45), but following the protocol, the samples were created prior to any amplification, so it is likely that these were two separate pieces of DNA that went into two different tubes and coincidentally had the same length.
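The same toy example as a small Python sketch (illustrative only, not iGUIDE's actual implementation; the dictionary layout and function name are assumptions). It also shows what the count would be if lengths were collapsed across samples instead, which is not what pooling does here.

```python
# Aligned fragment lengths observed at one site, keyed by sample.
fragment_lengths = {
    "sample_1": [25, 30, 30, 45],
    "sample_2": [30, 45, 50, 55],
}

def pooled_fragment_count(lengths_by_sample):
    """Collapse unique lengths within each sample, then add the counts."""
    return sum(len(set(lengths)) for lengths in lengths_by_sample.values())

per_sample = {s: len(set(v)) for s, v in fragment_lengths.items()}
pooled = pooled_fragment_count(fragment_lengths)

# For contrast only: collapsing across samples would merge the shared lengths.
collapsed_across = len({n for v in fragment_lengths.values() for n in v})

print(per_sample)        # {'sample_1': 3, 'sample_2': 4}
print(pooled)            # 7
print(collapsed_across)  # 5
```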

Let me know if this helps.

iamjli-arsenal commented 9 months ago

Very helpful, thank you!