TheJacksonLaboratory / diachromatic

Diachromatic is a Java application for preprocessing and quality control of Hi-C and CHi-C data.
https://diachromatic.readthedocs.io/en/latest/
GNU General Public License v3.0

FSI #107

Closed pnrobinson closed 5 years ago

pnrobinson commented 5 years ago

Given this huge number, we reasoned that it is very unlikely that the same cross-ligation event occurs twice. Therefore, we defined the fraction of singleton interactions as the ratio of singleton read pairs and all read pairs.

It is unclear what we gain from this definition. A large FSI is bad, but how much does this depend on sequencing depth? How do we use this number to interpret the experiment?

hansenp commented 5 years ago

Yes, the FSI is intended to reflect the amount of cross-ligation between dangling ends of unrelated protein-DNA complexes. I think the commonly used ratio of cis to trans read pairs is intended to reflect the same thing.

This also relates to the required sequencing depth. The greater the proportion of artifactual cross-ligation read pairs, the more reads have to be sequenced in order to get the same number of valid pairs.
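
As a back-of-the-envelope sketch of this relationship, assume a simplified model in which a fixed fraction of all sequenced read pairs is artifactual and every remaining pair counts as valid. The function name and the numbers below are hypothetical and only illustrate the scaling; real libraries lose further pairs to mapping and duplicate filtering.

```python
def read_pairs_required(target_valid_pairs, artifact_fraction):
    """Read pairs to sequence so that target_valid_pairs remain valid.

    Assumes a fraction artifact_fraction of all pairs is artifactual
    and every other pair is valid (a deliberate simplification).
    """
    if not 0 <= artifact_fraction < 1:
        raise ValueError("artifact_fraction must be in [0, 1)")
    return target_valid_pairs / (1 - artifact_fraction)


# Two hypothetical libraries aiming for the same 100 million valid pairs.
clean = read_pairs_required(100_000_000, 0.2)
noisy = read_pairs_required(100_000_000, 0.5)
print(f"{clean:,.0f} vs. {noisy:,.0f} read pairs")
```

Under this model the required depth grows as 1 / (1 - f), so it rises steeply as the artifactual fraction f approaches 1.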

hansenp commented 5 years ago

I revised the text that tries to explain the FSI. Now it looks as follows:

[image of the revised text, not reproduced here]

@pnrobinson: Is this clearer now? If so, please close this issue.

pnrobinson commented 5 years ago

Revised like this

Proportion of singleton interactions (PSI)

The ratio of the numbers of trans and cis read pairs is taken as an indicator of poor Hi-C libraries that contain many chimeric fragments arising from cross-ligation events between unrelated protein-DNA complexes (Wingett 2015, Nagano 2015). The :ref:`align subcommand <rstalign>` of Diachromatic calculates the CLC, which is equivalent to the trans/cis ratio and is defined as the proportion of trans read pairs amongst all uniquely mapped unique pairs. However, the trans/cis ratio as a quality measure may also depend on other factors such as the genome size and the number of chromosomes of the analyzed species (Wingett 2015). Diachromatic therefore provides an alternative and possibly more robust quality metric that can also be used to assess the extent of cross-ligation.
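
To make the relation between the CLC and the trans/cis ratio concrete, here is a minimal sketch. It is not Diachromatic's code: the function name and the read pair counts are hypothetical, and the counts are assumed to refer to uniquely mapped, deduplicated pairs.

```python
def cross_ligation_coefficient(n_trans, n_cis):
    """Proportion of trans read pairs amongst all (cis + trans) pairs."""
    total = n_trans + n_cis
    if total == 0:
        raise ValueError("no read pairs given")
    return n_trans / total


# Hypothetical counts of uniquely mapped, deduplicated read pairs.
n_trans, n_cis = 30_000_000, 70_000_000
clc = cross_ligation_coefficient(n_trans, n_cis)

# The trans/cis ratio r carries the same information: CLC = r / (1 + r).
ratio = n_trans / n_cis
clc_from_ratio = ratio / (1 + ratio)
print(clc, clc_from_ratio)
```

Because r / (1 + r) increases monotonically with r, ranking libraries by CLC or by the trans/cis ratio gives the same order.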

Amongst the trans read pairs, we generally observe a large proportion of restriction digest pairs that occur only once in the entire dataset. The number of all possible different cross-ligation events (including cis and trans) can roughly be estimated as the square of the number of all restriction digests across the entire genome. Given this huge number, we reasoned that it is very unlikely that the same artefactual cross-ligation event occurs twice by chance, and correspondingly hypothesize that cross-ligation events primarily result in interactions (or digest pairs) with only one read pair. Therefore, we defined the proportion of singleton interactions (PSI) as the proportion of interactions with only one read pair amongst all interactions.
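
The definition can be illustrated with a small sketch. It is not taken from Diachromatic: the function name and the digest identifiers are hypothetical, and each read pair is assumed to be already assigned to a pair of restriction digests.

```python
from collections import Counter


def proportion_of_singleton_interactions(read_pairs):
    """PSI for an iterable of (digest_a, digest_b) read pairs."""
    # Sort the digest IDs so that (a, b) and (b, a) count as one interaction.
    interactions = Counter(tuple(sorted(pair)) for pair in read_pairs)
    if not interactions:
        raise ValueError("no interactions given")
    # A singleton interaction is supported by exactly one read pair.
    singletons = sum(1 for n in interactions.values() if n == 1)
    return singletons / len(interactions)


# Toy data with hypothetical digest IDs: four interactions, three singletons.
pairs = [("d1", "d2"), ("d2", "d1"), ("d1", "d3"), ("d4", "d9"), ("d5", "d7")]
psi = proportion_of_singleton_interactions(pairs)
print(psi)  # 0.75
```

Note that the denominator is the number of distinct interactions, not the number of read pairs.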

We typically observe very high PSI values of around 90%. However, not all of these interactions are necessarily the result of cross-ligation events. There might be other factors that contribute singleton interactions, such as occasional non-functional contacts arising from spatial proximity.

The specific PSI value depends on the restriction enzyme (for instance, there are many more digests with four-cutters than with six-cutters) and on the experimental conditions (for instance, we would expect lower values of PSI with capture Hi-C than with Hi-C because of the enrichment step). It is impossible to observe whether a singleton corresponds to cross-ligation or to a rare true interaction. However, for the same set of experimental parameters, if one experiment shows an unusually high PSI, this may be a sign of problems in the construction of the Hi-C library.
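
For a rough sense of the difference between four-cutters and six-cutters, the sketch below assumes a random sequence with uniform base composition, in which a recognition site of length k occurs about once every 4^k bp. The genome size of 3.1 Gb and the function name are assumptions for illustration; digest counts for a real genome and enzyme will differ.

```python
def expected_digests(genome_size_bp, site_length):
    """Rough digest count, assuming one cut site every 4**site_length bp."""
    return genome_size_bp // 4 ** site_length


GENOME_SIZE_BP = 3_100_000_000  # assumed genome size, for illustration only

four_cutter = expected_digests(GENOME_SIZE_BP, 4)
six_cutter = expected_digests(GENOME_SIZE_BP, 6)

# Possible digest pairs, estimated as the square of the digest count.
for name, n in (("four-cutter", four_cutter), ("six-cutter", six_cutter)):
    print(f"{name}: {n:,} digests, about {n ** 2:.1e} possible pairs")
```

Under these assumptions a four-cutter yields about 16 times as many digests as a six-cutter, and therefore about 256 times as many possible digest pairs.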

hansenp commented 5 years ago

90% was for the CTCF depletion Hi-C data. For the capture Hi-C data of Mifsud et al. 2015 there are even higher values between 94% and 98%.

The specific PSI value depends on the restriction enzyme (for instance, there are many more digests with four-cutters than with six-cutters) and on the experimental conditions (for instance, we would expect lower values of PSI with capture Hi-C than with Hi-C because of the enrichment step).

At least for the data of Mifsud et al. 2015, the claim about capture Hi-C proved to be false. The other claim, about four-cutters, still needs to be verified.

It is impossible to observe whether a singleton corresponds to cross ligation or to a rare true interaction. However, for the same set of experimental parameters, if one experiment shows an unusually high PSI, this may be a sign of problems in the construction of the Hi-C library.

I agree, but I would also point out that cross-ligation might not be the only source of singletons. How about this:

In some cases, we observed very high PSI values, from around 90% for Hi-C up to 95% for capture Hi-C data. However, not all of these interactions are necessarily the result of cross-ligation events. There might be other factors that contribute singleton interactions, such as occasional non-functional contacts arising from spatial proximity. Furthermore, it is impossible to observe whether a singleton corresponds to cross-ligation or to a rare genuine interaction. However, for the same set of experimental parameters, if one experiment shows an unusually high PSI, this may be a sign of problems in the construction of the Hi-C library.

However, we definitely need to analyze more datasets in order to verify the stated percentages.