CL-CHEN-Lab / RepliSeq

R package for the analysis of Repli-Seq data to study DNA replication timing program: From count matrices, data normalization to replication timing calculation.
GNU General Public License v3.0
7 stars 2 forks source link

Does URI calculation assume that the global replication timing is the same? #2

Open GMFranceschini opened 4 months ago

GMFranceschini commented 4 months ago

Dear developers, I am experimenting with the package to analyze some data. I noticed a great difference between using the difference of time computed in two conditions (Treated - Control) and the URI computed with the corresponding function on the same data. I checked the code: if I get this correctly, it normalizes by the average replication timing within the sample and then makes the comparison. If everything goes slower or faster in a condition, will that signal be visible through URI? Thank you

Gian

SamiLhll commented 4 months ago

Dear Gian,

Thank you for your question.

No it doesn't assume that you have the same replication timing in both conditions. We initially implemented the URI to quantify the effect of Aphidicolin on human lymphocytes, which slows the progression of DNA polymerases and thus perturbs globally the replication timing. It is computed as a Z-score.

I hope that it helps. Don't hesitate if we can help with anything else. Best, Sami

GMFranceschini commented 4 months ago

Thank you Sami, I was asking because I observed very different results using the two strategies to compute the difference in replication timing. So, is the z-score computed internally in a sample, and are the z-scores compared between the two samples? I might stick to simply using the difference between the S50 values, as it seems more straightforward to interpret - is there any reason to avoid doing that? I appreciate your help

SamiLhll commented 4 months ago

Yes I can imagine how different these results can be. The URI is calculated as the Z-score of the overall difference between the two conditions (the sum of all fractions of the S-phase in each case). It was implemented to identify the most perturbed regions of the genome upon Aphidicolin (so putting the focus on outliers when computing the difference).

I would say that you won't address the exact same questions with the two methods but I don't see any reason why you should avoid comparing S50 values as long as your data allows you to have reliable results this way (which might need more context to truly state on that point)

I'm sorry if that sounds foggy, and don't hesitate to add more context if needed. I'll do my best to try to see how this package can help you to process your data.

GMFranceschini commented 4 months ago

Thank you a lot for your input. I will ask you for two minor clarifications:

SamiLhll commented 4 months ago

Yes you should normalize every fractions by the total read count. I used to do it as following : 1 - Compute the sum of reads in each 50kb windows using the sum() function of R 2 - Compute for each fraction a ratio with a fixed and realistic value. 3 - Use these ratios in Repliseq::normalizeRS() function.

When we designed the Repliseq experiments using 6 different timepoints during the S phase, we named the data G1/S1, S2, ..., S6/G2/M with S1 being the earlier timepoint in the cell cycle.

To finish on the usage of Repliseq to normalize, when you have a ratio for each fraction, then you provide these ratios as a list to the function normalizeRS.

So you'll have the following :


#### load data ---
temp_paths <- c("../inst/extdata/NT_chr22-s1.bdg","../inst/extdata/NT_chr22-s2.bdg",
                "../inst/extdata/NT_chr22-s3.bdg","../inst/extdata/NT_chr22-s4.bdg",
                "../inst/extdata/NT_chr22-s5.bdg","../inst/extdata/NT_chr22-s6.bdg")
temp_fractions <- c("S1","S2","S3","S4","S5","S6")

raw_repliseq_assay <- readRS(temp_paths,temp_fractions)

#### normalize by total read count ---

Total <- 10000000 # Use 10M for example
ratios <- raw_repliseq_assay %>%
    summarise(S1 = sum(S1) / Total, S2 = sum(S2) / Total, S3 = sum(S3) / Total,
S4 = sum(S4) / Total, S5 = sum(S5) / Total, S6 = sum(S6) / Total)

normalized_RS_assay <- normalizeRS(raw_repliseq_assay, ratios)

Best,

Sami