ajank / Romulus

Romulus: Robust multi-state identification of transcription factor binding sites from DNase-seq data

Romulus with ATAC-seq data #1

Closed PFRoux closed 8 years ago

PFRoux commented 8 years ago

Dear Aleksander,

I would like to know if you have ever tried to use Romulus on ATAC-seq data? If so, do you have any advice for analyzing this type of data? Do I need, for example, to shift the reads before counting the cutting events?

Thanks a lot,

Best,

Pierre-François

ajank commented 8 years ago

Hi Pierre-François,

great to hear that you are interested in Romulus. I have not tried it on ATAC-seq data yet, but the method should work fine with it. In principle, there should be no need to shift or trim the reads, as long as you provide forward and reverse strand counts separately. If the window size is big enough (a 200 bp margin around the motif seems reasonable), then the algorithm should capture the shape arising from the fragment-length distribution. You might also consider using only the reads with fragment length < 150 bp, to exclude the nucleosome signal from the analysis.
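For the fragment-length filter, something along these lines would do (a sketch, not part of the package; `frags` and its columns are hypothetical, e.g. derived from the BAM TLEN field):

```r
# Sketch (not from the Romulus package): keeping only sub-nucleosomal
# fragments before counting cut sites. `frags` is a hypothetical data
# frame of aligned fragments with a fragment-length column.
frags <- data.frame(chrom  = c("chr1", "chr1", "chr2"),
                    start  = c(100, 500, 900),
                    length = c(80, 210, 120))

# keep fragments shorter than 150 bp to exclude the nucleosome signal
sub.nuc <- frags[frags$length < 150, ]
```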

Please let me know if you encounter any problems. Good luck!

Best regards,

Alek

PFRoux commented 8 years ago

Hi Aleksander,

Thanks a lot for your answer. Do you have R scripts dedicated to generating the strand-specific count matrices?

Best,

Pierre-François

PFRoux commented 8 years ago

Hi Aleksander,

I managed to use Romulus on my ATAC-seq data, but I have a couple of questions that I wasn't able to address by re-reading the paper.

1) Regarding the count matrices. In the paper you say: "The forward strand cuts are taken only upstream and within the candidate binding site, while the reverse strand cuts only within the candidate binding site and downstream. In this study, we consider a 200 bp margin." Just to make sure I understand it correctly: for a given motif instance, there is an overlap between the two count vectors in the region corresponding to the motif, while in the 200 bp on each side of the motif the counts are specific to either the "+" vector or the "-" vector. Am I right?

2) I am a little bit confused about the prior you are using in the model. Could you be a little more specific and give me some advice? Is considering only the PWM score given by HOMER enough? Or should I include additional priors?

3) Regarding the number of bound states. Does it sound reasonable to you to consider only one bound state for each motif I study, since, as you explain in the paper, considering additional binding modes does not improve the prediction of individual TF binding sites?

4) Regarding the output of Romulus. What statistics should I consider to generate binary calls (bound / unbound)? Is using the posterior probability enough? What threshold should I use? Do I have to correct these probabilities for multiple testing?

5) Regarding the Binding in Closed Chromatin score. Could you be a little more specific on how to compute this score?

6) How should I handle comparisons between different conditions? I am working on a time-course experiment with 6 time points, so I have 6 x 2 ATAC-seq datasets (2 biological replicates per time point). Is there a way to "train" Romulus simultaneously on the whole dataset and then assign posterior probabilities independently for each time point? If so, how do I achieve this, and how should I format the data?

Thanks a lot and sorry to bother you with so many questions.

Best,

Pierre-François

PFRoux commented 8 years ago

Hi Aleksander,

In the meantime, I've been working quite a lot on the output of Romulus, scanning my dataset for the whole JASPAR core eukaryote motif collection.

Now that I have a little bit more "experience" regarding Romulus, I would like to add a couple of questions.

7) In your opinion, what is the best motif database to use? In your paper you used the scans available on the HOMER website. Since that collection is limited to around 200 motifs, why didn't you generate your own scans? It's probably a stupid question.

8) After running Romulus on my dataset for a set of motif occurrences, I've noticed that, in some cases, Romulus gives a low probability (e.g. around 1e-5) for the "bound" state for occurrences where the number of Tn5 insertion events is really low, or sometimes even equal to zero. How is that possible? Is a 1e-5 threshold still too lenient?

9) Trying to take advantage of the biological replicates I have for each of my time points, I've been thinking about including an "Irreproducible Discovery Rate" (IDR) approach in my Romulus workflow: 1) running Romulus on a set of motif occurrences independently for each biological replicate (rather than merging the alignment files), then 2) running the IDR pipeline to keep only the reproducible "bound" occurrences. Have you ever tried such an approach?

Thanks a lot,

Best,

Pierre-François

ajank commented 8 years ago

Hi Pierre-François,

sorry for the delayed reply! Here are the answers to your questions:

1) You are correct about the count matrices. Please refer to the documentation of the provided example, ?NRSF.cuts (in PDF: http://www.mimuw.edu.pl/~ajank/Romulus/Romulus-manual.pdf). In the dataset NRSF.cuts, columns 1-419 correspond to forward strand cuts, while columns 420-838 correspond to reverse strand cuts at the same positions. Also see ?fitRomulus for an example of how the count matrices cuts1 and cuts2 used by the algorithm are extracted.
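As a rough illustration of that layout (a toy sketch, not package code; the margin, motif width and cut positions below are made up), a single row of such a count matrix could be assembled like this:

```r
# Toy sketch (not from the package): tabulating strand-specific cut
# counts for one candidate binding site, following the layout of
# NRSF.cuts (all forward-strand columns first, then all reverse-strand
# columns). Margin, motif width and cut positions are made up.
margin <- 200  # bp of margin around the motif
width  <- 20   # motif width in bp

# hypothetical cut positions relative to the motif start:
# forward-strand cuts from -margin to width - 1 (upstream + motif),
# reverse-strand cuts from 0 to width + margin - 1 (motif + downstream)
set.seed(1)
fwd.pos <- sample(-margin:(width - 1), 30, replace = TRUE)
rev.pos <- sample(0:(width + margin - 1), 30, replace = TRUE)

fwd.counts <- tabulate(fwd.pos + margin + 1, nbins = margin + width)
rev.counts <- tabulate(rev.pos + 1,          nbins = width + margin)

# one row of the combined count matrix: forward columns, then reverse
cuts.row <- c(fwd.counts, rev.counts)
```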

2) Concerning the priors: we used only the PWM scores as priors. We also thought about adding evolutionary conservation of the motif site, but did not follow up on this idea. Pique-Regi et al. (2011) included the distance to the closest TSS, as well as evolutionary conservation, as priors. Unfortunately, I found no information in their paper on how much these additional priors contributed to the predictive power.

3) Exactly, I would suggest considering only one bound state for each motif.

4) Indeed, use the posterior probability (component PostProb). These probabilities can be used to rank the motif instances; to obtain a binary classification (bound/unbound), just set a fixed threshold (e.g. 0.5). These are probabilities (derived from a fitted model), not p-values, hence there is no need to correct them for multiple testing.
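For instance (a sketch with made-up values; `p.bound` stands in for the bound-state column of `r.fit$PostProb`):

```r
# Sketch: ranking motif instances and making binary calls from the
# bound-state posterior probabilities (values are made up).
p.bound <- c(0.92, 0.31, 0.67, 0.05)

ranking <- order(p.bound, decreasing = TRUE)  # strongest candidates first
calls   <- p.bound > 0.5                      # fixed threshold, e.g. 0.5
```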

5) Following the example in ?fitRomulus, the Binding in Closed Chromatin score can be calculated as follows:

# extract the log-likelihoods from the chromatin state component only
# shift them to avoid numerical underflow, then take exp
lik <- exp(r.fit$LogLikelihood - apply(r.fit$LogLikelihood, 1, max))
# normalize the likelihoods to obtain the probabilities of the chromatin state component
csc <- lik / apply(lik, 1, sum)
# limit the scope to the bound motif instances according to ChIP-seq data (NRSF.anno$signalValue > 0)
# and take the cumulative distribution function of the probabilities of being bound
cdf <- ecdf(csc[, 1][NRSF.anno$signalValue > 0])
# plot the cdf
plot(cdf)
# calculate as the Area-Under-Curve of the above cdf
BCC <- integrate(cdf, 0, 1, subdivisions = 1000)$value

These calculations are made for the transcription factor NRSF in the K562 cell line, using Duke DNase-seq and matching ChIP-seq data from ENCODE. The plotted cdf is shown in Fig. 5, panel "NRSF K562" (blue line). The resulting BCC value of 0.424 is shown in Fig. 6, panel "Duke DNase", row NRSF (green circle).

6) If all the replicates have the same sequencing depth, I would suggest concatenating the data for all the replicates as follows. Assume you have N candidate binding sites; with, say, 3 time points of 2 replicates each, you would provide Romulus with 6N rows of input data. Of course the prior data will be repeated 6 times, but the ATAC-seq data will differ. Romulus will then learn a time-point-independent model and predict 6 binding probabilities (one per replicate) for each candidate binding site. You could then look for the cases where these binding probabilities change dramatically, remain stable, etc.
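The stacking itself is simple (a sketch with hypothetical objects; the count matrices and prior below are randomly generated stand-ins):

```r
# Sketch (hypothetical objects): stacking per-replicate count matrices
# for the same N candidate sites, so that one model is learned across
# replicates while each row still gets its own posterior probability.
N <- 4                                           # candidate binding sites
set.seed(1)
cuts.rep1 <- matrix(rpois(N * 10, 2), nrow = N)  # replicate 1 counts
cuts.rep2 <- matrix(rpois(N * 10, 2), nrow = N)  # replicate 2 counts
prior     <- matrix(runif(N), ncol = 1)          # PWM-score prior, N rows

cuts.all  <- rbind(cuts.rep1, cuts.rep2)  # 2N rows of ATAC-seq data
prior.all <- rbind(prior, prior)          # the prior is simply repeated

# after fitting, rows 1..N correspond to replicate 1 and
# rows N+1..2N to replicate 2 of the same candidate sites
```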

7) It is a very good question. :) For a new study, I would use TRANSFAC (if you have a licence) or, alternatively, JASPAR. In our case, we wanted our test results to be directly comparable to those of Piper et al. (2013), so we used the same collection of ChIP-seq peaks and the same motifs from HOMER. We also used the motif scans provided on the HOMER website; this way we could almost reproduce Table S1 of Piper et al. (2013), and the values we obtained are shown in our Supplementary Table S2. The small differences are caused by HOMER being constantly updated; unfortunately, no version history is available.

8) With a probabilistic method, it is unlikely to get a probability of being "bound" exactly equal to 0. An occurrence with a predicted "bound" probability around 1.0e-5 (if I got it right) is extremely unlikely to be bound. I would rather interpret any binding probability below 0.05 as indicative of a lack of binding.

9) It is great that you have biological replicates for your ATAC-seq data. I like your idea of applying the IDR pipeline to identify constitutively-bound sites. We haven't tried anything similar.

Hope that helps,

Alek

PFRoux commented 8 years ago

Hi Alek,

Thanks a lot for your answer.

Going back to the probabilities: I was confused because I thought they were p-values. I was therefore looking for the smallest values to consider a site as bound, which explains why I was focusing on "bound" sites with a probability as low as 1e-5.

What I still don't understand is that, despite this completely erroneous selection, I got results that make complete sense. Indeed, on the plots below you can see the average Tn5 insertion profiles around the sites considered as "bound" and "unbound" for the NF-kB subunit RELA using this erroneous thresholding (sites are considered bound if p <= 1e-5). With this selection, the bound sites have a characteristic footprint, while the "unbound" ones don't.

RELA_FALSE.pdf

When I use the right selection (sites are considered bound if p > 0.5), here is what I get. Now the "unbound" sites have the characteristic footprint, while the "bound" ones don't. I am sure I am missing something. I am using the second column of the posterior probability matrix, considering it to be related to the "bound" state. But maybe the probabilities for the "bound" state are actually stored in the first column? Or should I use (1 - p) of the second column to rank occurrences (a higher (1 - p) would mean a higher probability of being bound)? This would be the only explanation to me.

RELA_TRUE.pdf

Regarding the BCC, I now understand that I would need matched ChIP-seq data. Would it make sense to you to use ENCODE ChIP-seq data to compute the BCC based on ATAC-seq data coming from a completely different system (different cell lines, ...)? Or is there another way to compute a BCC-like index that does not require ChIP-seq?

Concerning the inclusion of an IDR step in the pipeline: despite the confusion I had with the probabilities reported by Romulus, I already managed to get interesting results. The behavior of the Romulus probabilities is quite strange and tends to make the IDR pretty stringent. But since my goal is to get really robust binding events, I am OK with that. If you are interested, I can share a little bit more about it with you.

I also take the opportunity of writing this message to ask you an additional question.

10) How did you manage to generate meta-profiles that are so smooth in the paper (Figure 1)?

Thanks again Alek,

Cheers,

Pierre-François

ajank commented 8 years ago

Hi Pierre-François,

> I am using the second column of the posterior probability matrix, considering it to be related to the "bound" states. But maybe the probability for the "bound" state are actually stored in the first column? This would be the only explanation to me.

Indeed, that is the case. In the paper, the unbound state is indexed as zero (Z_i = 0) and the bound states are indexed using positive integers. However, in R indexing starts from 1, so the unbound state was moved to the last column.
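Concretely (a toy illustration with made-up values, not package output):

```r
# Toy illustration of the column ordering: with one bound state,
# r.fit$PostProb has two columns, where column 1 is the bound state
# and the last column is the unbound state; each row sums to 1.
post.prob <- matrix(c(0.8, 0.2,
                      0.1, 0.9), nrow = 2, byrow = TRUE)
p.bound   <- post.prob[, 1]
p.unbound <- post.prob[, ncol(post.prob)]
```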

> Would it make sense to you to use ENCODE ChIP-seq data to compute the BCC based on ATAC-seq data coming from a completely different system (different cell lines, ...)? Or is there another way to compute a BCC-like index that does not require ChIP-seq?

It seems reasonable to use ChIP-seq data from a similar cell type (not necessarily from ENCODE; search the Sequence Read Archive as well). But the calculation of the BCC relies on external information about which motif instances are actually bound, and I do not see a reasonable substitute for it.

> 10) How did you manage to generate meta-profiles that are so smooth in the paper (Figure 1)?

In Fig. 1, the curves for Romulus were smoothed by replacing the fixed-value bins with a piecewise linear function. Following the example in ?fitRomulus, you may plot both the original and smoothed estimates as follows:

make_plot <- function(margin, width, profile1, profile2)
{
  # plot profile1 (forward strand) in black, profile2 (reverse strand) in red
  plot(-margin:(width - 1), profile1, type = "l",
    xlab = "Genomic location with respect to the candidate binding site (base pairs)",
    ylab = "Average DNase I cut density, according to the model of bound state",
    xlim = c(-margin, margin + width - 1),
    ylim = c(0, max(profile1, profile2)))
  lines(0:(margin + width - 1), profile2, type = "l", col = "red")
}

# plot the multinomial parameters
make_plot(NRSF.margin, NRSF.width, r.fit$Lambda1[[1]][r.fit$bins1[1, ]], r.fit$Lambda2[[1]][r.fit$bins2[1, ]])
# plot the multinomial parameters after smoothing
make_plot(NRSF.margin, NRSF.width, r.fit$LambdaReg1[1, ], r.fit$LambdaReg2[1, ])

If you get interesting results from the IDR pipeline, let me know by e-mail :)

Best,

Alek