In my data, the results show significantly different peak numbers across all samples. Why did that happen, and what should I do about it? Also, following Troubleshooting part 5 I checked my model, but the thing I don't understand is the emission parameters: what do they mean, and which numbers in the log file are they? Finally, can I use different -u and -l parameters to run each sample? If so, can I still compare the open chromatin regions of different samples?
HMMRATAC, by default, reports all peaks that match the model. We typically perform post-analysis filtering to eliminate the weaker peaks, which are likely driving the differences you are seeing between samples. Step 5 of the quick start guide or Part 6 of the troubleshooting section explains an easy way to do that. I'd recommend starting with a threshold of 10 (assuming the default scoring system, i.e. "max"). Keep in mind that different datasets may require different settings, but that is the case for any algorithm. It's actually a misconception that one threshold will apply to all datasets.
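For concreteness, a minimal filtering sketch along the lines of what the guide describes. This assumes the peak score lands in column 13 of the gappedPeak output; check your version's files before relying on that:

```
# Keep only peaks scoring at least 10 (column 13 assumed to hold the
# "max" score in HMMRATAC's gappedPeak output).
awk -v OFS="\t" '$13 >= 10' NAME_peaks.gappedPeak > NAME.filtered.gappedPeak
```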
As for the emission parameters, these represent the means of the four signals used in the model. HMMRATAC separates the ATAC-seq signal into 4 distinct signals representing the reads enriched for open regions, mono-, di- and tri-nucleosomes. It is the pattern of these four signals that the method learns and uses to identify the peaks. As I said above, all peaks that match the model will be reported, even if they have low coverage, hence the post-analysis filtering.
As for using different -u and -l settings, yes, that could also work. Generally, the higher the range, the more stringent the model becomes, which in turn means stronger peaks. Performing the filtering outlined above gives similar behavior to increasing the -u and -l range, and is faster and easier. I.e., the higher the threshold, the higher the precision, just as with increasing the -u and -l range.
Thanks! I forgot to mention that one sample has 85,236 peaks, another has 428,479 peaks, and another has 37,679 peaks. That's a really big difference even after filtering with score > 10, so is that common? Should I use a different threshold to filter each sample? And how do I decide the threshold, just by trying increasing values? By the way, are the emission parameters the means 50.0, 126.22896007338085, 382.53888001030833 and 710.6897656984065, or the means in the [ ] brackets? Troubleshooting part 5 says the open state should have the highest emission parameters, the nucleosome state the second highest, and the background the lowest, but the means in my model seem to go the opposite way. Does my model look good? Thanks a lot for the help!!
I attached my output here:

```
Fragment Expectation Maximum Done
Mean 50.0                StdDevs 20.0
Mean 126.22896007338085  StdDevs 4.4991288974337085
Mean 382.53888001030833  StdDevs 114.65646070116397
Mean 710.6897656984065   StdDevs 78.58295883523067
Training Regions found and Zscore regions for exclusion found
Training Fragment Pileup completed

Kmeans Model:
HMM with 3 state(s)

State 0
Pi: 0.3333333333333333
Aij: 0.333 0.333 0.333
Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.041 0.008 0.185 0.33 ]

State 1
Pi: 0.3333333333333333
Aij: 0.333 0.333 0.333
Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.287 0.021 2.362 1.117 ]

State 2
Pi: 0.3333333333333333
Aij: 0.333 0.333 0.333
Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.53 0.053 4.312 2.655 ]

Model created and refined. See 10DAP_merge.model

Model: HMM with 3 state(s)

State 0
Pi: 0.3333333333333333
Aij: 0.333 0.333 0.333
Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.041 0.008 0.185 0.33 ]

State 1
Pi: 0.3333333333333333
Aij: 0.333 0.333 0.333
Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.287 0.021 2.362 1.117 ]

State 2
Pi: 0.3333333333333333
Aij: 0.333 0.333 0.333
Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.53 0.053 4.312 2.655 ]
```
OK, so you actually have a different problem. The samples with such a low number of peaks appear to have bad models. This is discussed, at least partly, in Troubleshooting part 3. Let me explain what happened.
HMMRATAC first creates a model using KMeans. This model is just the initial one, which is then used as the starting point for training; its initial and transition estimates are set to 1/n, where n is the number of states (3 by default). The final model is produced with the Baum-Welch algorithm. Although we keep the initial parameters proportional, Baum-Welch is supposed to update the transition parameters. When Baum-Welch fails, previous versions of HMMRATAC would simply report the bad model and then crash out. The current behavior is to use the initial KMeans model to call peaks, to give the user at least some result. You can tell that BW failed because the Aij parameters are all 0.333 (these are the transition probabilities of going from state i to state j). My guess is that the run that only produced 37k peaks has this, and possibly the one that produced ~85k. In this case you have to re-run HMMRATAC with stricter -u and -l settings. I recommend -u 40 and -l 20. This will identify stronger training peaks and help alleviate the problems Baum-Welch is facing.
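For example, a re-run for one of the failing samples might look like the following. This is only a sketch: the jar file name and input file names are placeholders for your own, with -b/-i/-g/-o being HMMRATAC's usual BAM, BAM index, genome-size file and output-prefix options.

```
# Re-run with stricter training-region thresholds (-u upper, -l lower
# fold-change bounds used to select training regions).
java -jar HMMRATAC_V1.2.10_exe.jar \
  -b sample3.sorted.bam \
  -i sample3.sorted.bam.bai \
  -g genome.info \
  -u 40 -l 20 \
  -o sample3_u40_l20
```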
As to the other question, the means you are referring to are different things. The Mean/StdDevs values are the means of the fragment EM: in order to separate the ATAC-seq signal into the four tracks HMMRATAC uses, we first need to find the ideal parameters for deciding which fragments belong to which track (a simplified explanation), and those are the fragment-size means. The means listed with the model, for example from State 2 (Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.53 0.053 4.312 2.655 ]), are the emission means to judge the model by, and those are what I was referring to in the previous post.
I hope this helps. Let me know if changing -u and -l results in better models, or if you have any other questions. I'll also update the guide to include some of this info and revise the part about checking the model to account for the new behavior, to help people in the future.
I changed to -u 40 -l 20, but the peak numbers still vary a lot. I checked the log files, which show normal transition probabilities except for one sample (sample 1), which still shows 0.333 0.333 0.333; I am not sure what is going on with that sample. To be clear, both sets of parameters gave normal transition probabilities in all samples except sample 1. Here are the peak numbers from the Name.gappedPeak files:

```
           -u 40 -l 20    default -u/-l
sample1    3,238          85,236
sample2    569,330        566,570
sample3    41,917         37,679
sample4    56,945         271,986
sample5    404,211        428,479
```

I am not sure what I should do in this situation. I can do further filtering on these results, but the thresholds would have to differ per sample to get consistent peak numbers. Furthermore, in the Name.gappedPeak files the peak numbers are not consecutive. In MACS2 they run like peak1, peak2, peak3, ..., but in my files they look like peak1, peak10, peak12, ..., as if some peaks got filtered out somehow. Why did that happen? By the way, how many peaks do you usually get, and how many peaks should I keep as "real" open chromatin regions?
So for your sample 1, you can try some other settings for -u and -l to get a valid model. Options include going even higher, say -u 50 and -l 30, or extending the range, say -u 40 and -l 10. It is also possible to run your sample 1 data with a model created from a different dataset, although I would try re-running it first.
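If you do try the borrowed-model route: HMMRATAC writes the refined model to a .model file (the 10DAP_merge.model in your log), and as I recall it can be fed back in with the --model option. Treat the exact flag and file names in this sketch as something to verify against your version:

```
# Call peaks on sample1 using the model refined on another sample,
# skipping model training for sample1 itself.
java -jar HMMRATAC_V1.2.10_exe.jar \
  -b sample1.sorted.bam \
  -i sample1.sorted.bam.bai \
  -g genome.info \
  --model sample2.model \
  -o sample1_with_sample2_model
```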
As for the number of peaks, it depends on the sample prep, the number of reads sequenced and the depth of sequencing. Every sample is always a little different, so needing a different threshold for each sample is very common. As I said, it's a misconception that one threshold will work with all data, all the time.
Now there are a number of things you can do. If you want the same number of peaks from each sample, an easy way is to sort the peaks by score and take the top N peaks from each sample; that gives you consistent peak numbers across samples. Another way is to use the same threshold everywhere, choosing one that results in similar numbers; it is doubtful you will ever get identical counts, but a threshold of, say, 20 could give similar numbers. Finally, you could perform a cutoff analysis (similar to what MACS2 does with its --cutoff-analysis option): filter the peak file at increasing cutoffs (say 0, 5, 10, 15, 20, 25 ... 50), record the peak numbers (or total peak length, average peak length, etc.), plot peak number versus threshold in R or Excel, and find the elbow, the point where the curve begins to saturate; choose that as the best threshold for that dataset. Again, this may not give identical peak numbers, but that is still common. The first and third approaches are sketched below.
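A rough shell sketch of the top-N and cutoff-analysis ideas, again assuming the score sits in column 13 of the gappedPeak files; adjust file names and the N to your data:

```
# Top-N: keep the 50,000 highest-scoring peaks from each sample so every
# sample contributes the same number of peaks.
for f in sample*.gappedPeak; do
  sort -k13,13gr "$f" | head -n 50000 > "${f%.gappedPeak}.top50k.gappedPeak"
done

# Cutoff analysis: count surviving peaks at increasing thresholds and
# write a (threshold, peak count) table to plot in R or Excel; pick the
# threshold at the elbow of the curve.
for t in 0 5 10 15 20 25 30 35 40 45 50; do
  n=$(awk -v t="$t" '$13 >= t' sample1_peaks.gappedPeak | wc -l)
  printf "%d\t%d\n" "$t" "$n"
done > sample1_cutoff_analysis.tsv
```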
I extended the range to -u 40 -l 10. This gives better results, and the BW parameters are no longer all 0.333. Why does this setting give better results than -u 50 -l 30? Can you give a brief, easy-to-understand explanation?
What happens when BW fails is that there isn't enough separation between the states of the model. Let's imagine a simpler model, one with three states (like HMMRATAC) but only one emission value (whereas HMMRATAC has 4). If the mean emission value for state one is 1, for state two is 2 and for state three is 3, then imagine trying to classify an observation of 2.5. The model would assign equal probability to state 2 and state 3, and therefore it can't classify that observation. That is generally what happens when BW fails in HMMRATAC. Different -u and -l options identify different training regions, ideally ones where the difference between states is clear and sharp. So what could be happening is that with -u 50 -l 30 you may not have many training regions where that separation is clear, while -u 40 and -l 10 may simply be identifying better training regions for HMMRATAC to build the model from.
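To make that toy example concrete, here is a quick awk check with made-up numbers (unit standard deviations, the means 2 and 3 from above): the observation 2.5 gets exactly the same Gaussian likelihood under both states, so neither state can be preferred.

```
# Density of x = 2.5 under N(mean, 1) for means 2 and 3: both print 0.3521,
# so the observation is unclassifiable between the two states.
awk 'BEGIN {
  x = 2.5; sd = 1
  for (mu = 2; mu <= 3; mu++) {
    p = exp(-((x - mu) ^ 2) / (2 * sd ^ 2)) / (sd * sqrt(2 * 3.141592653589793))
    printf "mean %d: density %.4f\n", mu, p
  }
}'
```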