LiuLabUB / HMMRATAC

HMMRATAC peak caller for ATAC-seq data
GNU General Public License v3.0
99 stars 23 forks source link

Q: Peaks are too long and peak number is low #85

Open WJH58 opened 3 years ago

WJH58 commented 3 years ago

Describe the problem Hi thanks for this tool! I got very low number (~200) of peaks for each sample by hmmratac compared with the number of those peaks (~2000) called by macs2.

What solutions I tried I tried more stringent -u and -l (e.g. l=25, u=40; l=25, u=50; l=25, u=60; l=25, u=100) however I still only get roughly 200 peaks. When -u exceeds 50, the training set will fail. Therefore, I sticked to l=25, u=40 and l=25, u=50.

This is one of the command line I tried:

HMMRATAC --bedgraph true -Xmx46G -b temp/sort/KO1.sorted.bam -i temp/sort/KO1.sorted.bam.bai -g genome.info         -o KO1 -l 25 -u 40 --window 25000 -z 60  

This is one of my log file:

Arguments Used:
--bedgraph      true
-b      temp/sort/KO1.sorted.bam
-i      temp/sort/KO1.sorted.bam.bai
-g      genome.info
-o      KO1
-l      25
-u      40
--window        25000
Fragment Expectation Maximum Done
Mean    50.0    StdDevs 20.0
Mean    198.1872335887453       StdDevs 35.03023353787695
Mean    355.31477875971484      StdDevs 48.17917744848093
Mean    627.1925932441502       StdDevs 75.85766273729656
ScalingFactor   18.001187
Training Regions found and Zscore regions for exclusion found
Training Fragment Pileup completed
Kmeans Model:
HMM with 3 state(s)

State 0
  Pi: 0.3333333333333333
  Aij: 0.333 0.333 0.333
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.016 0.057 0.056 0.005 ]

State 1
  Pi: 0.3333333333333333
  Aij: 0.333 0.333 0.333
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.139 0.521 0.513 0.068 ]

State 2
  Pi: 0.3333333333333333
  Aij: 0.333 0.333 0.333
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.518 1.908 2.03 0.335 ]

Model created and refined. See KO1.model
Model:
HMM with 3 state(s)

State 0
  Pi: 0.3333333333333333
  Aij: 0.987 0.008 0.005
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0 0 0 0 ]

State 1
  Pi: 0.3333333333333333
  Aij: 0.036 0.948 0.016
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.063 0.262 0.049 0 ]

State 2
  Pi: 0.3333333333333333
  Aij: 0.01 0.007 0.983
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.078 0.273 0.371 0.044 ]

I think Baum-Welch algorithm worked because Aij are not 0.333, and signals of state 2 are higher than state 1. However, I only got 197 peaks (including 85 high coverage peaks). Some peaks are very long, consist of many small peaks. This is one of the peak called by hmmratac: KO1_peak

Could you please help me solve this problem?

EvanTarbell commented 3 years ago

This seems like a threshold problem. It might be that for this dataset, the default thresholds for reporting peaks might be too stringent (for both HMMRATAC and MACS). Try lowering the threshold using the -threshold option. The current default is 30, but you might want to try 20 or 10. This should increase the number of peaks called. The model seems fine, as you pointed out, so you wont need to play with -u / -l anymore.

WJH58 commented 3 years ago

Thank you for your quick response! I set -q to 10 and I got 335 peaks for KO1 and 281 peaks for KO2. Compared with previous results, these peak number increased a bit but still low. I wonder why is one peak so long?

Screenshot 2021-04-27 at 21 29 02
SergioRodLla commented 10 months ago

Hi @WJH58, I wonder if you have gained more insights regarding this. How did you generate the bigwig file you used for IGV? Is this better than using the gappedPeak file that HMMRATAC outputs? Thank you