LiuLabUB / HMMRATAC

HMMRATAC peak caller for ATAC-seq data
GNU General Public License v3.0
99 stars 23 forks source link

gappedPeak not represented all chromomsomes dispite of given full ATAC-seq library #62

Open El-Castor opened 4 years ago

El-Castor commented 4 years ago

Hi,

Use case I am using ATAC-seq data from plants materiels with 3 replicates. The data set was build on more or less 60K of cell in differents tissues.

Describe the problem First I launch the model building following this command:

HMMRATAC -Xmx180g -b ${bam} -i ${bamIndex} -g $chromSize -o ${outHMMdir}/${bamName} --minlen "200" --score $scoreType --modelonly True --window "250000" -u 20 -l 5 --maxTrain "500"

Here you can see the log file related to the behind command:

--minlen    200
--score zscore
--modelonly True
--window    250000
-u  20
-l  5
--maxTrain  500
Fragment Expectation Maximum Done
Mean    50.0    StdDevs 20.0
Mean    161.77155876278908  StdDevs 43.520086870689724
Mean    337.3076252975826   StdDevs 59.4745597886981
Mean    548.6062022449369   StdDevs 62.16760969798527
ScalingFactor   64.460541
Training Regions found and Zscore regions for exclusion found
Training Fragment Pileup completed
Kmeans Model:
HMM with 3 state(s)

State 0
  Pi: 0.3333333333333333
  Aij: 0,333 0,333 0,333
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0,168 0,268 0,254 0,137 ]

State 1
  Pi: 0.3333333333333333
  Aij: 0,333 0,333 0,333
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0,463 0,734 0,632 0,311 ]

State 2
  Pi: 0.3333333333333333
  Aij: 0,333 0,333 0,333
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 1,63 3,087 3,845 2,506 ]

Model created and refined. See PATH to my model
Model:
HMM with 3 state(s)

State 0
  Pi: 0.3333333333333333
  Aij: 0,978 0,02 0,002
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0,136 0,214 0,181 0,019 ]

State 1
  Pi: 0.3333333333333333
  Aij: 0,007 0,988 0,005
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0,174 0,281 0,27 0,169 ]

State 2
  Pi: 0.3333333333333333
  Aij: 0,002 0,017 0,981
  Opdf: Multi-variate Gaussian distribution --- Mean: [ 0,409 0,647 0,581 0,312 ]

Genome split and subtracted masked regions
0 round viterbi done
50 round viterbi done
100 round viterbi done
150 round viterbi done
200 round viterbi done
250 round viterbi done
300 round viterbi done
350 round viterbi done
400 round viterbi done
450 round viterbi done
500 round viterbi done
550 round viterbi done
600 round viterbi done
650 round viterbi done
700 round viterbi done
750 round viterbi done
800 round viterbi done
900 round viterbi done
1000 round viterbi done
1050 round viterbi done
1100 round viterbi done
1150 round viterbi done
1200 round viterbi done
1250 round viterbi done
1300 round viterbi done
1350 round viterbi done
1400 round viterbi done
1450 round viterbi done
1500 round viterbi done
1550 round viterbi done
1600 round viterbi done
1650 round viterbi done
1690 round viterbi done
Total time (seconds)=   3231

Then I launch the peaks calling following these command :

HMMRATAC -Xmx180g -b ${bam} -i ${bamIndex} -g $chromSize -o ${outHMMdir}/${bamName} --model $modelFilePath2 --minlen "200" --window "250000" --score $scoreType --bedgraph $bedGraphOption

My issues is that when I check my .gappedPeak file I have juste some peaks on the first part of the chromosome 10 despite the fact that when I check bedgraph produce I can see all the peaks related to all the chromosome.

More over when I split my bam by chromosome and I launch the same command it work. I have all chromosome in the gappedPeaks file.

Describe the solution you tried I have try to change the parameters -u and -l in the model building step (first command) but it not fix my issue.

Do you have any suggestion please ? Thanks in adavance,

EvanTarbell commented 4 years ago

Is it possible to load the gappedPeak and bedgraph files into a genome browser, along with the raw ATAC-seq signal, and send me a snapshot of chr10? And perhaps another snapshot in a smaller window around the region where the peak calls stop? I think i know whats happening but i want to make sure. In the meantime, you can check out the troubleshooting section of the HMMRATAC_guide, point number 2 (this is what i think is happening). You might have to change the -z , --zscore option.

El-Castor commented 4 years ago

Hi @EvanTarbell,

thanks for respond quickly. I have drag the screenshot of the genome. You right, I have a big signal at the beginning of this chromosome (you can see the two differences comparing the two screenshot).

Capture du 2020-06-23 16-36-51 Capture du 2020-06-23 16-37-48

It is what you think about the error related to the zscore setting ?

Thank in advance

EvanTarbell commented 4 years ago

Yes, it does seem to be an issue. I'd recommend reducing the value of -z , --zscore. Try reducing it to 50 or 30 and see if that helps. You could also create a BED file, with the coordinates of the high coverage region on chr10, and pass that in under the -e option (to exclude that region from the analysis).