LiuLabUB / HMMRATAC

HMMRATAC peak caller for ATAC-seq data
GNU General Public License v3.0
98 stars 23 forks source link

Discrepancy from the same dataset #22

Closed wangjiawen2013 closed 5 years ago

wangjiawen2013 commented 5 years ago

Dear,

I mapped my paired-end ATAC-seq fastq file to ensemble GRCm38 genome and UCSC mm10 genome respectively using bowtie2. The only difference between GRCm38 and mm10 is the chromosome order, which is 1,2,3,,4,5,6,7,8,9,10,11,12,13,14,15,1,6,1,7,18,19,20,X,Y,MT for GRCm38 and chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr1,chr2,chr3, chr4,chr5,chr6,chr7,chr8,chr9,chrM,chrX,chrY for mm10. I have checked the two bam files, I am sure that the only difference is the chromosome order. After running HMMRATAC with the same set up, I find the results differs a lot. HMMRATAC used regions from chromosome 11 to train the model (GRCm38), while used regions from chromosome 9 to train the model (mm10). For GRCm38, 30624 peaks was called, while for mm10 it's 976422 ! Could you give me some advices ? By the way, does the chromosome order in genome_info file would affect the results ? And, the default value of --score “Max” refers to the maximum number of reads mapping to the open region, is the "Max" the same as the maximum coverage of the open region ?

EvanTarbell commented 5 years ago

HMMRATAC calculates the fold change above background and then chooses all sites whose fold change is within the ranges set with the -u and -l options. It then chooses the first 1000 (by default - this number can be changed by option --maxtrain) sites and trains the model on them. Therefore the order can have an effect. I will release a new version where the training sites are randomly chosen from the candidate list, rather than just choosing the first XXX number of them (i have been meaning to implement this anyway).
However, I suspect that might not be the only thing going on. Can you check the log file and in particular the model that is generated by the GRCm38? I have a feeling that Baum Welch training didn't work with that dataset. See the HMMRATAC guide, troubleshooting section, part 5. It is possible that the training sites chosen on chr11 produced an invalid model. The fix i mention above should take care of that issue in general though. Finally, to your last question about the score, yes "max" refers to the maximum read coverage of the open region.

wangjiawen2013 commented 5 years ago

here is a chunk of the GRCm38.model: ....... Training Fragment Pileup completed Kmeans Model: HMM with 3 state(s)

State 0 Pi: 0.3333333333333333 Aij: 0.333 0.333 0.333 Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.744 1.366 1.578 1.07 ]

State 1 Pi: 0.3333333333333333 Aij: 0.333 0.333 0.333 Opdf: Multi-variate Gaussian distribution --- Mean: [ 2.376 4.928 5.055 3.279 ]

State 2 Pi: 0.3333333333333333 Aij: 0.333 0.333 0.333 Opdf: Multi-variate Gaussian distribution --- Mean: [ 8.005 15.753 16.22 7.485 ]

Model created and refined. See results/L9295.model Model: HMM with 3 state(s)

State 0 Pi: 0.3333333333333333 Aij: 0.987 0.009 0.004 Opdf: Multi-variate Gaussian distribution --- Mean: [ 0.514 1.012 1.184 0.802 ]

State 1 Pi: 0.3333333333333333 Aij: 0.006 0.989 0.005 Opdf: Multi-variate Gaussian distribution --- Mean: [ 2.049 4.298 4.305 2.834 ]

State 2 Pi: 0.3333333333333333 Aij: 0.01 0.024 0.965 Opdf: Multi-variate Gaussian distribution --- Mean: [ 3.947 7.215 8.211 4.691 ]

Genome split and subtracted masked regions 0 round viterbi done 2 round viterbi done ......

I have not enough math knowledge to explain the output, especially terminologys on Hidden Markov Model, such as Pi, Aij. Could you give me a brief explanation of the above output or some documentation ? Besides, in the new HMMRATAC version, it's better for users to customize the randon_seed, so that we can reproduce the analysis with the same dataset every time, getting rid of the uncertainty caused by random sampling. And, should we remove mitochondrial reads from the bam files before calling peaks ? Finally, what's the range of peaks number for a goold sample in general ?

EvanTarbell commented 5 years ago

I uploaded version 1.2.7 to help address this. It will randomly shuffle the candidate training list before choosing based on the --maxTrain option. Additionally, i added a option for specifying a seed for that random shuffle, which can be changed by the --randomSeed option.

So it turns out that what i thought was happening regarding the model, didn't. It did use baum welch training to refine the model, so you can ignore my previous comments about that. To answer your questions about the model parameters: Pi is the initial probability that the observation is in state i (this only applies when the scanning begins) and Aij is the transition probability of going from state i to state j. The first model, generated by kmeans, fixes these values to 1/n where n is the number of states (default is 3 states). The second model is one refined by the Baum Welch algorithm (a form of EM algorithm). Although the initial probabilities remain fixed, the transitions are now fitted.

wangjiawen2013 commented 5 years ago

Dear, I cannot find the HMMRATAC_V1.2.4_exe.jar. Did you forget to upload it ?

one more question.

what does this sentence mean "Opdf: Multi-variate Gaussian distribution --- Mean: [ 2.049 4.298 4.305 2.834 ]", especially the "Mean". Sometimes the "Mean" is "Mean: [ 0 0 0 0 ]",is it correct ?

Thank you!

EvanTarbell commented 5 years ago

Version 1.2.7 is here: https://github.com/LiuLabUB/HMMRATAC/releases

Opdf stands for "Observation probability density function". The values listed as "Mean" are the means of the multivariate gaussian distribution that the emission/observations are modeled on. The four values represent the mean values of NFR, mono-nuc, di-nuc and tri-nuc signals for the state. Means of 0 are acceptable and will typically be the background state. Just keep in mind that the values are rounded for outputting, so they arent likely true zero, but close to zero (although true zeros can and do happen too)

wangjiawen2013 commented 5 years ago

The new release is very robust now! Besides, I found that when blacklist was included with -e blacklist, there was no Name_training.bed in the output, while when blacklist was not included, Name_traning.bed was generated in the output. Are there any speical purposes for this ?

EvanTarbell commented 5 years ago

No, that shouldn't have happened. I opened a new issue for that.

wangjiawen2013 commented 4 years ago

Thanks, and could you give me a recommendation on the number of atac peaks for mouse or human genomes ? Then I can use it as one of my qc metrics.