deeptools / deepTools

Tools to process and analyze deep sequencing data.
Other
677 stars 212 forks source link

Strange pattern in y axis after Kmeans used in plotHeatmap #972

Closed PLStenger closed 4 years ago

PLStenger commented 4 years ago

I used the V. 3.3.0 version of deeptools (command line, installed from bioconda).

I have Whole Genome Bisulfite data. I run computeMatrix scale-regions -S for each file. I specify I didn't used the whole genome here, but only a selection of interesting genes (into files select_*.bed).

for SELECTION in $(ls $DATADIRECTORY/select_*.bed)
do

for FILE in $(ls $DATADIRECTORY/*.bam_sorted.bam.bw)
do

computeMatrix scale-regions -S ${FILE##*/} -R ${SELECTION##*/} --beforeRegionStartLength 3000 --regionBodyLength 5000 --afterRegionStartLength 3000 -o ${FILE##*/}_${SELECTION##*/}_matrix.mat.gz

done ;
done ;

After I did a computeMatrixOperations rbind -m on my each replicate for binding them, and then I did a computeMatrixOperations cbind -m on my four sample point.

Then I use the Kmeans (=4) technics :

plotHeatmap -m FILE_p0_05_rbind.bed_matrix.mat.gz \
     -out FILE_p0_05_rbind_kmeans_04.bed_matrix.mat.gz.pdf \
     --colorMap RdBu \
     --whatToShow 'plot, heatmap and colorbar' \
     --zMin -3 --zMax 3 \
     --kmeans 4 \
     --outFileSortedRegions FILE_p0_05_rbind_kmeans_04_out_clusters.bed

And here I have a huge y axis (between 0 to 3000), because 2 genes drive the cluster 1.

Capture d’écran 2020-07-15 à 12 07 26

When I delete them in the bed file, I redo analysis and all it's ok.

Capture d’écran 2020-07-15 à 12 07 42

I specify also when I run analysis on the whole genome without gene selection (so the two genes are in the whole set) I obtain Y value between (0 and 40). Also, it's just an example, but with others kind of subset I also obtain sometimes (but not always) this strange huge y axis range, with others genes.

So, before that I assume that the y axis was a kind of methylation percentage (because I always obtain plots between 0 and 40 for y axis), but now I'm not sure about was really is y axis (if it's not a bug in these command lines). So my question is what is exactly the y axis? And/Or if it's well methylation percentage, is there any bug that could explain this pattern?

Thanks in advance!

dpryan79 commented 4 years ago

The Y axis is whatever is in the bigWig files. So if you made bigWig files contain methylation information then that's what you're seeing.

BTW, you can omit the for loops in bash and instead pass in all of the BED and bigWig files in a single command (-S sample1.bw sample2.bw sample3.bw ... -R regions1.bed regions2.bed ...).

PLStenger commented 4 years ago

Thanks a lot Devon for your answer and your advice!

Just to be sure, if I put all the genes (=whole genome) in the bed file and I obtain values between 0-100 assume this is a percentage and your algorithme did an average on the methylation for all genes. But if a subset of the data (and I put few genes in the bed file), and if the Kmeans technique shows peaks with a value more than 2000, that's mean the very few genes in the cluster 1 have 2000 times more methylation than the genes in the other clusters?

Thanks again in advance

LeilyR commented 4 years ago

What you see in heatmap/profile plot are the values in your bigwig file corresponding to the regions you have in your bed file. This is what Devon also said in his comment. If you have all the genes or some of them as long as there are some genes were you have a large coverage in your bigwig file you would still see a large number on the Y axis. Clustering groups your regions based on the similarity in their coverage, that is why you see the regions with higher coverage cluster together.

PLStenger commented 4 years ago

Thank you very much @LeilyR for these clarifications! This is more clear now.