Troubleshoot min-max normalization and differences in max values across chromosomes

When working with the scATAC-seq data for GM12878, I was having poor performance of the models despite using deduplication of the scATAC-seq data like our bulk data. The figure below shows the performance is low across most models.

20210417_scATAC_bulk_GM12878_200bp_chr1

The figure below shows that our precision takes a hit for ELF1 20210417_ELF1_PRcurve_200bp_chr1_bulk_scATAC_GM12878

I wanted to double check that the signal tracks did not have errors or if there were obvious problems with the data. The figure below shows the IGV track for bulk, scATAC 500 cells, and scATAC 10k cells. The predictions are similar in numerical value and also shape, but there are more false positives with the scATAC-seq data.

I then looked at the distribution of scores across the scATAC and bulk ATAC-seq for chr1. The figure below shows that the scores for chromosome 1 is different for scATAC-seq vs bulk ATAC-seq. The y-axis shows density and the x-axis is score value. The max scores for the scATAC-seq data are only half of the bulk data.

20210421_bulk_scatac_dist

So I looked at the scores across all chromosomes to see what the maximum value is. We min-max our data between 0 and 1 based on the genome-wide max score. The abnormally high signal is not found in the scATAC-seq data as in the bulk data.

20210421_GM12878_max_scores_per_chromosome

I also compared the max scores of all chromosomes across all cell types we use for training. It looks like there is abnormally high counts on chr1, chr17, and chr21. I further dug into these regions.

20210427_bulkATAC_maxScores

I have started to visualize some of these regions. It looks like there is a single region in chr1 that has abnormally high signal across all cell types. It is adjacent to an annotated blacklist region.

It looks like these regions have the same shape and intensity across cell types. The data shown is minmax normalized between 0,1 to show the max value is associated with these regions. This regions is within the first 600kb of chr1.

The next region is found on chr21 and falls within annotated blacklisted almost entirely encompassing the p-arm of the chromosome.

The third location I looked at shows another high signal region across all cell types. It might be necessary to inspect some of these regions. We could use an approach similar to Anshul Kundaje in the ENCODE blacklist paper.

It looks like the regions of chr21 are representative of ribosomal gene clusters that are largely variable between individuals. Ribosomal gene clusters and NORs NORs and acrocentric chromosomes

I think we should blacklist the short arm of chr13, 14, 15, 21, and 22. When I checked chr14, 13, and 15 were already in the blacklist. The majority of 21 and 22 are blacklisted. We have 21 and 22 because they were recently sequenced with a hg38 patch.

On 28 February 2019, GRCh38.p13 was released, which added the NOR sequences for the short arms of chromosomes 13, 14, 15, 21, and 22. hg38 sequence for NORs

These regions are not reliable because humans have many tandem repeats of these sites and poorly annotated. I also show the mappability track for reads with 50bp length. The regions in chr21 have low mappability.

We should think about excluding regions with low mappability. Maybe use the 100bp read track and remove regions below a specific threshold.

I excluded the regions that were identified above in addition to the short arm of chr21. The overall scale dropped down, but there seems to be regions with high signal in chr17 and chr20 (1 cell type).

20210501_bulkATAC_blacklisted_maxScores

I downloaded the mappability tracks referenced in BPnet. The mappability tracks are downloaded from the original paper for 100bp reads. I started thinking about what threshold to use to cutoff what is mappable. BPnet uses any region that is greater than value=0. This means that it has at least 1 kmer that mapped to it in the simulation. These results are only for single-end reads and there is no data for what it would be like with a paired end read.

I am not sure mappability will help us in this case. It might reduce some of the noise, but some of the highest regions do not align with low mappability regions. The example below shows the highest region on chr17. It is from MCF-7, a breast cancer cell line. This gene is noted to have high expression in cancer and breast cancer cell lines Selectively triggering mitotic failure.

Screen Shot 2021-05-02 at 3 56 05 PM

Based on some of these observations it looks like there might be a difference between transformed cell lines and those that are derived from primary tissue.

Screen Shot 2021-05-02 at 3 59 54 PM

@XiaotingChen suggested using a Q-Q plot to show the distribution of the scores from bulk ATAC-seq compared to scATAC-seq. I calculated the mean score for every bin along chr1 at 100 bp resolution. I tried 1bp resolution, but it was just a dense blob.The distributions are approximately falling along the center dotted line, but there are several areas where there is very high signal in the bulk ATAC-seq and no signal in the scATAC-seq.

20210503_bulk_scatac_qqplot_100bp_mean_RP20M_original

QQ after minmax normalization we see the scATAC-seq scores for chr1 are higher than in the bulk ATAC-seq data for most regions with a few outliers.

20210503_bulk_scatac_qqplot_100bp_mean_minmax_original

Once the curated regions were removed, The q-q plot shows the scores in the scATAC-seq data are higher at each point than in the bulk dataset.

20210503_bulk_scatac_qqplot_100bp_mean_minmaxBlacklisted

The minmax of the data seems to be similar across the same regions of chr1 for both data sets.

20210503_bulk_scatac_qqplot_100bp_mean_minmaxBlacklisted

Zoom of the above plot:

20210503_bulk_scatac_qqplot_100bp_mean_minmaxBlacklisted_zoom

These observation brought me to another point that might be causing issues with our predictions. Our bulk data averages multiple replicates together. We do this using pyBigWig and numpy. Code below. We start by creating an array of zeros the length of the chromosome to average. Then we loop through each cell type and add the values at each position then divide by the total number of cell types/input files.

https://github.com/MiraldiLab/maxATAC/blob/86c0634c9613f64879334ccabc999aadb8b08d6e/maxatac/analyses/average.py#L59

This method produces numpy arrays that have 0's instead of NaN or None values. This is how our bulk data looks.

bulk_RP20M_array[0:10]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

We do not average any files together for the scATAC-seq. So the regions of the genome that have no signal default to None.

scatac_RP20M_array[0:10]
[None, None, None, None, None, None, None, None, None, None]

The first thing I want to test is just replacing the None with 0 for the scATAC-seq data and making predictions again. This will test whether it was an error with how we fill in missing data in each example.

There are also differences in the signal that could be due to alignment. The high regions of signal could be due to us not doing enough quality filtering of our STAR alignments.

We used STAR for our data and Bowtie2 for the scATAC-seq data. There are differences in how the reads are reported and the quality scores that we get with each read. The quality score of STAR is 255 for reads that map uniquely and a log score for tracking multimapped reads. We filter for reads that are mapped (255), but this does not allow us to filter for the quality of match. STAR has other parameters that you can modify to do this. TOBIAS is a ATAC-seq footprinting method that uses the ATAC-seq signal to look for TF footprints. TOBIAS uses STAR and additional parameters to control for number of mismatches in the different seeds, the required overlap of each read, etc..

Below I compare our code and TOBIAS. I excluded path parameters, output parameters, and default parameter settings.

TOBIAS

STAR --alignIntronMax 1
--alignMatesGapMax 2000 \
--alignEndsType EndToEnd \
--outFilterMismatchNoverLmax 0.1 \ # default .3 alignment will be output only if its ratio of mismatches to *mapped* length is less than or equal to this value
--outFilterMatchNmin 20 \ # default 0 alignment will be output only if the number of matched bases is higher than or equal to this value
--alignSJDBoverhangMin 999 \ # default 3 minimum overhang (i.e. block size) for annotated (sjdb) spliced alignments
--alignEndsProtrude 10 ConcordantPair \ default 0 allow protrusion of alignment ends, i.e. start (end) of the +strand mate downstream of the start (end) of the -strand mate
--outMultimapperOrder Random \ # output multi-mapped reads randomly
--outFilterMultimapNmax 999 \ # keep 999 randomly sampled multi-mapped reads
--outSAMmultNmax 1 # report only 1 line (location) for multi-mapped reads in the SAM file

maxATAC

STAR --alignIntronMax 1 \
--alignMatesGapMax 2000 \
--alignEndsType EndToEnd

It might be worth following up on the STAR vs Bowtie2 comparison using GM12878 and aligning it with bowtie2 and comparing the signal tracks to scATAC to see if they are more similar or not. It would also be interesting to see if the regions with high noise are removed. It looks like we need to optimize STAR to remove low quality matches in the parameters and cannot do it with samtools filtering like we do for bowtie2. TOBIAS also has a method for multi mapped reads which is just outputs 1000 random multi-mapped reads. We currently completely exclude these reads.

I benchmarked new prediction based on the np.nan_to_num code change. This actually increased the benchmarking time ~4X and produced the same results based on a test with ELF1 predictions for chr1 at 200bp. This is not the best approach and did not fix the issue.

Bulk Prediction

ELF1_GM12878_FinalModel_SlidingWindow256_200_max_GM12878_chr1_200bp_PRC

New predictions converting none values to 0

GM12878_scATAC_10k_ELF1_scATAC_SlidingWindow256_200_max_GM12878_chr1_200bp_PRC

I checked with @JosephWayman and it looks like the ATAC-seq signal has been adjusted for Tn5 bias, but needs to be centered at the insertion site. I will recenter the signal and slop with 40bp to exactly replicate the bulk ATAC-seq data signal generation. I will test prediction using the new signal tracks.

When I used the correct signal it change the shape of the scATAC-seq signal. From a single example, it looks similar to the bulk ATAC-seq signal. In comments above you can see an IGV track that shows the overall areas of enrichment are the same, but the very peaky signal is not seen in scATAC as in the Bulk ATAC. In the image below, you can see that correcting for the Tn5 bias similarly to the bulk ATAC-seq data makes the signal very close to the bulk data.

Here is the Q-Q plot for the correctly processed scATAC-seq track for 10k cells compared to the bulk ATAC-seq data. It looks like the signal is higher in the bulk ATAC-seq in most positions now. They were previously skewed towards scATAC-seq when the signal was based on the shifted reads and not the insertion sites.

20210503_bulk_scatac_qqplot_100bp_mean_minmaxBlacklisted_correct_signal

This change in the signal did not help though. The predictions and benchmark was exactly the same as using the incorrect signal as input. I show below the AUPR curve for ELF1 at 200bp resolution on chr1 of GM12878 using the corrected Tn5 signal as input.

GM12878_scATAC_10k_ELF1_scATAC_SlidingWindow256_200_max_GM12878_chr1_200bp_PRC

I was still thinking that maybe there is something to do with the scale of the signals. The scATAC seq does not have a lot of the very high signal regions that the bulk data does. So the scATAC-seq signal is not as squashed as the bulk ATAC-seq data that was used to train the models. The last test I wanted to perform was whether I could squash the scATAC-seq signal to be similar to the bulk ATAC-seq signal. To do this I took the max value from the bulk ATAC-seq data before normalization and used it for the max value of the scATAC-seq data during minmax normalization.

The plot below shows the Q-Q plot for the data after MinMax normalization. The data is now skewed towards the bulk data in terms of higher scores at each position. The image only zooms in on the range with the most density.

20210503_bulk_scatac_qqplot_100bp_mean_minmaxBlacklisted_zoom

I tested my predictions using the new minmax signal and surprisingly it produced the same results as our bulk data. A few things to note. The difference between the "scATAC 10k cells Correct" and the "scATAC 10k cells Incorrect" group is that the "correct" group has the proper Tn5 bias correction. The "Incorrect" group is the original scATAC predictions I made several weeks ago using our standard pipeline. The "scATAC 10k cells Correct" group only has 8 points as it was unable to complete prediction and benchmarking within the same walltime as the other runs. I have not rerun them. The "minmaxBulk" is the set that was minmax normalized to the bulk GM12878 max value.

20210506_bulk_scatac_boxplot_200bp_chr1_minmaxBulk

20210506_bulk_scATAC_scatterplot_200bp_chr1_GM12878

Below is an example of an area where there is a false peak in the scATAC-seq data that was minmax normalized based on the max value in the scATAC-seq sample. The minmaxBulk refers to samples that have been normalized based on the max value in the bulk sample. After normalizing to the bulk signal, the data has a lower ATAC-seq signal than before and is closer to the bulk ATAC-seq data. Leopard implements a signal scaling factor that is randomly applied to all tracks. Maybe we can use that approach to make our training more robust to differences in the absolute value of our signals.

These results suggest that there needs to be modifications to the normalization of our data. We are going to try median-mad normalization and also selecting the max value base don the 99th or 95th percentile max value. Another option that is very labor intensive and time intensive is to map all of our data with bowtie2. The other option is to also look more into limiting our signal tracks to regions with a mappability greater than .5. These results suggest that the input data must also be in a similar range or distribution as the data used to train the model. The major difference between the bulk data and the scATAC-seq data is that they are mapped differently in addition to being different technologies.

These sound somehow confusing to me. I think we need to think about what does it even mean to normalize using bulk's max value (which is with a large value), wouldn't it suppress all signal values in the scATAC? And what's the reason for suppressed scATAC signals lead to a better performance? A second question is how do we know if bulk or scATAC is giving a correct peak when they are contradicting with each other?

Normalizing by bulk's max value is a hack, but, it suggests that using a "robust max" like 95th or 99th percentile-max value might overcome what seem to be problematic mapping or experimental differences of bulk verses sc-ATAC. The assumption is that while the absolute maxes between RP20M normalized scATAC-seq and ATAC-seq are different (due to outlier), the 95th or 99th percentile-max value will be more similar. It will be worth it to use the exact same mapping strategy for scATAC-seq and ATAC-seq, so that we can discern whether it's an experimental or STAR-vs-bowtie2 issue.

I started looking into why our ATAC-seq data has really high signal regions that are not present in scATAC-seq data. I used GM12878 data from ENCODE to test whether our processing approach makes a difference. CellRanger uses bwa-MEM for alignment and our method uses STAR. We might not be using the best parameters for STAR since these regions are not present in the scATAC-seq data. I test this by using bowtie2, STAR, and STAR with TOBIAS parameters.

The rationale for using bowtie2 is that is the standard method that we have used for ATAC-seq analysis, but it takes longer. STAR is designed for RNA-seq and must be adjusted correctly in order to use it with ATAC-seq data. The TOBIAS method uses STAR to map reads, but they allow multi-mapped reads and also restrict the matches to have more stringent overlap criteria.

I found that the regions that have high signal with STAR are not present with bowtie2 alignment. These regions are made worse with the TOBIAS settings that allow for multi-mapped reads. Even though TOBIAS has more stringent parameters for alignments, the multi-mapped reads seem to amplify the regions of high signal.

Screen Shot 2021-05-12 at 11 03 18 PM

Most regions seems approximately the same across all alignment methods and experimental types.

Screen Shot 2021-05-12 at 11 09 51 PM

I looked at the bowtie2 vs STAR scatterplot of the max scores in 1000bp bins. There are several places on chr1 that are highly scored in the STAR alignment signal track compared to the bowtie2 signal track. The figures below have the incorrect titles! They are 1000bp bins with the MAX score in each bin not the mean. I will update later.

20210512_GM12878_bowtie2_STAR-maxatac_RP20M_scatterplot

The STAR parameters used in TOBIAS produced very high signal in these problematic regions.

20210512_GM12878_STAR-maxatac_STAR-TOBIAS_RP20M_scatterplot

20210512_GM12878_bowtie2_STAR-TOBIAS_RP20M_scatterplot

When I looked at the scATAC (which uses a modified BWA-MEM) compared to the bowtie2 data, our overall scores were higher in the bowtie2 aligned data.

20210512_GM12878_bowtie2_scATAC_RP20M_scatterplot

I think we could look more into bwa-mem and why it considered to be better or use bowtie2 for our data. I do not think that STAR data is really the best unless we curate many regions we are sus. about.

I think median normalization or 95th/99th percentile based minmax normalization will help us get around these outliers caused by technical choices.

The scATAC-seq data was aligned using the bwa-mem package. I wanted to test whether this method would get around some of the high signal regions that we have been finding. I also wanted to see how similar the results were to our current methods. The bwa mem tool took less than an hour to align the reads, but produced results similar to bowtie2.

20210514_bowtie2_bwa-mem_GM12878

We still have regions of high noise in all alignment methods. These regions are made worse with multi-mapped reads (TOBIAS). I also included a track from the ENCODE ATAC-seq pipeline output. I also included a ChIP-seq track as a comparator.

It looks like this region on chr1 corresponds to a segmentation duplication that has 98% sequence similarity to chrM. We usually remove chrM matches, but these mapped to chr1. I am going to test removing the regions that we have found in the genome in addition to retraining models for 11 TFs using 95th, 99th percentile, median-mad normalization, or using only curated data with min-max normalization.

Things to test:

Normalization : 95th percentile, 99th percentile, median-mad, min-max Blacklist curation : standard blacklist or blacklist + curated regions

Normalization comparisons task list:

[x] ID all of the cell types needed to train 11 TFs
[x] Normalize the data using 4 different approaches. Data has been RP20M normalized and averaged by cell type already.
[x] The RP20M normalization is at bp resolution so we can remove regions that overlap the curated regions as 1 comparison
[x] Train models for 11TFs using 4 different approaches.
[x] Select the best model for each method.

I trained 11 TF models using 5 different normalization approaches. Models trained with the 99th percentile max perform the best overall.

20210523_11TF_GM12878_200bp_chr1_boxplot_comparing_percentile_normalization

For scATAC-seq data, models trained with 99th percentile max value min-max normalization and tested in scATAC-seq data normalized with 99th percentile minmax perform the best.

20210523_11TF_GM12878_200bp_chr1_boxplot_comparing_percentile_normalization_scatac

In the figure below, I took GM12878 and aligned it with 4 different methods. I tested the performance of using different alignment/normalization methods together. These comparisons predict in signal that has been normalized the same way. For example, STAR-maxatac trained with 99th percentile minmax will be tested in STAR-TOBIAS 99th percentile minmax category. I did not perform all pairwise combinations.

20210523_11TF_GM12878_200bp_chr1_boxplot_comparing_percentile_normalization_across_aligners

Together, these results suggest that 99th percentile minmax normalization is robust to different alignment protocols and data types. We will look further into how different alignment methods influence model performance.

MiraldiLab / maxATAC

Troubleshoot min-max normalization and differences in max values across chromosomes #24

Bulk Prediction

New predictions converting none values to 0