CostaLab / reg-gen

Regulatory Genomics Toolbox: Python library and set of tools for the integrative analysis of high throughput regulatory genomics data.
https://reg-gen.readthedocs.io/
Other
101 stars 30 forks source link

Question regarding how Lineplots values are calculated, related to #116 #252

Open AlexBlais74 opened 1 year ago

AlexBlais74 commented 1 year ago

Hello, I am interested in understanding precisely how the lineplots are generated, with rgt-hint --differential for ATAC-seq. This is as a follow up to issue #116, where @lzj1769 offered some information but I am not super clear on the process. The reason for my questions is that this would allow me to generate similar plots, but with varying widths around the TF motif hit, or at a subset of genomic regions of my choice, at subsets of the TF motif hits, etc.

Note: I am working with HINT - Regulatory Analysis Toolbox (RGT) - v1.0.1

Perhaps you can tell me if what I understood is correct. 1-run rgt-hint footprinting --atac-seq --paired-end to identify footprints at regions of interest (e.g. peaks)

2-run rgt-motifanalysis matching in order to locate TF motif hits among the identified footprints. Important at this step to screen a large collection of diverse TF motifs because the stats in "differential" assume a normal distribution with most TFs having no differential activity.

3-run rgt-hint tracks --bc --norm in order to generate wig/bigWig files for each of my samples at regions of interest (e.g. peaks) that are bias-corrected and normalized. This normalization is from Gusmao et al. Bioinformatics. 2014, section 2.1 of the paper (equation #1, within datasets normalization only). Between datasets normalization cannot be applied here since rgt-hint tracks only takes one bam file at a time.

4-run rgt-hint --differential --bc to calculate activity scores for each tested TF, statistics, and generate lineplots.

5-Lineplots are made by taking the ATAC read coverage (e.g. from bias-corrected signal made like at step 3) at TF motif hits (from step 2 above). Here, I suppose that between-sample normalization is also applied (Gusmao et al., equation 2).

In issue 116, Li mentions a normalization step of the signal (additional to or replacing step 5 above?) in the two tested conditions by taking, for each TF, the ratio of activity in condition A divided by activity in condition B (activity values for each TF are calculated from the output file "differential_statistics.txt", activity in condition i being PSi+TCi), then calculating the median of these ratios to obtain a scaling factor so that the ATAC signal in B can be scaled to be comparable to A, to generate the lineplot curves y-axis values.

Did I get that correctly? Additionally, I am curious about the two values provided in "differential_factor.txt". In the case of my own analysis, the values are about 0.45 and 2.25, for conditions A and B respectively. However, if I calculate from the "differential_statistics.txt" file, the median of activity in A is 0.95 and median of activity in B is 0.93, giving a ratio of medians (B/A) of 0.97. Calculating instead of median of the ratio for each factor gives 0.99. I tried to make sense of the python code in reg-gen/rgt/HINT/DifferentialAnalysis.py around lines 234-242 but I am having a hard time understanding how that fits with what Li wrote in the issue thread I referenced above. Reading the function "compute_factors", it seems that the normalization factors are calculated on the sum of ATAC signal at the TF motif hits rather than using the activity scores.

Thanks in advance for any clarification.

Alex

lzj1769 commented 1 year ago

Hi @AlexBlais74

Sorry for the late reply.

1-run rgt-hint footprinting --atac-seq --paired-end to identify footprints at regions of interest (e.g. peaks)

Correct.

2-run rgt-motifanalysis matching in order to locate TF motif hits among the identified footprints. Important at this step to screen a large collection of diverse TF motifs because the stats in "differential" assume a normal distribution with most TFs having no differential activity.

Correct. Here we followed the DEseq2 assumption to normalize the TF activity which was calculated by summing up the number counts around the binding sites within a specific window.

3-run rgt-hint tracks --bc --norm in order to generate wig/bigWig files for each of my samples at regions of interest (e.g. peaks) that are bias-corrected and normalized. This normalization is from Gusmao et al. Bioinformatics. 2014, section 2.1 of the paper (equation https://github.com/CostaLab/reg-gen/issues/1, within datasets normalization only). Between datasets normalization cannot be applied here since rgt-hint tracks only takes one bam file at a time.

Correct. This step is optional and we only ran it to visualize the signal in IGV.

4-run rgt-hint --differential --bc to calculate activity scores for each tested TF, statistics, and generate lineplots.

Correct.

5-Lineplots are made by taking the ATAC read coverage (e.g. from bias-corrected signal made like at step 3) at TF motif hits (from step 2 above). Here, I suppose that between-sample normalization is also applied (Gusmao et al., equation 2).

Here the lineplot are generated by using the signal (after bias correction) around the TF binding sites. In step 3, the signal is based on peaks. We actually re-generated the signal instead of using the signal from step 3. Regarding normalization, indeed, here we have to remove the sample-specific bias caused by sequencing depth. The normalization was perform by using method from DEseq, i.e., median-based normalization.

In issue 116, Li mentions a normalization step of the signal (additional to or replacing step 5 above?) in the two tested conditions by taking, for each TF, the ratio of activity in condition A divided by activity in condition B (activity values for each TF are calculated from the output file "differential_statistics.txt", activity in condition i being PSi+TCi), then calculating the median of these ratios to obtain a scaling factor so that the ATAC signal in B can be scaled to be comparable to A, to generate the lineplot curves y-axis values.

As mentioned above, this normalization tried to scale the signal with a normalization factor for each sample which assumes that most TFs are not changing.

Additionally, I am curious about the two values provided in "differential_factor.txt". In the case of my own analysis, the values are about 0.45 and 2.25, for conditions A and B respectively. However, if I calculate from the "differential_statistics.txt" file, the median of activity in A is 0.95 and median of activity in B is 0.93, giving a ratio of medians (B/A) of 0.97. Calculating instead of median of the ratio for each factor gives 0.99.

The TF activity in "differential_statistics.txt" have been normalized by the scaling factors, so if you calculate the factor based on the normalized data, you will have value close to 1.

I tried to make sense of the python code in reg-gen/rgt/HINT/DifferentialAnalysis.py around lines 234-242 but I am having a hard time understanding how that fits with what Li wrote in the issue thread I referenced above. Reading the function "compute_factors", it seems that the normalization factors are calculated on the sum of ATAC signal at the TF motif hits rather than using the activity scores.

You are right, the normalization factors are computed based on the ATAC signal, and we first normalized the ATAC-seq signal with the scaling factors. Based on the normalize signal, we can compute the activity score by combining TF protetction and accessibility.

Please let me know if there are anything unclear. :)

Zhijian