HYsxe / PRINT


Normalizing sub-cCRE activity by Cell Counts/Fragment Counts #21

Closed pl92297 closed 3 weeks ago

pl92297 commented 3 weeks ago

Hello,

Thanks for building this great package.

I had a question related to differential analysis of sub-cCRE activity. As defined in the paper, we can quantify the activity of a sub-cCRE as the average TFBS score within the segmented region. However, since the TFBS score indicates the confidence of TF binding (and the footprinting score is itself a test statistic), I imagine that the predicted sub-cCRE activity depends on the total number of insertions in the region of interest, as a higher number of reads would provide greater statistical power. Considering this, if I wanted to perform differential activity analysis on sub-cCRE regions between samples/pseudobulks, should I consider adding library size / total number of insertions / ATAC reads / cell counts in the region as a covariate? Given that TFBS scores (CNN outputs) are not count-like, I am not sure whether it would be appropriate to normalize by read counts.
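
For concreteness, this is the kind of covariate model I had in mind. It is only a hypothetical sketch (not PRINT code); the data frame, column names, and values are made up, with `activity` standing in for the mean TFBS score of one sub-cCRE per pseudobulk and `log_insertions` as the depth covariate:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_pseudobulks = 12

# Hypothetical per-pseudobulk table for a single sub-cCRE:
# mean TFBS score, group label, and total insertions as a depth covariate.
df = pd.DataFrame({
    "activity": rng.normal(0.5, 0.1, n_pseudobulks),
    "condition": np.repeat(["A", "B"], n_pseudobulks // 2),
    "log_insertions": np.log10(rng.integers(50_000, 5_000_000, n_pseudobulks)),
})

# Depth enters as a covariate, so the condition effect is estimated
# after accounting for library size.
fit = smf.ols("activity ~ condition + log_insertions", data=df).fit()
print(fit.params)
```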

Thanks!

EDIT: I thought about this a little more. Given that the footprinting scores are generated from the left/right ratio of insertions, I would expect that to correct for the read-depth bias somewhat, but the precision of the ratio estimates would still presumably depend on the number of insertions. I see that the background dispersion model was also trained at a variable number of sequencing depths to potentially account for this, but I wonder whether there would be some way to train separate dispersion models at different sequencing depths and pick the one that roughly matches the estimated sample library size? I suspect there may also be some sort of empirical model that captures the mean/SD of dispersion at different depths. Another option might be to scale the sample insertions to match the sequencing depth of the training BAC data (roughly sketched below), but that has its own issues as well.
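
For that last option, what I was imagining is roughly the following (again hypothetical, not a PRINT function): randomly downsample each sample's per-position insertion counts to a common target depth before computing the footprint/TFBS scores.

```python
import numpy as np

def downsample_insertions(counts, target_depth, seed=0):
    """Randomly keep target_depth insertions from a per-position count vector."""
    rng = np.random.default_rng(seed)
    if counts.sum() <= target_depth:
        return counts.copy()
    # Expand counts to individual insertion events, subsample without
    # replacement, then re-bin back to per-position counts.
    positions = np.repeat(np.arange(counts.size), counts)
    kept = rng.choice(positions, size=target_depth, replace=False)
    return np.bincount(kept, minlength=counts.size)

# Example: match a deeper sample to a shallower sample's depth.
deep_sample = np.random.default_rng(1).poisson(5, size=1000)
matched = downsample_insertions(deep_sample, target_depth=2000)
print(deep_sample.sum(), matched.sum())
```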

HYsxe commented 3 weeks ago

Thanks for the question!

You are absolutely right that the overall depth will affect the scores to some extent. Using the center/flank ratio as the test statistic does alleviate this issue somewhat, but just as you said, when the depth is lower and there are fewer reads, the model will estimate higher dispersion due to greater count noise, which restricts the statistical significance.

In a certain way, this reflects the true statistical power of each dataset. If you have a dataset with 20M reads and another with 200k reads, the latter is bound to have lower statistical power, much noisier insertion profiles, and as a result overall (genome-wide) lower TFBS scores.

However, we do often want to compare across samples. For now, if you are not expecting a genome-wide gain of TF binding and accessibility across samples (for example, a global loss of nucleosomes and opening of chromatin), I think you can try to match the TFBS score distributions across samples. This can be achieved either by dividing the scores in each sample by its mean TFBS score, or by quantile normalization. Neither of these is a perfect normalization method, and we've had conversations in the lab about building a rigorous differential TF binding testing model analogous to DESeq for RNA-seq. So far we haven't had the bandwidth to do this, but we might look into it in the future.
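
For example, something along these lines could be a starting point. This is only a minimal sketch, assuming you already have a (sub-cCRE x sample) matrix of TFBS scores; neither helper is part of PRINT:

```python
import numpy as np
import pandas as pd

def mean_scale(scores: pd.DataFrame) -> pd.DataFrame:
    """Divide each sample's scores by that sample's mean TFBS score."""
    return scores / scores.mean(axis=0)

def quantile_normalize(scores: pd.DataFrame) -> pd.DataFrame:
    """Map each sample's scores onto the mean of the per-sample sorted profiles."""
    ref = np.sort(scores.values, axis=0).mean(axis=1)     # reference distribution
    ranks = scores.rank(method="first").astype(int) - 1   # 0-based ranks per sample
    out = scores.copy()
    for col in scores.columns:
        out[col] = ref[ranks[col].values]
    return out

# Hypothetical usage: rows are sub-cCREs, columns are samples/pseudobulks.
scores = pd.DataFrame(
    np.random.default_rng(0).uniform(0, 1, size=(5000, 3)),
    columns=["sample1", "sample2", "sample3"],
)
scaled = mean_scale(scores)
qnormed = quantile_normalize(scores)
```

Note that quantile normalization forces the score distributions to be identical across samples, which is why it is only appropriate when you do not expect a genome-wide shift in TF binding.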

If you have any suggestions please let us know!

Thank you!

pl92297 commented 3 weeks ago

Great, thank you for the clarification! Your suggestions of quantile normalization or normalizing to the mean TFBS score seem like a good place to start.