HYsxe / PRINT


Question about CRE regions and feature requests #12

Closed maxdudek closed 11 months ago

maxdudek commented 1 year ago

Hi,

I really like the methodology of PRINT over every other footprinting tool, but at the moment I'm having trouble figuring out how to get output in a format that is comparable to other footprinting tools I've used.

Specifically, I have two requests for features which I think are obvious functionality for a footprinting tool to have, but I'm not sure how difficult it would be to implement them - or even if they are already implemented in a way that I haven't found yet.

1. Allow for variable length regions. I would like to calculate TFBS for all regions in open chromatin, and my regions have widely different lengths. For example, here's a histogram of the lengths of my open chromatin regions:

[Image: histogram of open chromatin region lengths]

The tail stretches quite a bit to the right as well, meaning several regions are >> 1kb. If I resize them all to 1kb, I lose a lot of sequence from these regions, and the resized regions can overlap with neighboring ones. I don't think this is particular to me; I expect a lot of people would have the same issue.

2. Index TFBS scores by genomic position, rather than by region. Right now, if I want to get the TF habitation score at, say, a specific sequence motif at chr1:xxxx-xxxx, I need to first find the correct chunk, then the correct CRE, then iterate through the GRanges object to find the correct region. This is totally impractical if I want to score millions of motif sites. It would be very convenient to have a function that either (1) returns the score for a particular region, or (2) outputs TFBS scores to a standard track file format (e.g. bigwig) which can then be read using existing methods.

I would love to hear your thoughts on these features, whether they are practical to implement, or whether they contradict the intended use cases/paradigms of PRINT. Please also let me know if I am misunderstanding anything about the software and how it is meant to be used. I've only started testing it out, so I could be approaching it all wrong.

Finally, if there are any work-arounds that you can think of to do what I'm trying to do, that would be excellent. Thank you very much for your time!

HYsxe commented 1 year ago

Hi Max!

Thanks for these great questions!

Regarding the first question, we currently designed the code to take regions of the same size because many modalities (such as Tn5 bias) can then be stored as region-by-base-pair position matrices, which easily support operations like subsetting, row/column sums, etc. Usually when I have regions of varying lengths, say 500 bp to 1000 bp, I just resize everything to 1000 bp and run the code. You can subset your results very easily based on genomic coordinates (which brings us to the second question of how you pair the scores with positions).
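To make the resize-then-subset idea concrete, here is a minimal sketch in plain Python. It is not PRINT code (PRINT is an R package); the `Region` type and both helper functions are hypothetical names made up for illustration, and centering the fixed-width window on the region midpoint is an assumption about how the resize is done.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical genomic interval with 1-based, inclusive coordinates."""
    chrom: str
    start: int
    end: int


def resize_to_fixed_width(region: Region, width: int = 1000) -> Region:
    """Return a window of exactly `width` bp centered on the region midpoint.

    Assumption: the resize is center-anchored. The start is clamped so the
    window never begins before position 1.
    """
    center = (region.start + region.end) // 2
    start = max(1, center - width // 2)
    return Region(region.chrom, start, start + width - 1)


def columns_within(original: Region, resized: Region) -> range:
    """0-based column indices of the resized window that lie inside `original`.

    These are the columns to keep when subsetting one row of a
    region-by-position matrix back to the original coordinates.
    """
    lo = max(original.start, resized.start)
    hi = min(original.end, resized.end)
    return range(lo - resized.start, hi - resized.start + 1)


# A 500 bp region is padded out to 1000 bp; its own bases are recoverable.
short = Region("chr1", 10_000, 10_499)
short_resized = resize_to_fixed_width(short)
short_cols = columns_within(short, short_resized)

# A 3000 bp region is truncated: only the central 1000 bp are covered.
long_region = Region("chr1", 20_000, 22_999)
long_resized = resize_to_fixed_width(long_region)
long_cols = columns_within(long_region, long_resized)
```

The second example shows the limitation raised in the question: for a region much longer than the fixed width, the flanks fall outside the window, so they would need to be covered by additional windows if scores there are required.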

For the second question, you can use the getTFBindingSE() function to generate a site-by-pseudobulk RangedSummarizedExperiment object of the TF scores. The rowRanges will be the genomic coordinates and the columns will be your pseudobulks. For details on this function, see page 6 of our BMMCVignette.pdf tutorial.
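Once scores are laid out as a site-by-pseudobulk matrix with one genomic range per row, looking up a position becomes an interval-overlap query on the rows. The sketch below illustrates that idea in plain Python; it does not call PRINT or Bioconductor, and `SiteIndex`, the toy `sites` list, and the `scores` values are all hypothetical stand-ins for the row ranges and assay matrix.

```python
import bisect
from collections import defaultdict


class SiteIndex:
    """Hypothetical index mapping genomic intervals to matrix row numbers.

    `sites[i]` is the (chrom, start, end) range of row i, with 1-based
    inclusive coordinates. Sites may overlap one another.
    """

    def __init__(self, sites):
        self._entries = defaultdict(list)
        for row, (chrom, start, end) in enumerate(sites):
            self._entries[chrom].append((start, end, row))
        self._starts = {}
        self._max_width = {}
        for chrom, entries in self._entries.items():
            entries.sort()
            self._starts[chrom] = [s for s, _, _ in entries]
            self._max_width[chrom] = max(e - s + 1 for s, e, _ in entries)

    def overlapping_rows(self, chrom, start, end):
        """Rows whose site overlaps [start, end] on `chrom`, in start order."""
        entries = self._entries.get(chrom)
        if not entries:
            return []
        starts = self._starts[chrom]
        # A site that overlaps the query cannot start after `end`, and cannot
        # start more than (max site width - 1) bp before `start`.
        lo = bisect.bisect_left(starts, start - self._max_width[chrom] + 1)
        hi = bisect.bisect_right(starts, end)
        return [row for s, e, row in entries[lo:hi] if e >= start]


# Toy data: four sites (rows) scored in two pseudobulks (columns).
sites = [("chr1", 100, 109), ("chr1", 105, 114),
         ("chr2", 100, 109), ("chr1", 500, 509)]
scores = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.7], [0.0, 0.3]]

index = SiteIndex(sites)
rows = index.overlapping_rows("chr1", 108, 120)
hits = [scores[r] for r in rows]
```

Sorting the starts once and using binary search keeps each query cheap, which matters when scoring millions of motif sites. In R, an overlap query between the motif ranges and the object's row ranges would play the same role.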