fmicompbio / footprintR

Tools for working with single-molecule footprinting data
https://fmicompbio.github.io/footprintR/
Other
0 stars 0 forks source link

Should `getAnchorRegions` allow `ignore.strand = TRUE`? #13

Closed csoneson closed 1 week ago

csoneson commented 1 week ago

If so, how do we decide how to aggregate the different assays across the two strands? Nvalid should be summed, mod_prob probably averaged? Or do we want to take care of this upstream?

mbstadler commented 1 week ago

I think you raise an important point, and it's not fully clear to me either. It seems there are two things that the strand of regionMidpoints could be used for:

  1. overlapping (matching) of rowRanges(se) to the new regions
  2. inverting the extracted ranges for minus-strand queries

If I understand the code correctly, it currently implements 1. but not 2. Maybe 2. should depend on a new argument reverseMinusStrand = TRUE?

I am outlining a possible behaviour below that could be implemented based on the following use case: The regionMidpoints are TSS and thus stranded. What should be generated is a meta-plot that aligns the promoters of genes at their TSS. The returned object has a coordinate system that is relative to the query regions, with the zero coordinate corresponding to the TSS, and negative/positive coordinates corresponding to regions upstream/downstream of the TSS (in a biological sense).

To get there, we would need:

Regarding the aggregation of strands, I feel like that could be done upstream if at all. What do you think?

csoneson commented 1 week ago

You are right that the code currently does not consider (2) - however, the current version doesn't fully work for (1) either. In particular, there are positions that are represented twice in the input SE; once on the + and once on the - strand. The current code match()es the desired position, ignoring the strand or not depending on the value of ignore.strand, so in principle one could end up with a mix of + and - strand matches, and for positions where both are present, it will just grab the first matching row. I think to get around this (if one would really want to ignore the strand in the matching) one would have to explicitly do two matchings - one with all positions on the + strand and one with all positions on the - strand. However, then one would need to aggregate those results, and depending on the assay, that would perhaps be done in a different way (summing or averaging). So I agree with you that it would be easier if the aggregation of strands was done upstream (where everything is still counts), and then getAnchorRegions() can always take the strand into account in the matching (which means that the provided region midpoints must also be stranded).

For (2), I think your suggestion above makes a lot of sense.

mbstadler commented 1 week ago

Right, I did not think of the case where the same position can be represented by two rows on opposite strands! I guess our data would currently not contain such case, no?

Ignoring the strand in the matching I think is important (for example for the TSS use-case above). Both plus- and minus-strand reads carry information for the chromatin at promoters, and we would not want to lose half of them.

If that makes the implementation clearer/easier, we could be very strict:

At the moment I think most use-cases I can think of would work with that.

csoneson commented 1 week ago

Fixed in https://github.com/fmicompbio/footprintR/commit/ba7ef8eb9c311adab91d0ea8dafb71b37f62da8f and https://github.com/fmicompbio/footprintR/commit/fe9593643dadb81fcb9bcd8c624b099f94dc23c6.