Closed csoneson closed 1 week ago
I think you raise an important point, and it's not fully clear to me either. It seems there are
two things that the strand of regionMidpoints
could be used for:
rowRanges(se)
to the new regionsIf I understand the code correctly, it currently implements 1. but not 2.
Maybe 2. should depend on a new argument reverseMinusStrand = TRUE
?
I am outlining a possible behaviour below that could be implemented based on the following use case:
The regionMidpoints
are TSS and thus stranded. What should be generated is a meta-plot that aligns
the promoters of genes at their TSS. The returned object has a coordinate system that is relative to
the query regions, with the zero coordinate corresponding to the TSS, and negative/positive
coordinates corresponding to regions upstream/downstream of the TSS (in a biological sense).
To get there, we would need:
rowRanges
of the returned se
is all +
rowRanges(se)
, the strand of regionMidpoints
should be ignored based on ignore.strand
reverseMinusStrand==TRUE
Regarding the aggregation of strands, I feel like that could be done upstream if at all. What do you think?
You are right that the code currently does not consider (2) - however, the current version doesn't fully work for (1) either. In particular, there are positions that are represented twice in the input SE; once on the + and once on the - strand. The current code match()
es the desired position, ignoring the strand or not depending on the value of ignore.strand
, so in principle one could end up with a mix of + and - strand matches, and for positions where both are present, it will just grab the first matching row. I think to get around this (if one would really want to ignore the strand in the matching) one would have to explicitly do two matchings - one with all positions on the + strand and one with all positions on the - strand. However, then one would need to aggregate those results, and depending on the assay, that would perhaps be done in a different way (summing or averaging). So I agree with you that it would be easier if the aggregation of strands was done upstream (where everything is still counts), and then getAnchorRegions()
can always take the strand into account in the matching (which means that the provided region midpoints must also be stranded).
For (2), I think your suggestion above makes a lot of sense.
Right, I did not think of the case where the same position can be represented by two rows on opposite strands! I guess our data would currently not contain such case, no?
Ignoring the strand in the matching I think is important (for example for the TSS use-case above). Both plus- and minus-strand reads carry information for the chromatin at promoters, and we would not want to lose half of them.
If that makes the implementation clearer/easier, we could be very strict:
.checkSEValidity
if we find the same position on opposite strands (allowed by the standard, but not by footprintR
...)ignore.strand
argument and always match with ignore.strand=TRUE
in getAnchorRegions
. If only reads of one strand should be included, subsetReads(se)
can be used before calling getAnchorRegions
reverseMinusStrand==TRUE
At the moment I think most use-cases I can think of would work with that.
If so, how do we decide how to aggregate the different assays across the two strands?
Nvalid
should be summed,mod_prob
probably averaged? Or do we want to take care of this upstream?