distance calculation - Githubissues

liz-is commented 8 years ago

the InteractionSet method gives slightly different results...

gi <- new("GenomicInteractions",
          metadata = list(experiment_name="test", description = "this is a test"),
          anchor1 = as.integer(c(1, 7, 8, 4, 5)),
          anchor2 = as.integer(c(6, 2, 10, 9, 3)),
          regions = GRanges(seqnames = S4Vectors::Rle(factor(c("chr1", "chr2")), c(6, 4)),
                            ranges = IRanges(start = c(1,6,3,4,5,7,2,3,4,5),
                                             width = c(10,9,6,7,6,10,9,8,7,8)),
                            strand = S4Vectors::Rle(c("+", "-", "+", "-"), c(2,4,1,3)),
                            seqinfo = Seqinfo(seqnames = paste("chr", 1:2, sep=""))),
          elementMetadata = DataFrame(counts = 1:5))

## InteractionSet method aliased in GenomicInteractions
pairdist(gi, type = "mid")
# [1]  6 NA  2 NA  2
calculateDistances(gi, method = "midpoint")
# [1]  6 NA  2 NA  2

# old GenomicInteractions method of distance between midpoints gave:
#[1]  5 NA  1 NA  1

#equivalent to:
distance(resize(anchorOne(gi), fix = "center", width = 1), 
        resize(anchorTwo(gi), fix = "center", width = 1), ignore.strand = TRUE)
#[1]  5 NA  1 NA  1

Can @LTLA explain how InteractionSet calculates distances?

Also for other methods, e.g.

## InteractionSet method aliased in GenomicInteractions
pairdist(gi, type = "span")
# [1]  16 NA  10 NA  8
calculateDistances(gi, method = "outer")
# [1]  16 NA  10 NA  8

# old GenomicInteractions method gave:
#[1]  14 NA  8 NA  6

Here I think the old GenomicInteractions method is incorrect. But 'span' does correspond to the punion of the anchors (works in this case as they overlap), and so I think it's correct.

For 'inner' / 'gap' methods, InteractionSet returns negative values for overlapping anchors while GenomicInteractions used to set negative distances to zero. I'm now in favour of keeping negative distances as I think it's pretty easy for the user to set them to zero if they want.

In general, how should we deal with changing how distances are calculated between package versions?

LTLA commented 8 years ago

For the midpoint distances, the differences in behaviour should just be due to rounding, where using distance in the old calculation will round the midpoint and lead to an extra base depending on whether the rounding was up or down. Computing the distances with pairdist preserves fractional midpoints in the intermediate calculations to get a more accurate midpoint-to-midpoint distance. (Incidentally, if the distance itself is fractional, it gets rounded down by pairdist, so the distance is always integer.)

Personally, I don't think you need to worry about changes in the distance calculation. In real analyses, distances should be on the order of kilobases at least, and megabases more typically; an extra couple of bases shouldn't really matter to the results. The exception is with negative values, which are technically correct but might need to be treated carefully, e.g., if you're log-transforming the distances.

liz-is commented 8 years ago

Okay, I've made the behaviour consistent with InteractionSet in e02ba83. I agree that a 1bp difference won't matter at all for real data. I'll put a note about it in NEWS or something.

ComputationalRegulatoryGenomicsICL / GenomicInteractions-old

distance calculation #29