exomiser / svart

API for use in representing and manipulating genomic variation
7 stars 1 forks source link

Unexpected results after converting empty ZERO_BASED GenomicRegion to ONE_BASED #15

Closed ielis closed 3 years ago

ielis commented 3 years ago

Consider a code where we create an empty ZERO_BASED region. Then, the following code yields:

GenomicRegion region = GenomicRegion.zeroBased(chr1, 1, 1);
System.err.println(region.withCoordinateSystem(CoordinateSystem.ONE_BASED));

ContigRegion{contig=1, strand=+, coordinateSystem=ONE_BASED, startPosition=2, endPosition=1}

Note the inconsistency of the startPosition and endPosition values.

ielis commented 3 years ago

I think we need to add a constraint to the GenomicRegion - the end position must be greater than start position for 0-based regions (end and greater than or equal to start in 1-based specs).

This is because the end coordinate is closed (included) in both 0-based and 1-based coordinate system. This implies that the region always includes some position, hence it cannot be empty.

This means that we must remove the GenomicPosition.asRegion() method. We should retain the GenomicPosition.asRegion(int padding) and GenomicPosition.asRegion(int upstream, int downstream) though. then it is not possible to convert the region to CoordinateSystem.ONE_BASED.

I implemented the changes in #12

julesjacobsen commented 3 years ago

I re-discovered this just now whilst creating the BaseGenomicRegion class

@ParameterizedTest
@CsvSource({
        "ONE_BASED,  1, 1,  ZERO_BASED, 0, 1",
        "ZERO_BASED, 0, 1,  ONE_BASED,  1, 1,",
        "ZERO_BASED, 0, 0,  ONE_BASED,  1, 1"
})
public void singleOrEmptyBase(CoordinateSystem inputCoords, int inputStart, int inputEnd, CoordinateSystem exptCoords, int exptStart, int exptEnd) {
    GenomicRegionDefault instance = GenomicRegionDefault.of(chr1, Strand.POSITIVE, inputCoords, Position.of(inputStart), Position.of(inputEnd));
    GenomicRegionDefault expected = GenomicRegionDefault.of(chr1, Strand.POSITIVE, exptCoords, Position.of(exptStart), Position.of(exptEnd));
    assertThat(instance.withCoordinateSystem(exptCoords), equalTo(expected));
}

results in:

Fails  "ZERO_BASED, 0, 0,  ONE_BASED,  1, 1"
Expected :<GenomicRegion{contig=1, strand=+, coordinateSystem=ONE_BASED, startPosition=1, endPosition=1}>
Actual   :<GenomicRegion{contig=1, strand=+, coordinateSystem=ONE_BASED, startPosition=1, endPosition=0}>

I think that some of this odd behaviour stems from removing CoordinateSystemed from GenomicPosition. A GenomicPosition.zeroBased(0) and oneBased(1) are not the same thing as the first is a slice before the first base and the second is the first base.

i.e. zeroBased(n) is a slice before base n +1 or a zeroBased(n, n), but can't be represented in one-based
oneBased(n) is the base n or a zeroBased(n-1, n) or a oneBased(n, n)

Currently GenomicPosition Making GenomicPosition implement CoodinateSystemed again should help to make this easier. The api should work with 0-based internally as this can represent both zero- and one-based regions/positions.

Alternatively a simple fix might be to add this to withCoordinateSystem():

// zero-length region is an interbase position!
if (requiredCoordinateSystem == CoordinateSystem.ONE_BASED && this.length() == 0) {
    return newInstance(contig, strand, requiredCoordinateSystem, normalisedStartPosition(requiredCoordinateSystem), endPosition().shift(coordinateSystem().startDelta(requiredCoordinateSystem)));
}
return newInstance(contig, strand, requiredCoordinateSystem, normalisedStartPosition(requiredCoordinateSystem), endPosition);

and then change the expected behaviour of these two tests in GenomicRegionTest.oneBasedRegionContainsARegion:


"ONE_BASED, 1, 1,   ZERO_BASED, 1, 1,   true",  -> false (zeroBased(1, 1) becomes oneBased(2, 2) which is not fully contained within region oneBased(1, 1)
"ONE_BASED, 2, 3,   ZERO_BASED, 3, 3,   true", - > false (zeroBased(3, 3) becomes oneBased(4, 4) which is not fully contained within region oneBased(2, 3)
julesjacobsen commented 3 years ago

Or as you say disallow 0-length regions. However, why remove the GenomicPosition.toRegion() method? Just return a zeroBased region with a 1-base downstream padding i.e.


GenomicPosition.of(chr1, POSITIVE, Position.of(pos)).toRegion();
// should equal
GenomicRegion.of(chr1, POSITIVE, ZERO_BASED, Position.of(pos), Position.of(pos + 1))
ielis commented 3 years ago

I am not sure that we need to be able to represent an interbase position/slice as a zero-length region. We can use GenomicPosition for that, can't we?

I would prefer disallowing 0-length regions. I also prefer removing GenomicPosition.toRegion() method, because we cannot implement a consistent behaviour for all genomic positions of a contig. Depending on whether we choose to map the position to a region by shifting upstream/downstream, the mapping fails for the first/last contig position. We can still make a position by calling either GenomicPosition.toRegion(0,1) or GenomicPosition.toRegion(1,0) explicitly.

julesjacobsen commented 3 years ago

Yep. Commit 2024746217f663ff16c77357f87d5943fb558b5f contains code and tests to prevent 0-length regions. This will need to be documented in the README, which is incredibly out of date now.