GenomicMedLab / cool-seq-tool

https://coolseqtool.readthedocs.io
MIT License
4 stars 0 forks source link

`GenomicData` should include start/end for both start_exon and end_exon #327

Closed korikuzma closed 4 months ago

korikuzma commented 4 months ago

Feature description

We should be returning both start and end coordinates for exons.

Use case

Griffith lab would like to use ExonGenomicCoordsMapper via Fusion Builder API and have said they need both coordinates.

Proposed solution

Alternatives considered

No response

Implementation details

Thinking of updating the structure to something like this:

class Exon(BaseException):
    """Model for exon data"""

    genomic_start: StrictInt
    genomic_end: StrictInt
    offset_start: StrictInt | None = 0
    offset_end: StrictInt | None = 0

class GenomicData(BaseModelForbidExtra):
    """Model containing genomic and transcript exon data."""

    gene: StrictStr
    chr: StrictStr
    exon_start: Exon | None = None
    exon_end: Exon | None = None
    transcript: StrictStr
    strand: Strand

Potential Impact

No response

Additional context

No response

Contribution

Yes, I can create a PR for this feature.

ahwagner commented 4 months ago

Not sure I understand the proposed model here. An exon model should not have offsets; offsets are used for describing a junction in a fusion, not an exon in a transcript.

korikuzma commented 4 months ago

@ahwagner So should offsets be calculated in FUSOR? I was wondering this yesterday when talking with @jarbesfeld

korikuzma commented 4 months ago

@ahwagner the only reason why we have offsets is because this value is used in FUSOR, so I would like to remove if it makes sense to you

jsstevenson commented 4 months ago
class Exon(BaseException):

autocomplete gotcha!

korikuzma commented 4 months ago

ug

jsstevenson commented 4 months ago

I'm having a bit of trouble understanding the ask -- are genomic_start and genomic_end the genomic coordinates corresponding to the start and end of a given exon? Do we not need to provide the exon number itself (ie the index of the exon)?

ahwagner commented 4 months ago

I'm having a bit of trouble understanding the ask -- are genomic_start and genomic_end the genomic coordinates corresponding to the start and end of a given exon? Do we not need to provide the exon number itself (ie the index of the exon)?

I also had this question, and am seeking clarification on this from the CIViC devs.

korikuzma commented 4 months ago

@ahwagner @jsstevenson sorry, I did not provide an update to the models I started working on

korikuzma commented 4 months ago

I didn't update genomic yet, but it would look similar

class Exon(BaseModelForbidExtra):
    """Model for representing exon data"""

    number: StrictInt
    genomic_start: StrictInt = Field(..., description="Genomic position that aligns to the start of the exon")
    genomic_end: StrictInt = Field(..., description="Genomic position that aligns to the end of the exon")
    offset_start: StrictInt | None = 0
    offset_end: StrictInt | None = 0

class TranscriptExonData(BaseModelForbidExtra):
    """Model containing transcript exon data."""

    transcript: StrictStr
    exon: Exon
    gene: StrictStr
    chr: StrictStr
    strand: Strand

    model_config = ConfigDict(
        json_schema_extra={
            "example": {
                "chr": "NC_000001.11",
                "gene": "TPM3",
                "exon": {
                    "number": 1,
                    "genomic_start": 154191901,
                    "genomic_end": 154192135,
                },
                "transcript": "NM_152263.3",
                "strand": Strand.NEGATIVE,
            }
        }
    )
korikuzma commented 4 months ago

I had asked @jarbesfeld if transcript coords are needed and he said no (for FUSOR).

ahwagner commented 4 months ago

I think that it makes sense to keep cool-seq-tool focused on context translation operations; given a transcript and coordinate on that transcript, we should be able to map down to the genome using an alignment from UTA. Given a genomic coordinate and a transcript, we should be able to map back up to the transcript coordinates.

I read through the documentation for this function, to ensure I understood its intent.

Provide capabilities for mapping transcript exon representation to/from genomic coordinate representation.

I think we should be clear what is meant by "exon representation" in this definition, which I assume follows the VICC exon-based transcript segment representation.

Exon representation here is defined by the VICC spec, so should be compatible with that. Don't have time to do a full write-up here, but let's set aside a little time to discuss what we are going to use this method for.

Are the proposed models representative of the result object from this query?

jsstevenson commented 4 months ago

Tagging #224 -- I think if we are going to make breaking changes in this module, it might be nice to cover other desired changes in the same sprint

korikuzma commented 4 months ago

@jsstevenson Ya, a lot of it will be cleaned up in this issue. I'll do my best to split it out into smaller PRs.

jsstevenson commented 4 months ago

I'm a little confused about this Exon class:

class Exon(BaseModelForbidExtra):
    """Model for representing exon data"""

    number: StrictInt
    genomic_start: StrictInt = Field(..., description="Genomic position that aligns to the start of the exon")
    genomic_end: StrictInt = Field(..., description="Genomic position that aligns to the end of the exon")
    offset_start: StrictInt | None = 0
    offset_end: StrictInt | None = 0

Currently, the mapper returns a genomic position, the exon within which that position lies, and the offset corresponding to where within that exon the genomic position lies (e.g. offset=0 means that the genomic position is the same place as the start of the exon). It seems like this new model instead just returns the genomic coordinates for the start and the end of a given exon. Is that... relevant to mapping? I would think that a general lookup of start/end coords for exons is sort of different than mapping a genomic location to a VICC-style exon-offset representation.

Second -- why do we need both offset_start and offset_end? Are we representing a range within the exon?

jarbesfeld commented 4 months ago

@jsstevenson For the second question, I think we would need both offset_start and offset_end as optional attributes, with only one attribute being used depending on if the breakpoint represents the start or end of a sequence. This would be consistent with the explanation below, which describes when to use the offset parameter when describing breakpoints in the context of fusions:

Under these conditions, a genomic coordinate falling within an exon represents a positive offset from the beginning of that exon if it is starting at that coordinate, or a negative offset from the end of that exon if it is ending at that coordinate. Likewise, an RNA that includes exon-adjacent intronic sequence would have a genomic coordinate with a negative offset from the beginning of that exon if it is starting at that coordinate, or a positive offset from the end of that exon if it is ending at that coordinate.

jsstevenson commented 4 months ago

I think we would need both offset_start and offset_end as optional attributes, with only one attribute being used depending on if the breakpoint represents the start or end of a sequence.

Mmmm. From an API design perspective, rather than providing two mutually-exclusive properties where the one being set implies an additional piece of information, I wonder if we'd rather have a single property plus a property make that additional piece of information explicit.

ahwagner commented 4 months ago

I'm a little confused about this Exon class:

class Exon(BaseModelForbidExtra):
    """Model for representing exon data"""

    number: StrictInt
    genomic_start: StrictInt = Field(..., description="Genomic position that aligns to the start of the exon")
    genomic_end: StrictInt = Field(..., description="Genomic position that aligns to the end of the exon")
    offset_start: StrictInt | None = 0
    offset_end: StrictInt | None = 0

Currently, the mapper returns a genomic position, the exon within which that position lies, and the offset corresponding to where within that exon the genomic position lies (e.g. offset=0 means that the genomic position is the same place as the start of the exon). It seems like this new model instead just returns the genomic coordinates for the start and the end of a given exon. Is that... relevant to mapping? I would think that a general lookup of start/end coords for exons is sort of different than mapping a genomic location to a VICC-style exon-offset representation.

Second -- why do we need both offset_start and offset_end? Are we representing a range within the exon?

Yes, these are the concerns I had with this. I think that the CIViC team had some other requirements beyond a simple mapping for which they needed to look up both ends of an exon (maybe @susannasiebert can comment on this), but I agree with @jsstevenson that doesn't seem like an intuitive response for a genomic->exon translation method.

Perhaps, once we understand what the CIViC team needed a full exon model for we can design distinct endpoints that accommodate that use case.

Mmmm. From an API design perspective, rather than providing two mutually-exclusive properties where the one being set implies an additional piece of information, I wonder if we'd rather have a single property plus a property make that additional piece of information explicit.

Yes. For genomic -> exon mappings, I think we would need the following information as 3 properties, as described in https://github.com/cancervariants/fusor/pull/172#issuecomment-2257149730:

korikuzma commented 4 months ago

@jsstevenson @ahwagner sorry for the confusion. For the past few days, @jarbesfeld and I have been discussing on how to represent the output in Cool-Seq-Tool. We have not been updating the schemas due to back and forth conversation. We believe Cool-Seq-Tool should not follow VICC or GA4GH standards. Currently, the only GA4GH convention that is followed in Cool-Seq-Tool is returning inter-residue coordinates. We envision Cool-Seq-Tool to eventually live in Biocommons, so we believe that VICC/GA4GH standards should be implemented in downstream applications (such as the Variation Normalizer and FUSOR/CurFu).

Here is an example of input/output in the Cool-Seq-Tool mapper. This example is on the positive strand and no offset.

UTA uses inter-residue based coordinates. tx_start_i/alt_start_i are inclusive, but tx_end_i/alt_end_i are not inclusive (this is a bug that I have to fix).

uta=> select * from tx_exon_aln_v where tx_ac = 'NM_003390.3' and alt_ac = 'NC_000011.10' and alt_aln_method='splign' and ord in (1, 10);
+------+-------------+--------------+----------------+------------+-----+------------+----------+-------------+-----------+-------+---------+----------+----------------+-----------------+------------+-------------+-------------+
| hgnc |    tx_ac    |    alt_ac    | alt_aln_method | alt_strand | ord | tx_start_i | tx_end_i | alt_start_i | alt_end_i | cigar | tx_aseq | alt_aseq | tx_exon_set_id | alt_exon_set_id | tx_exon_id | alt_exon_id | exon_aln_id |
+------+-------------+--------------+----------------+------------+-----+------------+----------+-------------+-----------+-------+---------+----------+----------------+-----------------+------------+-------------+-------------+
| WEE1 | NM_003390.3 | NC_000011.10 | splign         |          1 |   1 |        829 |     1035 |     9575887 |   9576093 | 206=  |         |          |          77211 |          775379 |     706112 |     6609590 |     4325817 |
| WEE1 | NM_003390.3 | NC_000011.10 | splign         |          1 |  10 |       2040 |     3359 |     9588448 |   9589767 | 1319= |         |          |          77211 |          775379 |     706121 |     6609599 |     4325813 |
+------+-------------+--------------+----------------+------------+-----+------------+----------+-------------+-----------+-------+---------+----------+----------------+-----------------+------------+-------------+-------------+

This

await mapper.genomic_to_transcript_exon_coordinates(
  alt_ac="NC_000011.10",
  start=9575888,
  end=9589767,
  strand=Strand.POSITIVE,
  transcript="NM_003390.3",
  residue_mode=ResidueMode.RESIDUE
)

and

await mapper.transcript_to_genomic_coordinates(
  transcript="NM_003390.3",
  exon_start=2,
  exon_end=11,
  residue_mode=ResidueMode.RESIDUE
)

will both return:

{
  "gene": "WEE1",
  "alt_ac": "NC_000011.10",
  "exon_start": {
      "ord": 1,
      "tx_start_i": 829,
      "tx_end_i": 1035,
      "alt_start_i": 9575887,
      "alt_end_i": 9576093,
  },
  "exon_start_offset": 0,    # input.start - alt_start_i - 1
  "exon_end": {
      "ord": 10,
      "tx_start_i": 2040,
      "tx_end_i": 3359,
      "alt_start_i": 9588448,
      "alt_end_i": 9589767
  },
  "exon_end_offset": 0,  # input.end - alt_end_i
  "tx_ac": "NM_003390.3",
  "strand": Strand.POSITIVE,
}

exon_start and exon_end are being pulled directly from UTA (no modifications).

Yes. For genomic -> exon mappings, I think we would need the following information as 3 properties, as described in https://github.com/cancervariants/fusor/pull/172#issuecomment-2257149730:

genomic coordinate
target transcript
5' or 3' junction direction

@ahwagner @jsstevenson Does this help resolve any confusion? What do you think of the output? We weren't sure on if transcript coordinates should be included.

ahwagner commented 4 months ago

No–I feel like there are two conversations going on here. @jsstevenson and I are asking why we need an exon model for genomic <-> exon coordinate conversion. This is independent of schema and/or what data standards we want to apply.

korikuzma commented 4 months ago

@ahwagner There was an earlier comment on assuming we were using VICC exon representation, so that's why we made a comment about the standards.

Please ignore the outdated models in the PR description and from the July 25 comment. The previous Exon model was a quick thought. In our last comment, exon_start and exon_end replace this model.

ahwagner commented 4 months ago

The issue I'm trying to gain clarity on is what operation we are performing here. What is the use case for this feature? When you want to convert a transcript segment to two genomic coordinates on a chromosome, or when you want to convert two genomic coordinates on a chromosome to a transcript segment? What if you only have one coordinate? In most real-world cases, we will have one coordinate from each of the ends of two transcripts that form a junction, not a full segment. Therefore, I was expecting this method to convert a genomic coordinate and direction to an exon-based representation of a junction, and vice versa.

jsstevenson commented 4 months ago

Do we want to meet to go over the proposed overhaul? (Conveniently I am out of the office today but probably could meet virtually at 2)

korikuzma commented 4 months ago

The issue I'm trying to gain clarity on is what operation we are performing here.

genomic_to_transcript_exon_coordinates and transcript_to_genomic_coordinates will return aligned transcript exon data (transcript / genomic positions for the start and end of the exon + corresponding exon number); this is also lifted to GRCh38 assembly.

Given same example above, the previous output would give:

{
        "chr": "NC_000011.10",
        "gene": "WEE1",
        "start": 9575887,
        "exon_start": 2,
        "exon_start_offset": -205,
        "end": 9589767,
        "exon_end": 11,
        "exon_end_offset": 1318,
        "transcript": "NM_003390.3",
        "strand": Strand.POSITIVE,
    }

This is the suggested output:

{
  "gene": "WEE1",
  "alt_ac": "NC_000011.10",
  "exon_start": {
      "ord": 1,
      "tx_start_i": 829,
      "tx_end_i": 1035,
      "alt_start_i": 9575887,
      "alt_end_i": 9576093,
  },
  "exon_start_offset": 0,    # input.start - alt_start_i - 1
  "exon_end": {
      "ord": 10,
      "tx_start_i": 2040,
      "tx_end_i": 3359,
      "alt_start_i": 9588448,
      "alt_end_i": 9589767
  },
  "exon_end_offset": 0,  # input.end - alt_end_i
  "tx_ac": "NM_003390.3",
  "strand": Strand.POSITIVE,
}

Things to point out:

What is the use case for this feature?

Some use cases: I have genomic position(s). I want to know: What exon does this genomic position occur on if aligned to a given transcript? I have an exon number. I want to know: What are the genomic positions for the start and end of this exon?

When you want to convert a transcript segment to two genomic coordinates on a chromosome, or when you want to convert two genomic coordinates on a chromosome to a transcript segment?

@jarbesfeld : This does allow both. That's why we have the genomic_to_transcript_exon_coordinates and transcript_to_genomic_coordinates methods.

What if you only have one coordinate?

genomic_to_transcript_exon_coordinates: You can provide start or end, rather than both transcript_to_genomic_coordinates: You can provide exon_start or exon_end, rather than both

Therefore, I was expecting this method to convert a genomic coordinate and direction to an exon-based representation of a junction, and vice versa.

@jarbesfeld : Our example output has this since we have the exon number + offset, so it's a question of how to structure the data.

korikuzma commented 4 months ago

I'm going to close this for now until CIViC team confirms if this is actually needed. I will make new issues for changes requested by @ahwagner