Closed korikuzma closed 4 months ago
Not sure I understand the proposed model here. An exon model should not have offsets; offsets are used for describing a junction in a fusion, not an exon in a transcript.
@ahwagner So should offsets be calculated in FUSOR? I was wondering this yesterday when talking with @jarbesfeld
@ahwagner the only reason why we have offsets is because this value is used in FUSOR, so I would like to remove if it makes sense to you
class Exon(BaseException):
autocomplete gotcha!
ug
I'm having a bit of trouble understanding the ask -- are genomic_start
and genomic_end
the genomic coordinates corresponding to the start and end of a given exon? Do we not need to provide the exon number itself (ie the index of the exon)?
I'm having a bit of trouble understanding the ask -- are
genomic_start
andgenomic_end
the genomic coordinates corresponding to the start and end of a given exon? Do we not need to provide the exon number itself (ie the index of the exon)?
I also had this question, and am seeking clarification on this from the CIViC devs.
@ahwagner @jsstevenson sorry, I did not provide an update to the models I started working on
I didn't update genomic yet, but it would look similar
class Exon(BaseModelForbidExtra):
"""Model for representing exon data"""
number: StrictInt
genomic_start: StrictInt = Field(..., description="Genomic position that aligns to the start of the exon")
genomic_end: StrictInt = Field(..., description="Genomic position that aligns to the end of the exon")
offset_start: StrictInt | None = 0
offset_end: StrictInt | None = 0
class TranscriptExonData(BaseModelForbidExtra):
"""Model containing transcript exon data."""
transcript: StrictStr
exon: Exon
gene: StrictStr
chr: StrictStr
strand: Strand
model_config = ConfigDict(
json_schema_extra={
"example": {
"chr": "NC_000001.11",
"gene": "TPM3",
"exon": {
"number": 1,
"genomic_start": 154191901,
"genomic_end": 154192135,
},
"transcript": "NM_152263.3",
"strand": Strand.NEGATIVE,
}
}
)
I had asked @jarbesfeld if transcript coords are needed and he said no (for FUSOR).
I think that it makes sense to keep cool-seq-tool focused on context translation operations; given a transcript and coordinate on that transcript, we should be able to map down to the genome using an alignment from UTA. Given a genomic coordinate and a transcript, we should be able to map back up to the transcript coordinates.
I read through the documentation for this function, to ensure I understood its intent.
Provide capabilities for mapping transcript exon representation to/from genomic coordinate representation.
I think we should be clear what is meant by "exon representation" in this definition, which I assume follows the VICC exon-based transcript segment representation.
Exon representation here is defined by the VICC spec, so should be compatible with that. Don't have time to do a full write-up here, but let's set aside a little time to discuss what we are going to use this method for.
Are the proposed models representative of the result object from this query?
Tagging #224 -- I think if we are going to make breaking changes in this module, it might be nice to cover other desired changes in the same sprint
@jsstevenson Ya, a lot of it will be cleaned up in this issue. I'll do my best to split it out into smaller PRs.
I'm a little confused about this Exon
class:
class Exon(BaseModelForbidExtra):
"""Model for representing exon data"""
number: StrictInt
genomic_start: StrictInt = Field(..., description="Genomic position that aligns to the start of the exon")
genomic_end: StrictInt = Field(..., description="Genomic position that aligns to the end of the exon")
offset_start: StrictInt | None = 0
offset_end: StrictInt | None = 0
Currently, the mapper returns a genomic position, the exon within which that position lies, and the offset corresponding to where within that exon the genomic position lies (e.g. offset=0
means that the genomic position is the same place as the start of the exon). It seems like this new model instead just returns the genomic coordinates for the start and the end of a given exon. Is that... relevant to mapping? I would think that a general lookup of start/end coords for exons is sort of different than mapping a genomic location to a VICC-style exon-offset representation.
Second -- why do we need both offset_start
and offset_end
? Are we representing a range within the exon?
@jsstevenson For the second question, I think we would need both offset_start
and offset_end
as optional attributes, with only one attribute being used depending on if the breakpoint represents the start or end of a sequence. This would be consistent with the explanation below, which describes when to use the offset parameter when describing breakpoints in the context of fusions:
Under these conditions, a genomic coordinate falling within an exon represents a positive offset from the beginning of that exon if it is starting at that coordinate, or a negative offset from the end of that exon if it is ending at that coordinate. Likewise, an RNA that includes exon-adjacent intronic sequence would have a genomic coordinate with a negative offset from the beginning of that exon if it is starting at that coordinate, or a positive offset from the end of that exon if it is ending at that coordinate.
I think we would need both offset_start and offset_end as optional attributes, with only one attribute being used depending on if the breakpoint represents the start or end of a sequence.
Mmmm. From an API design perspective, rather than providing two mutually-exclusive properties where the one being set implies an additional piece of information, I wonder if we'd rather have a single property plus a property make that additional piece of information explicit.
I'm a little confused about this
Exon
class:class Exon(BaseModelForbidExtra): """Model for representing exon data""" number: StrictInt genomic_start: StrictInt = Field(..., description="Genomic position that aligns to the start of the exon") genomic_end: StrictInt = Field(..., description="Genomic position that aligns to the end of the exon") offset_start: StrictInt | None = 0 offset_end: StrictInt | None = 0
Currently, the mapper returns a genomic position, the exon within which that position lies, and the offset corresponding to where within that exon the genomic position lies (e.g.
offset=0
means that the genomic position is the same place as the start of the exon). It seems like this new model instead just returns the genomic coordinates for the start and the end of a given exon. Is that... relevant to mapping? I would think that a general lookup of start/end coords for exons is sort of different than mapping a genomic location to a VICC-style exon-offset representation.Second -- why do we need both
offset_start
andoffset_end
? Are we representing a range within the exon?
Yes, these are the concerns I had with this. I think that the CIViC team had some other requirements beyond a simple mapping for which they needed to look up both ends of an exon (maybe @susannasiebert can comment on this), but I agree with @jsstevenson that doesn't seem like an intuitive response for a genomic->exon translation method.
Perhaps, once we understand what the CIViC team needed a full exon model for we can design distinct endpoints that accommodate that use case.
Mmmm. From an API design perspective, rather than providing two mutually-exclusive properties where the one being set implies an additional piece of information, I wonder if we'd rather have a single property plus a property make that additional piece of information explicit.
Yes. For genomic -> exon mappings, I think we would need the following information as 3 properties, as described in https://github.com/cancervariants/fusor/pull/172#issuecomment-2257149730:
@jsstevenson @ahwagner sorry for the confusion. For the past few days, @jarbesfeld and I have been discussing on how to represent the output in Cool-Seq-Tool. We have not been updating the schemas due to back and forth conversation. We believe Cool-Seq-Tool should not follow VICC or GA4GH standards. Currently, the only GA4GH convention that is followed in Cool-Seq-Tool is returning inter-residue coordinates. We envision Cool-Seq-Tool to eventually live in Biocommons, so we believe that VICC/GA4GH standards should be implemented in downstream applications (such as the Variation Normalizer and FUSOR/CurFu).
Here is an example of input/output in the Cool-Seq-Tool mapper. This example is on the positive strand and no offset.
UTA uses inter-residue based coordinates. tx_start_i
/alt_start_i
are inclusive, but tx_end_i
/alt_end_i
are not inclusive (this is a bug that I have to fix).
uta=> select * from tx_exon_aln_v where tx_ac = 'NM_003390.3' and alt_ac = 'NC_000011.10' and alt_aln_method='splign' and ord in (1, 10);
+------+-------------+--------------+----------------+------------+-----+------------+----------+-------------+-----------+-------+---------+----------+----------------+-----------------+------------+-------------+-------------+
| hgnc | tx_ac | alt_ac | alt_aln_method | alt_strand | ord | tx_start_i | tx_end_i | alt_start_i | alt_end_i | cigar | tx_aseq | alt_aseq | tx_exon_set_id | alt_exon_set_id | tx_exon_id | alt_exon_id | exon_aln_id |
+------+-------------+--------------+----------------+------------+-----+------------+----------+-------------+-----------+-------+---------+----------+----------------+-----------------+------------+-------------+-------------+
| WEE1 | NM_003390.3 | NC_000011.10 | splign | 1 | 1 | 829 | 1035 | 9575887 | 9576093 | 206= | | | 77211 | 775379 | 706112 | 6609590 | 4325817 |
| WEE1 | NM_003390.3 | NC_000011.10 | splign | 1 | 10 | 2040 | 3359 | 9588448 | 9589767 | 1319= | | | 77211 | 775379 | 706121 | 6609599 | 4325813 |
+------+-------------+--------------+----------------+------------+-----+------------+----------+-------------+-----------+-------+---------+----------+----------------+-----------------+------------+-------------+-------------+
This
await mapper.genomic_to_transcript_exon_coordinates(
alt_ac="NC_000011.10",
start=9575888,
end=9589767,
strand=Strand.POSITIVE,
transcript="NM_003390.3",
residue_mode=ResidueMode.RESIDUE
)
and
await mapper.transcript_to_genomic_coordinates(
transcript="NM_003390.3",
exon_start=2,
exon_end=11,
residue_mode=ResidueMode.RESIDUE
)
will both return:
{
"gene": "WEE1",
"alt_ac": "NC_000011.10",
"exon_start": {
"ord": 1,
"tx_start_i": 829,
"tx_end_i": 1035,
"alt_start_i": 9575887,
"alt_end_i": 9576093,
},
"exon_start_offset": 0, # input.start - alt_start_i - 1
"exon_end": {
"ord": 10,
"tx_start_i": 2040,
"tx_end_i": 3359,
"alt_start_i": 9588448,
"alt_end_i": 9589767
},
"exon_end_offset": 0, # input.end - alt_end_i
"tx_ac": "NM_003390.3",
"strand": Strand.POSITIVE,
}
exon_start
and exon_end
are being pulled directly from UTA (no modifications).
Yes. For genomic -> exon mappings, I think we would need the following information as 3 properties, as described in https://github.com/cancervariants/fusor/pull/172#issuecomment-2257149730:
genomic coordinate target transcript 5' or 3' junction direction
exon_start_offset
and exon_end_offset
tx_ac
@ahwagner @jsstevenson Does this help resolve any confusion? What do you think of the output? We weren't sure on if transcript coordinates should be included.
No–I feel like there are two conversations going on here. @jsstevenson and I are asking why we need an exon model for genomic <-> exon coordinate conversion. This is independent of schema and/or what data standards we want to apply.
@ahwagner There was an earlier comment on assuming we were using VICC exon representation, so that's why we made a comment about the standards.
Please ignore the outdated models in the PR description and from the July 25 comment. The previous Exon
model was a quick thought. In our last comment, exon_start
and exon_end
replace this model.
The issue I'm trying to gain clarity on is what operation we are performing here. What is the use case for this feature? When you want to convert a transcript segment to two genomic coordinates on a chromosome, or when you want to convert two genomic coordinates on a chromosome to a transcript segment? What if you only have one coordinate? In most real-world cases, we will have one coordinate from each of the ends of two transcripts that form a junction, not a full segment. Therefore, I was expecting this method to convert a genomic coordinate and direction to an exon-based representation of a junction, and vice versa.
Do we want to meet to go over the proposed overhaul? (Conveniently I am out of the office today but probably could meet virtually at 2)
The issue I'm trying to gain clarity on is what operation we are performing here.
genomic_to_transcript_exon_coordinates
and transcript_to_genomic_coordinates
will return aligned transcript exon data (transcript / genomic positions for the start and end of the exon + corresponding exon number); this is also lifted to GRCh38 assembly.
Given same example above, the previous output would give:
{
"chr": "NC_000011.10",
"gene": "WEE1",
"start": 9575887,
"exon_start": 2,
"exon_start_offset": -205,
"end": 9589767,
"exon_end": 11,
"exon_end_offset": 1318,
"transcript": "NM_003390.3",
"strand": Strand.POSITIVE,
}
This is the suggested output:
{
"gene": "WEE1",
"alt_ac": "NC_000011.10",
"exon_start": {
"ord": 1,
"tx_start_i": 829,
"tx_end_i": 1035,
"alt_start_i": 9575887,
"alt_end_i": 9576093,
},
"exon_start_offset": 0, # input.start - alt_start_i - 1
"exon_end": {
"ord": 10,
"tx_start_i": 2040,
"tx_end_i": 3359,
"alt_start_i": 9588448,
"alt_end_i": 9589767
},
"exon_end_offset": 0, # input.end - alt_end_i
"tx_ac": "NM_003390.3",
"strand": Strand.POSITIVE,
}
Things to point out:
exon_start
and exon_end
now correspond to exon_start.ord
and exon_end.ord
. This is 0 based since that is what UTA uses. We will provide documentation on this that we are pulling directly from UTA and not modifying the response.chr
renamed to alt_ac
to be consistent with UTAstart
and end
can now be calculated from alt_start_i
and alt_end_i
with exon_start_offset
and exon_end_offset
. The previous example is kind of weird when it comes to 0 vs 1 based and @jarbesfeld pointed out we were calculating offsets wrong (see also #332). The suggested approach is nice because it uses inter-residue directly from UTA.What is the use case for this feature?
Some use cases: I have genomic position(s). I want to know: What exon does this genomic position occur on if aligned to a given transcript? I have an exon number. I want to know: What are the genomic positions for the start and end of this exon?
When you want to convert a transcript segment to two genomic coordinates on a chromosome, or when you want to convert two genomic coordinates on a chromosome to a transcript segment?
@jarbesfeld : This does allow both. That's why we have the genomic_to_transcript_exon_coordinates
and transcript_to_genomic_coordinates
methods.
What if you only have one coordinate?
genomic_to_transcript_exon_coordinates
: You can provide start or end, rather than both
transcript_to_genomic_coordinates
: You can provide exon_start or exon_end, rather than both
Therefore, I was expecting this method to convert a genomic coordinate and direction to an exon-based representation of a junction, and vice versa.
@jarbesfeld : Our example output has this since we have the exon number + offset, so it's a question of how to structure the data.
I'm going to close this for now until CIViC team confirms if this is actually needed. I will make new issues for changes requested by @ahwagner
Feature description
We should be returning both start and end coordinates for exons.
Use case
Griffith lab would like to use
ExonGenomicCoordsMapper
via Fusion Builder API and have said they need both coordinates.Proposed solution
GenomicData
to include start/end for both start_exon and end_exontranscript_to_genomic_coordinates
andgenomic_to_transcript_exon_coordinates
will use this updated modelAlternatives considered
No response
Implementation details
Thinking of updating the structure to something like this:
Potential Impact
No response
Additional context
No response
Contribution
Yes, I can create a PR for this feature.