Closed jarbesfeld closed 1 month ago
@jarbesfeld can you let me know what is passed to cool-seq-tool?
I think it's just g -> tx seg with the values not changing, but want to confirm.
@korikuzma I think the input that is passed to cool-seq-tool is:
genomic_to_tx_segment(
chromosome = 1,
seg_end_genomic=1098742,
gene="C1orf159",
get_nearest_transcript_junction=True,
)
I will investigate this next sprint
@jarbesfeld The issue is that we always uppercase the input gene.
We do that in 3 places in ExonGenomicCoordsMapper
:
When removing these three instances:
r = await mapper.genomic_to_tx_segment(
chromosome="1",
seg_end_genomic=1098742,
gene="C1orf159",
get_nearest_transcript_junction=True
)
r.model_dump(exclude_none=True)
returns
{
"gene": "C1orf159",
"genomic_ac": "NC_000001.11",
"tx_ac": "NM_017891.5",
"seg_end": {
"exon_ord": 0,
"offset": 17317,
"genomic_location": {
"type": "SequenceLocation",
"sequenceReference": {
"type": "SequenceReference",
"refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"
},
"start": 1098742
}
},
"errors": []
}
So we have two options: 1) add back gene-normalizer to always get HGNC symbol in public methods 2) assume gene argument is a valid, case-sensitive HGNC gene symbol. It's been kind of nice not having gene-normalizer in this tool (one less dependency to manage and makes it so that Cool-Seq-Tool is not tightly coupled to the gene-normalizer), so I'm leaning towards option 2. Downstream apps, such as fusion-curation and variation-normalizer should do this gene normalization. What are your thoughts?
FWIW, Alex and @anastasiabratulin looked at this a while ago and found that the case of "
*(I think that pattern should be right, the "orf" stands for "open read frame" and I think the numbers on each side refer to something like location and size)
@jsstevenson Ah, I did not know that.
A method that upper-cases unless the input matches that pattern could solve this problem.
A case where this would fail: MORF4 where the user provided morf4
. I thought maybe orf would require more than one digit, but found previous symbol C9orf2.
Update: I guess we could include the prefix so the regex could look like C\w+orf\d+
.
So I think it's best (and easiest) to leave it to downstream apps to handle gene normalization and have Cool-Seq-Tool expect a valid, case-sensitive HGNC gene symbol. @jsstevenson and @jarbesfeld Thoughts?
A case where this would fail: MORF4
Oh, yeah sorry, meant to clarify that I think it only happens in cases like r"\dorf\d"
. It wouldn't shock me if it was even more specific to r"^C\d+orf\d+$"
, should be fairly easy to check. Anastasia has spent a fair amount of time looking at this, I think.
Oh, and further caveat that this is only for human genes, but I think that's fine for UTA?
@korikuzma @jsstevenson I agree that while it seems to be an easy check that could be added in cool-seq-tool, it would be a benefit to potentially de-couple cool-seq-tool from gene-normalizer.
Thinking it over I would support having cool-seq-tool expect a processed gene symbol. There are other applications, like dcd-mapping, which do their own gene processing, so I think that step could be done in those implementations.
@jsstevenson Does this work for you? Or is there something we're missing?
I guess I'm not totally clear what the cost is -- presumably we'll need to do this exact string processing step (something like the below) in every downstream use case, so I don't know why we wouldn't do it here?
In [75]: def handle_case(gene):
...: pattern = re.compile(r"(.*)C(\d+|X|Y)ORF(\d+)(.*)", re.IGNORECASE)
...: match = pattern.findall(gene)
...: if match:
...: return f"{match[0][0].upper()}C{match[0][1].upper()}orf{match[0][2].upper()}{match[0][3].upper()}"
...: else:
...: return gene.upper()
alternatively -- why do we need to case-aware match the gene column in the sql query? As far as I can tell, there's a single case in tx_exon_aln_v
where it could hypothetically make a difference, but it's for alignments using the genebuild
method so we'd never fetch them anyway.
For downstream apps such as MAVE, Variation-Normalizer, and Fusion-Curation gene queries are passed to the gene-normalizer (and if not, I think they should be) to get the HGNC gene symbol. If we're already making this call and passing this value to Cool-Seq-Tool, it seems duplicative to make this check.
alternatively -- why do we need to case-aware match the gene column in the sql query? As far as I can tell, there's a single case in tx_exon_aln_v where it could hypothetically make a difference, but it's for alignments using the genebuild method so we'd never fetch them anyway.
Ah, so the issue isn't with UTA but rather the NCBI MANE summary data
For this particular issue, a simple solution (if we don't want to assume proper case) would be to change: https://github.com/GenomicMedLab/cool-seq-tool/blob/main/src/cool_seq_tool/sources/mane_transcript_mappings.py#L64
to
data = self.df.filter(pl.col("symbol").str.to_uppercase() == gene_symbol.upper())
Ah, so the issue isn't with UTA but rather the NCBI MANE summary data
d'oh, right
For this particular issue, a simple solution
Yeah, or we could just upcase that column in _load_mane_transcript_data
... this is a pretty small file so that should be a cheap operation?
@jarbesfeld we had talked previously about you tackling this issue. Does this still work for you?
Yes, I can tackle this issue
Describe the bug
When using the
genomic_to_tx_segment
method in cool-seq-tool, it returned an error stating that MANE transcript data could not be found for the gene C1orf159. However, there is a MANE transcript: https://www.ncbi.nlm.nih.gov/nuccore/NM_017891.Steps to reproduce
Expected behavior
The transcript NM_017891.5 would be selected
Current behavior
The following statement is outputted:
Unable to get MANE Transcript data for gene: C1ORF159
Possible reason(s)
No response
Suggested fix
No response
Branch, commit, and/or version
Main branch
Screenshots
No response
Environment details
Mac M1
Additional details
No response
Contribution
None