GenomicMedLab / cool-seq-tool

https://coolseqtool.readthedocs.io
MIT License
4 stars 0 forks source link

MANE transcript data was not found for the gene C1orf159 #367

Closed jarbesfeld closed 1 month ago

jarbesfeld commented 2 months ago

Describe the bug

When using the genomic_to_tx_segment method in cool-seq-tool, it returned an error stating that MANE transcript data could not be found for the gene C1orf159. However, there is a MANE transcript: https://www.ncbi.nlm.nih.gov/nuccore/NM_017891.

Steps to reproduce

await self.fusor.transcript_segment_element(
            tx_to_genomic_coords=False,
            chromosome=1,
            seg_end_genomic=1098742,
            gene="C1orf159",
            get_nearest_transcript_junction=True,
        )

Expected behavior

The transcript NM_017891.5 would be selected

Current behavior

The following statement is outputted: Unable to get MANE Transcript data for gene: C1ORF159

Possible reason(s)

No response

Suggested fix

No response

Branch, commit, and/or version

Main branch

Screenshots

No response

Environment details

Mac M1

Additional details

No response

Contribution

None

korikuzma commented 2 months ago

@jarbesfeld can you let me know what is passed to cool-seq-tool?

korikuzma commented 2 months ago

I think it's just g -> tx seg with the values not changing, but want to confirm.

jarbesfeld commented 2 months ago

@korikuzma I think the input that is passed to cool-seq-tool is:

genomic_to_tx_segment(
        chromosome = 1,
        seg_end_genomic=1098742,
        gene="C1orf159",
        get_nearest_transcript_junction=True,
)
korikuzma commented 2 months ago

I will investigate this next sprint

korikuzma commented 1 month ago

@jarbesfeld The issue is that we always uppercase the input gene.

We do that in 3 places in ExonGenomicCoordsMapper:

When removing these three instances:

r = await mapper.genomic_to_tx_segment(
  chromosome="1",
  seg_end_genomic=1098742,
  gene="C1orf159",
  get_nearest_transcript_junction=True
)
r.model_dump(exclude_none=True)

returns

{
  "gene": "C1orf159",
  "genomic_ac": "NC_000001.11",
  "tx_ac": "NM_017891.5",
  "seg_end": {
    "exon_ord": 0,
    "offset": 17317,
    "genomic_location": {
      "type": "SequenceLocation",
      "sequenceReference": {
        "type": "SequenceReference",
        "refgetAccession": "SQ.Ya6Rs7DHhDeg7YaOSg1EoNi3U_nQ9SvO"
      },
      "start": 1098742
    }
  },
  "errors": []
}

So we have two options: 1) add back gene-normalizer to always get HGNC symbol in public methods 2) assume gene argument is a valid, case-sensitive HGNC gene symbol. It's been kind of nice not having gene-normalizer in this tool (one less dependency to manage and makes it so that Cool-Seq-Tool is not tightly coupled to the gene-normalizer), so I'm leaning towards option 2. Downstream apps, such as fusion-curation and variation-normalizer should do this gene normalization. What are your thoughts?

jsstevenson commented 1 month ago

FWIW, Alex and @anastasiabratulin looked at this a while ago and found that the case of "orf" is the only* time you have lower case alphabetic characters in gene symbols. A method that upper-cases unless the input matches that pattern could solve this problem.

*(I think that pattern should be right, the "orf" stands for "open read frame" and I think the numbers on each side refer to something like location and size)

korikuzma commented 1 month ago

@jsstevenson Ah, I did not know that.

A method that upper-cases unless the input matches that pattern could solve this problem.

A case where this would fail: MORF4 where the user provided morf4. I thought maybe orf would require more than one digit, but found previous symbol C9orf2.

Update: I guess we could include the prefix so the regex could look like C\w+orf\d+.

So I think it's best (and easiest) to leave it to downstream apps to handle gene normalization and have Cool-Seq-Tool expect a valid, case-sensitive HGNC gene symbol. @jsstevenson and @jarbesfeld Thoughts?

jsstevenson commented 1 month ago

A case where this would fail: MORF4

Oh, yeah sorry, meant to clarify that I think it only happens in cases like r"\dorf\d". It wouldn't shock me if it was even more specific to r"^C\d+orf\d+$", should be fairly easy to check. Anastasia has spent a fair amount of time looking at this, I think.

jsstevenson commented 1 month ago

Oh, and further caveat that this is only for human genes, but I think that's fine for UTA?

jarbesfeld commented 1 month ago

@korikuzma @jsstevenson I agree that while it seems to be an easy check that could be added in cool-seq-tool, it would be a benefit to potentially de-couple cool-seq-tool from gene-normalizer.

jarbesfeld commented 1 month ago

Thinking it over I would support having cool-seq-tool expect a processed gene symbol. There are other applications, like dcd-mapping, which do their own gene processing, so I think that step could be done in those implementations.

korikuzma commented 1 month ago

@jsstevenson Does this work for you? Or is there something we're missing?

jsstevenson commented 1 month ago

I guess I'm not totally clear what the cost is -- presumably we'll need to do this exact string processing step (something like the below) in every downstream use case, so I don't know why we wouldn't do it here?

In [75]: def handle_case(gene):
    ...:     pattern = re.compile(r"(.*)C(\d+|X|Y)ORF(\d+)(.*)", re.IGNORECASE)
    ...:     match = pattern.findall(gene)
    ...:     if match:
    ...:         return f"{match[0][0].upper()}C{match[0][1].upper()}orf{match[0][2].upper()}{match[0][3].upper()}"
    ...:     else:
    ...:         return gene.upper()

alternatively -- why do we need to case-aware match the gene column in the sql query? As far as I can tell, there's a single case in tx_exon_aln_v where it could hypothetically make a difference, but it's for alignments using the genebuild method so we'd never fetch them anyway.

korikuzma commented 1 month ago

For downstream apps such as MAVE, Variation-Normalizer, and Fusion-Curation gene queries are passed to the gene-normalizer (and if not, I think they should be) to get the HGNC gene symbol. If we're already making this call and passing this value to Cool-Seq-Tool, it seems duplicative to make this check.

alternatively -- why do we need to case-aware match the gene column in the sql query? As far as I can tell, there's a single case in tx_exon_aln_v where it could hypothetically make a difference, but it's for alignments using the genebuild method so we'd never fetch them anyway.

Ah, so the issue isn't with UTA but rather the NCBI MANE summary data

For this particular issue, a simple solution (if we don't want to assume proper case) would be to change: https://github.com/GenomicMedLab/cool-seq-tool/blob/main/src/cool_seq_tool/sources/mane_transcript_mappings.py#L64

to

data = self.df.filter(pl.col("symbol").str.to_uppercase() == gene_symbol.upper())
jsstevenson commented 1 month ago

Ah, so the issue isn't with UTA but rather the NCBI MANE summary data

d'oh, right

For this particular issue, a simple solution

Yeah, or we could just upcase that column in _load_mane_transcript_data... this is a pretty small file so that should be a cheap operation?

korikuzma commented 1 month ago

@jarbesfeld we had talked previously about you tackling this issue. Does this still work for you?

jarbesfeld commented 1 month ago

Yes, I can tackle this issue