GRCh38 splign alignments available?

biocommons / uta

Universal Transcript Archive: comprehensive genome-transcript alignments; multiple transcript sources, versions, and alignment methods; available as a docker image

Apache License 2.0


GRCh38 splign alignments available? #223

Closed deannachurch closed 7 months ago

deannachurch commented 4 years ago

Hi, I'm doing some variant mapping. Starting with this:

import hgvs.parser
hp = hgvs.parser.Parser()  # HGVS parser used below
v = hp.parse_hgvs_variant("NM_007194.4(CHEK2):c.1611T>A")
print(v)
print(v.posedit.pos.start)
print(v.posedit.edit.ref)

I get: NM_007194.4(CHEK2):c.1611T>A 1611 T

All is good. I can get the genomic location on GRCh37:

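import hgvs.assemblymapper
import hgvs.dataproviders.uta
hdp = hgvs.dataproviders.uta.connect()  # data provider backed by UTA
am37 = hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name='GRCh37', alt_aln_method='splign', replace_reference=True)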
var_g37=am37.c_to_g(v)
print(var_g37)

NC_000022.10:g.29083906A>T

But when trying to get the location on GRCh38

am38 = hgvs.assemblymapper.AssemblyMapper(hdp, assembly_name='GRCh38', alt_aln_method='splign', replace_reference=True)
var_g38=am38.c_to_g(v)
print(var_g38)

I get: HGVSDataNotAvailableError: No alignments for NM_007194.4 in GRCh38 using splign

This is surprising, since NM_007194.4 is still the current reference and is in fact the MANE transcript. I'm trying to go back and forth between some RefSeq and Ensembl transcripts based on MANE, so GRCh38 is preferable, though I will try with GRCh37 for now.

thanks!
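A minimal sketch of the "try GRCh38, fall back to GRCh37" approach described above. The helper name `map_first_available` and its parameters are hypothetical and not part of hgvs; it only assumes each mapper exposes a callable that raises an exception when no alignment exists.

```python
def map_first_available(variant, mappers, missing_exc=Exception):
    """Try each (label, c_to_g) pair in order and return the first success.

    `mappers` is a list of (label, callable) pairs; `missing_exc` is the
    exception type that signals "no alignment available" (an assumption:
    the caller decides which exception counts as missing data).
    """
    failures = []
    for label, c_to_g in mappers:
        try:
            return label, c_to_g(variant)
        except missing_exc as exc:
            # Remember why this assembly failed, then try the next one.
            failures.append(f"{label}: {exc}")
    raise LookupError("no mapping available; " + "; ".join(failures))
```

With the objects from this post, the call would presumably look like `map_first_available(v, [("GRCh38", am38.c_to_g), ("GRCh37", am37.c_to_g)], missing_exc=hgvs.exceptions.HGVSDataNotAvailableError)`, so the caller always knows which assembly the result refers to.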

reece commented 4 years ago

UTA needs an update. The most recent version is uta_20180821. Although NM_007194.4 is in UTA, the 38 alignment for it is not. This typically means that it wasn't in the gff3 files at the time of the snapshot, or that it failed alignment criteria and was rejected. The criteria were selected with input from Terence M. in an effort to match alignment filtering at NCBI. NM_007194.4 became current on June 13, 2018, so perhaps the alignments didn't exist in Aug 2018.

I am looking for funding to automate the construction of UTA so that it doesn't fall so far behind.

reece commented 4 years ago

Additional comments:

1) UTA does contain the alignment of .3 to 38. So, if you know that the exon structures and alignments are consistent in this exon, you can probably get away with using .3.

2) You can see available transcripts and alignments like this:

$ export PGPASSWORD=uta_public
$ psql -h uta.biocommons.org -d uta -U uta_public
uta_public@uta/uta=> set search_path = uta_20180821;
uta_public@uta/uta=> select tx_ac,alt_ac,alt_aln_method from exon_set where tx_ac ~'^NM_007194' order by 1,2;
┌─────────────┬──────────────┬────────────────┐
│    tx_ac    │    alt_ac    │ alt_aln_method │
├─────────────┼──────────────┼────────────────┤
│ NM_007194.3 │ AC_000154.1  │ splign         │
│ NM_007194.3 │ NC_000022.10 │ splign         │
│ NM_007194.3 │ NC_000022.10 │ blat           │
│ NM_007194.3 │ NC_000022.11 │ splign         │
│ NM_007194.3 │ NC_018933.2  │ splign         │
│ NM_007194.3 │ NG_008150.1  │ splign         │
│ NM_007194.3 │ NM_007194.3  │ transcript     │
│ NM_007194.4 │ NC_000022.10 │ splign         │
│ NM_007194.4 │ NM_007194.4  │ transcript     │
└─────────────┴──────────────┴────────────────┘
(9 rows)
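The fallback in point 1 can be sketched in Python once the query rows are in hand. This is only an illustration: `ROWS` is copied from the output above, and `versions_aligned_to` is a hypothetical helper, not something UTA or hgvs provides.

```python
# Rows copied from the exon_set query above: (tx_ac, alt_ac, alt_aln_method).
ROWS = [
    ("NM_007194.3", "AC_000154.1", "splign"),
    ("NM_007194.3", "NC_000022.10", "splign"),
    ("NM_007194.3", "NC_000022.10", "blat"),
    ("NM_007194.3", "NC_000022.11", "splign"),
    ("NM_007194.3", "NC_018933.2", "splign"),
    ("NM_007194.3", "NG_008150.1", "splign"),
    ("NM_007194.3", "NM_007194.3", "transcript"),
    ("NM_007194.4", "NC_000022.10", "splign"),
    ("NM_007194.4", "NM_007194.4", "transcript"),
]


def versions_aligned_to(rows, tx_base, alt_ac, method="splign"):
    """Return versions of `tx_base` aligned to `alt_ac` by `method`, newest first."""
    versions = []
    for tx_ac, row_alt_ac, row_method in rows:
        base, _, version = tx_ac.partition(".")
        if base == tx_base and row_alt_ac == alt_ac and row_method == method:
            versions.append(int(version))
    return [f"{tx_base}.{v}" for v in sorted(versions, reverse=True)]


grch38_hits = versions_aligned_to(ROWS, "NM_007194", "NC_000022.11")
grch37_hits = versions_aligned_to(ROWS, "NM_007194", "NC_000022.10")
```

Here `grch38_hits` contains only the .3 version, which is the gap this issue is about, while `grch37_hits` contains both .4 and .3.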

deannachurch commented 4 years ago

Hi Reece, thanks for the update. It turns out my bigger problem is that I need to project alignments onto NM_001257387.2, which does not seem to be in 37; only .1 is. These two seem a bit different, at least based on length (.1 is 1976 bases and .2 is 1958 bases). Is there another source I can get the UTA updated from? Is it difficult to install UTA and then add information to it?

Thanks for your help. I appreciate this is unfunded, but it is super valuable. If there is anything I can do to help (letters, etc.), please let me know.

best, -deanna

reece commented 4 years ago

Updating UTA is a pain right now, which is why it's languished (much to my chagrin). I wouldn't wish that process on anyone. (However, instructions do exist if you're feeling intrepid. They refer to hosts within Invitae, but the process would be the same for your own installation.)

As for the offer of help, thanks! I'll follow up by email.

deannachurch commented 4 years ago

Thanks Reece- appreciate the rapid response on the update.

reece commented 4 years ago

Note to self for future UTA update: I had hoped to use the gff3 files as-is for UTA cigar strings. This won't be possible because the gff3 files don't denote mismatches, which hgvs uses to correct for reference sequence differences.

Example: GCF_000001405.28_knownrefseq_alignments.gff3 (mirrored on 2019-09-05) contains:

NC_000010.11    RefSeq  cDNA_match  87863438    87864548    1104.71 +   .   ID=d73f8942-0138-46b9-8e95-56e7ebc1c240;Target=NM_000314.6 1 1110 +;gap_count=1;identity=0.99977;idty=0.9982;num_ident=8700;num_mismatch=1;pct_coverage=100;pct_identity_gap=99.977;pct_identity_ungap=99.9885;Gap=M666 D1 M444
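For reference, a rough sketch of pulling the relevant fields out of such a line. It assumes the real file is tab-separated (as GFF3 requires) even though the columns appear space-separated here, and `parse_gff3_alignment` is a hypothetical helper name.

```python
def parse_gff3_alignment(line):
    """Split one cDNA_match GFF3 line into coordinates and selected attributes."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, _source, _type, start, end, _score, strand, _phase, attr_text = fields
    # Column 9 is a ';'-separated list of key=value pairs.
    attrs = dict(item.split("=", 1) for item in attr_text.split(";") if item)
    # Target is "<accession> <start> <end> <strand>".
    tx_ac, tx_start, tx_end, tx_strand = attrs["Target"].split()
    return {
        "alt_ac": seqid,
        "alt_start": int(start),
        "alt_end": int(end),
        "alt_strand": strand,
        "tx_ac": tx_ac,
        "tx_start": int(tx_start),
        "tx_end": int(tx_end),
        "tx_strand": tx_strand,
        "gap": attrs.get("Gap", "").split(),
        "num_mismatch": int(attrs.get("num_mismatch", 0)),
    }


line = "\t".join([
    "NC_000010.11", "RefSeq", "cDNA_match", "87863438", "87864548", "1104.71", "+", ".",
    "ID=d73f8942-0138-46b9-8e95-56e7ebc1c240;Target=NM_000314.6 1 1110 +;"
    "gap_count=1;identity=0.99977;idty=0.9982;num_ident=8700;num_mismatch=1;"
    "pct_coverage=100;pct_identity_gap=99.977;pct_identity_ungap=99.9885;"
    "Gap=M666 D1 M444",
])
rec = parse_gff3_alignment(line)
```

The parsed record reports one mismatch via `num_mismatch`, but its `gap` list carries only M and D operations, so the position of that mismatch is not recoverable from this line alone.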

UTA contains the cigar string M666 I1 39=1X404=. The I/D swap is because UTA is transcript-centric. The more interesting difference is 39=1X404= versus M444. The length is the same, but the UTA alignment correctly picked up that there is a mismatch. Note that NCBI's gff3 shows num_mismatch=1.

The upshot is that UTA will need to continue aligning regions in order to pick up mismatches.
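The comparison above can be checked with a small sketch. The helper names are hypothetical, and the parser only handles the two notations quoted in this comment (operation-first tokens such as M666, and count-first runs such as 39=1X404=); it is not UTA's actual code.

```python
import re


def parse_ops(cigar):
    """Parse space-separated tokens into a list of (op, length) pairs."""
    ops = []
    for token in cigar.split():
        m = re.fullmatch(r"([MIDX=])(\d+)", token)  # GFF3 Gap style, e.g. M666
        if m:
            ops.append((m.group(1), int(m.group(2))))
            continue
        if not re.fullmatch(r"(?:\d+[MIDX=])+", token):  # count-first, e.g. 39=1X404=
            raise ValueError(f"unrecognized CIGAR token: {token!r}")
        ops.extend((op, int(n)) for n, op in re.findall(r"(\d+)([MIDX=])", token))
    return ops


def swap_indels(ops):
    """Swap I and D to view the same alignment from the other sequence."""
    swap = {"I": "D", "D": "I"}
    return [(swap.get(op, op), n) for op, n in ops]


def collapse_to_m(ops):
    """Replace = and X with M and merge adjacent runs, discarding mismatch detail."""
    out = []
    for op, n in ops:
        op = "M" if op in ("=", "X") else op
        if out and out[-1][0] == op:
            out[-1] = (op, out[-1][1] + n)
        else:
            out.append((op, n))
    return out


def spans(ops):
    """Return (reference length, other-sequence length) covered by the ops."""
    ref_len = sum(n for op, n in ops if op in "M=XD")
    other_len = sum(n for op, n in ops if op in "M=XI")
    return ref_len, other_len


ncbi = parse_ops("M666 D1 M444")       # genome-centric, from the gff3 Gap attribute
uta = parse_ops("M666 I1 39=1X404=")   # transcript-centric, from UTA
same_shape = collapse_to_m(uta) == swap_indels(ncbi)
uta_mismatches = sum(n for op, n in uta if op == "X")
```

Collapsing UTA's string to M-only operations reproduces the gff3 Gap after the I/D swap, and both cover 1110 transcript bases and 1111 genomic bases. Only the UTA string records where the single mismatch falls.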
