Closed deannachurch closed 7 months ago
UTA needs an update. The most recent version is uta_20180821. Although NM_007194.4 is in UTA, its GRCh38 alignment is not. This typically means that the alignment wasn't in the gff3 files at the time of the snapshot, or that it failed the alignment criteria and was rejected. The criteria were selected with input from Terence M. in an effort to match alignment filtering at NCBI. NM_007194.4 became current on June 13, 2018, so perhaps the alignments didn't exist yet in Aug 2018.
I am looking for funding to automate the construction of UTA so that it doesn't fall so far behind.
Additional comments:
1) UTA does contain the alignment of .3 to 38. So, if you know that the exon structures and alignments are consistent in this exon, you can probably get away with using .3.
2) You can see available transcripts and alignments like this:
$ export PGPASSWORD=uta_public
$ psql -h uta.biocommons.org -d uta -U uta_public
uta_public@uta/uta=> set search_path = uta_20180821;
uta_public@uta/uta=> select tx_ac,alt_ac,alt_aln_method from exon_set where tx_ac ~'^NM_007194' order by 1,2;
┌─────────────┬──────────────┬────────────────┐
│ tx_ac │ alt_ac │ alt_aln_method │
├─────────────┼──────────────┼────────────────┤
│ NM_007194.3 │ AC_000154.1 │ splign │
│ NM_007194.3 │ NC_000022.10 │ splign │
│ NM_007194.3 │ NC_000022.10 │ blat │
│ NM_007194.3 │ NC_000022.11 │ splign │
│ NM_007194.3 │ NC_018933.2 │ splign │
│ NM_007194.3 │ NG_008150.1 │ splign │
│ NM_007194.3 │ NM_007194.3 │ transcript │
│ NM_007194.4 │ NC_000022.10 │ splign │
│ NM_007194.4 │ NM_007194.4 │ transcript │
└─────────────┴──────────────┴────────────────┘
(9 rows)
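The same check can be done in code once the rows are in hand. This is a minimal sketch in Python, assuming the query output above has been copied into a list of tuples; the helper name `versions_aligned_to` is made up for illustration and is not part of UTA or hgvs.

```python
# Rows copied from the exon_set query output above: (tx_ac, alt_ac, alt_aln_method).
ROWS = [
    ("NM_007194.3", "AC_000154.1", "splign"),
    ("NM_007194.3", "NC_000022.10", "splign"),
    ("NM_007194.3", "NC_000022.10", "blat"),
    ("NM_007194.3", "NC_000022.11", "splign"),
    ("NM_007194.3", "NC_018933.2", "splign"),
    ("NM_007194.3", "NG_008150.1", "splign"),
    ("NM_007194.3", "NM_007194.3", "transcript"),
    ("NM_007194.4", "NC_000022.10", "splign"),
    ("NM_007194.4", "NM_007194.4", "transcript"),
]


def versions_aligned_to(rows, alt_ac, method="splign"):
    """Return transcript accessions aligned to alt_ac by method, ordered by version."""
    hits = {tx for tx, alt, aln in rows if alt == alt_ac and aln == method}
    # Sort numerically on the version suffix so that .10 would sort after .9.
    return sorted(hits, key=lambda ac: int(ac.rsplit(".", 1)[1]))


# NC_000022.11 is chromosome 22 in GRCh38; NC_000022.10 is chromosome 22 in GRCh37.
print("GRCh38:", versions_aligned_to(ROWS, "NC_000022.11"))
print("GRCh37:", versions_aligned_to(ROWS, "NC_000022.10"))
```

With these rows, only NM_007194.3 comes back for NC_000022.11, which matches the missing GRCh38 alignment for .4.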
Hi Reece, Thanks for the update. It turns out my bigger problem is that I need to project alignments onto NM_001257387.2, which does not seem to be in 37; only .1 is in 37. These two seem a bit different, at least based on length (.1 is 1976 bases and .2 is 1958 bases). Is there another source I can get the UTA updated from? Is it difficult to install UTA and then add information to it? Thanks for your help- I appreciate this is unfunded but it is super valuable. If there is anything I can do to help (letters, etc) please let me know.
best, -deanna
Updating UTA is a pain right now, which is why it's languished (much to my chagrin). I wouldn't wish that process on anyone. (However, instructions do exist if you're feeling intrepid. It refers to hosts within Invitae, but the process would be the same for your own installations.)
As for the offer of help, thanks! I'll follow up by email.
Thanks Reece- appreciate the rapid response on the update.
Note to self for future UTA update: I had hoped to use the gff3 files as-is for UTA cigar strings. This won't be possible because the gff3 files don't denote mismatches, which hgvs uses to correct for reference sequence differences.
Example: GCF_000001405.28_knownrefseq_alignments.gff3 (mirrored on 2019-09-05) contains:
NC_000010.11 RefSeq cDNA_match 87863438 87864548 1104.71 + . ID=d73f8942-0138-46b9-8e95-56e7ebc1c240;Target=NM_000314.6 1 1110 +;gap_count=1;identity=0.99977;idty=0.9982;num_ident=8700;num_mismatch=1;pct_coverage=100;pct_identity_gap=99.977;pct_identity_ungap=99.9885;Gap=M666 D1 M444
UTA contains the cigar string M666 I1 39=1X404=. The I/D swap is because UTA is transcript-centric. The more interesting difference is the 39=1X404= / M444 difference. The length is the same, but the UTA alignment correctly picked up that there is a mismatch. Note that NCBI's gff3 shows num_mismatch=1.
The upshot is that UTA will need to continue aligning regions in order to pick up mismatches.
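To make the comparison concrete, here is a rough Python sketch that parses the gff3 Gap attribute from the example, swaps I and D to get the transcript-centric view, and compares the final M444 block with the 39=1X404= run. All function names are hypothetical; this is not how UTA builds its cigars, only an illustration of why the gff3 string alone cannot locate the mismatch.

```python
import re


def parse_gff3_gap(gap):
    """Parse a GFF3 Gap attribute such as 'M666 D1 M444' into (op, length) pairs."""
    ops = []
    for token in gap.split():
        match = re.fullmatch(r"([MIDFR])(\d+)", token)
        if match is None:
            raise ValueError(f"unrecognized Gap token: {token!r}")
        ops.append((match.group(1), int(match.group(2))))
    return ops


def swap_reference(ops):
    """Swap I and D, i.e. describe the same alignment from the other sequence's side."""
    swap = {"I": "D", "D": "I"}
    return [(swap.get(op, op), length) for op, length in ops]


def parse_extended_cigar(cigar):
    """Parse a length-then-op cigar such as '39=1X404=' into (op, length) pairs."""
    return [(op, int(length)) for length, op in re.findall(r"(\d+)([=XMIDN])", cigar)]


gff3_ops = parse_gff3_gap("M666 D1 M444")      # Gap attribute from the gff3 line
transcript_view = swap_reference(gff3_ops)     # M666 I1 M444
uta_tail = parse_extended_cigar("39=1X404=")   # UTA's version of the last block

tail_length = sum(length for _, length in uta_tail)
mismatches = sum(length for op, length in uta_tail if op == "X")
print(transcript_view)          # [('M', 666), ('I', 1), ('M', 444)]
print(tail_length, mismatches)  # 444 1
```

Both forms cover 444 bases, but only the =/X form says where the single mismatch falls; M covers matches and mismatches alike.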
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been stalled for 7 days with no activity.
Hi- I'm doing some variant mapping. Starting with this:
I get: NM_007194.4(CHEK2):c.1611T>A 1611 T
All is good. I can get the genomic location on GRCh37:
NC_000022.10:g.29083906A>T
But when trying to get the location on GRCh38
I get: HGVSDataNotAvailableError: No alignments for NM_007194.4 in GRCh38 using splign
Which is surprising as NM_007194.4 is still the reference- and is in fact the MANE transcript. As I'm trying to go back and forth between some RefSeq and Ensembl transcripts based on MANE, GRCh38 is preferable- though I will try with GRCh37 for now.
thanks!
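For reference, the "try GRCh38, fall back to GRCh37" approach described above could be sketched in Python roughly as follows. The helper names `c_to_g_first_available` and `build_hgvs_mappers` are hypothetical. The second function assumes the hgvs package is installed and a UTA instance is reachable, and it uses `hgvs.dataproviders.uta.connect()` and `hgvs.assemblymapper.AssemblyMapper` with the splign alignment method named in the error message.

```python
def c_to_g_first_available(var_c, mappers, missing_data_error=Exception):
    """Try each (assembly, mapper) pair in order.

    Returns (assembly, var_g, failures); assembly and var_g are None if every
    mapper raised missing_data_error. failures maps assembly name to message.
    """
    failures = {}
    for assembly, mapper in mappers:
        try:
            return assembly, mapper.c_to_g(var_c), failures
        except missing_data_error as exc:
            failures[assembly] = str(exc)
    return None, None, failures


def build_hgvs_mappers(assemblies=("GRCh38", "GRCh37")):
    """Build one hgvs AssemblyMapper per assembly (needs hgvs and UTA access)."""
    # Imported lazily so the helper above stays usable without hgvs installed.
    import hgvs.assemblymapper
    import hgvs.dataproviders.uta

    hdp = hgvs.dataproviders.uta.connect()
    return [
        (
            name,
            hgvs.assemblymapper.AssemblyMapper(
                hdp, assembly_name=name, alt_aln_method="splign"
            ),
        )
        for name in assemblies
    ]


# Intended use (not executed here because it needs network access):
#   import hgvs.parser
#   from hgvs.exceptions import HGVSDataNotAvailableError
#   var_c = hgvs.parser.Parser().parse_hgvs_variant("NM_007194.4:c.1611T>A")
#   assembly, var_g, failures = c_to_g_first_available(
#       var_c, build_hgvs_mappers(), HGVSDataNotAvailableError)
```

Passing the exception class in as a parameter keeps the fallback logic independent of hgvs, so it can be exercised with stub mappers.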