biocommons / uta

Universal Transcript Archive: comprehensive genome-transcript alignments; multiple transcript sources, versions, and alignment methods; available as a docker image
Apache License 2.0

Enable custom transcripts and custom genome-transcript alignments #220

Closed reece closed 4 years ago

reece commented 5 years ago

It would be useful to enable users to create custom transcripts and custom genome-transcript alignments.

The obvious solution is to load transcripts directly into UTA. This approach is not ideal because it confounds the release process. A more flexible and scalable approach is to "layer" data sets, but this requires an API that federates the layered data.
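To make the "layering" idea concrete, here is a minimal sketch of a federated lookup, assuming each layer can be treated as a mapping from (transcript, reference, alignment method) to exon coordinates. Nothing here is an existing UTA API: `LayeredAlignments`, the `release` and `custom` layers, and the `CUSTOM_TX.1` accession are all hypothetical names used only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical key and value shapes; not part of UTA itself.
Key = Tuple[str, str, str]      # (tx_ac, alt_ac, alt_aln_method)
Exons = List[Tuple[int, int]]   # [(start_i, end_i), ...]


class LayeredAlignments:
    """Query several alignment layers in order; the first layer with a hit wins."""

    def __init__(self, *layers: Dict[Key, Exons]) -> None:
        self.layers = layers

    def get(self, key: Key) -> Optional[Exons]:
        for layer in self.layers:
            if key in layer:
                return layer[key]
        return None


# A stand-in for released UTA data (first exon only, for brevity).
release = {
    ("NM_001807.4", "NC_000009.11", "splign"): [(135937364, 135937455)],
}
# A stand-in for a user-supplied layer with a made-up custom transcript.
custom = {
    ("CUSTOM_TX.1", "NC_000009.11", "splign-manual"): [(100, 200)],
}

# Custom data is consulted first, then the release.
alignments = LayeredAlignments(custom, release)
from_custom = alignments.get(("CUSTOM_TX.1", "NC_000009.11", "splign-manual"))
from_release = alignments.get(("NM_001807.4", "NC_000009.11", "splign"))
missing = alignments.get(("NM_000000.0", "NC_000009.11", "splign"))
```

The point of the design is that custom data never has to be written into the released database; the cost is that every client must go through the federating layer.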

@andreasprlic and @melissacline

reece commented 4 years ago

Generalizing the UTA interface is out of scope for the Invitae work. Therefore, the scope of this task is limited to loading data into UTA. Creating a UTA REST interface will be picked up at another time.

reece commented 4 years ago

Closed in 7771502f8685b46a447b6b998d6dccd393723275.

QC Test: Andreas' list included a request to align NM_001807.4 to NC_000009.11 (GRCh37 chr 9). It turns out that this alignment is already in more recent UTA releases. I processed it anyway and used the existing data as a positive control/comparison. Those data are below. (tl;dr: genomic exon coordinates from splign-manual, splign, and blat agree exactly.)

┌─[ RECORD 1 ]───┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ hgnc           │ CEL                                                                                                                                                                                                                         │
│ cds_md5        │ 519aa42c4a2192f9ef015b75224d81f7                                                                                                                                                                                            │
│ es_fingerprint │ c16d68bfea4aa9c4f7876cff22afd5b9                                                                                                                                                                                            │
│ tx_ac          │ NM_001807.4                                                                                                                                                                                                                 │
│ alt_ac         │ NC_000009.11                                                                                                                                                                                                                │
│ alt_aln_method │ splign                                                                                                                                                                                                                      │
│ alt_strand     │ 1                                                                                                                                                                                                                           │
│ exon_set_id    │ 208709                                                                                                                                                                                                                      │
│ n_exons        │ 11                                                                                                                                                                                                                          │
│ se_i           │ 135937364,135937455;135939790,135939941;135940026,135940149;135940426,135940624;135941916,135942047;135942224,135942332;135942474,135942592;135944058,135944245;135944442,135944646;135945847,135946045;135946373,135947250 │
│ starts_i       │ {135937364,135939790,135940026,135940426,135941916,135942224,135942474,135944058,135944442,135945847,135946373}                                                                                                             │
│ ends_i         │ {135937455,135939941,135940149,135940624,135942047,135942332,135942592,135944245,135944646,135946045,135947250}                                                                                                             │
│ lengths        │ {91,151,123,198,131,108,118,187,204,198,877}                                                                                                                                                                                │
├─[ RECORD 2 ]───┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ hgnc           │ CEL                                                                                                                                                                                                                         │
│ cds_md5        │ 519aa42c4a2192f9ef015b75224d81f7                                                                                                                                                                                            │
│ es_fingerprint │ c16d68bfea4aa9c4f7876cff22afd5b9                                                                                                                                                                                            │
│ tx_ac          │ NM_001807.4                                                                                                                                                                                                                 │
│ alt_ac         │ NC_000009.11                                                                                                                                                                                                                │
│ alt_aln_method │ blat                                                                                                                                                                                                                        │
│ alt_strand     │ 1                                                                                                                                                                                                                           │
│ exon_set_id    │ 328036                                                                                                                                                                                                                      │
│ n_exons        │ 11                                                                                                                                                                                                                          │
│ se_i           │ 135937364,135937455;135939790,135939941;135940026,135940149;135940426,135940624;135941916,135942047;135942224,135942332;135942474,135942592;135944058,135944245;135944442,135944646;135945847,135946045;135946373,135947250 │
│ starts_i       │ {135937364,135939790,135940026,135940426,135941916,135942224,135942474,135944058,135944442,135945847,135946373}                                                                                                             │
│ ends_i         │ {135937455,135939941,135940149,135940624,135942047,135942332,135942592,135944245,135944646,135946045,135947250}                                                                                                             │
│ lengths        │ {91,151,123,198,131,108,118,187,204,198,877}                                                                                                                                                                                │
├─[ RECORD 3 ]───┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ hgnc           │ CEL                                                                                                                                                                                                                         │
│ cds_md5        │ 519aa42c4a2192f9ef015b75224d81f7                                                                                                                                                                                            │
│ es_fingerprint │ c16d68bfea4aa9c4f7876cff22afd5b9                                                                                                                                                                                            │
│ tx_ac          │ NM_001807.4                                                                                                                                                                                                                 │
│ alt_ac         │ NC_000009.11                                                                                                                                                                                                                │
│ alt_aln_method │ splign-manual                                                                                                                                                                                                               │
│ alt_strand     │ 1                                                                                                                                                                                                                           │
│ exon_set_id    │ 849201                                                                                                                                                                                                                      │
│ n_exons        │ 11                                                                                                                                                                                                                          │
│ se_i           │ 135937364,135937455;135939790,135939941;135940026,135940149;135940426,135940624;135941916,135942047;135942224,135942332;135942474,135942592;135944058,135944245;135944442,135944646;135945847,135946045;135946373,135947250 │
│ starts_i       │ {135937364,135939790,135940026,135940426,135941916,135942224,135942474,135944058,135944442,135945847,135946373}                                                                                                             │
│ ends_i         │ {135937455,135939941,135940149,135940624,135942047,135942332,135942592,135944245,135944646,135946045,135947250}                                                                                                             │
│ lengths        │ {91,151,123,198,131,108,118,187,204,198,877}                                                                                                                                                                                │
└────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
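For anyone who wants to repeat the comparison outside of psql, here is a minimal sketch that parses the `se_i` strings from the three records above and checks that they agree. The helper name `parse_se_i` is made up for this example; the coordinates are copied from the records.

```python
def parse_se_i(se_i: str) -> list:
    """Parse 'start,end;start,end;...' into a list of (start, end) tuples."""
    return [tuple(int(x) for x in pair.split(",")) for pair in se_i.split(";")]


# se_i value shown in all three records above.
SE_I = (
    "135937364,135937455;135939790,135939941;135940026,135940149;"
    "135940426,135940624;135941916,135942047;135942224,135942332;"
    "135942474,135942592;135944058,135944245;135944442,135944646;"
    "135945847,135946045;135946373,135947250"
)

# One entry per alt_aln_method, keyed as in the records.
exon_sets = {"splign": SE_I, "blat": SE_I, "splign-manual": SE_I}
parsed = {method: parse_se_i(s) for method, s in exon_sets.items()}

# Use splign as the reference and compare the other methods against it.
reference = parsed["splign"]
agree = all(exons == reference for exons in parsed.values())

# Exon lengths are end minus start for each (start, end) pair.
lengths = [end - start for start, end in reference]

print("methods agree:", agree)
print("n_exons:", len(reference))
print("lengths:", lengths)
```

The computed lengths should match the `lengths` column in the records, which is a quick sanity check that each pair is being read as (start, end).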

reece commented 4 years ago

See https://github.com/biocommons/uta/tree/master/loading/data/splign-manual for end result.