reece closed this issue 4 years ago
Generalizing the UTA interface is out of scope for the Invitae work. Therefore, the scope of this task is limited to loading data into UTA. Creating a UTA REST interface will be picked up at another time.
Closed in 7771502f8685b46a447b6b998d6dccd393723275.
QC Test: Andreas' list included a request to align NM_001807.4 to NC_000009.11 (GRCh37 chr 9). It turns out that this alignment is already in more recent UTA releases. I processed it anyway and used the existing data as a positive control for comparison. Those data are below. (tl;dr: genomic exon coordinates from splign-manual, splign, and blat agree exactly.)
┌─[ RECORD 1 ]───┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ hgnc │ CEL │
│ cds_md5 │ 519aa42c4a2192f9ef015b75224d81f7 │
│ es_fingerprint │ c16d68bfea4aa9c4f7876cff22afd5b9 │
│ tx_ac │ NM_001807.4 │
│ alt_ac │ NC_000009.11 │
│ alt_aln_method │ splign │
│ alt_strand │ 1 │
│ exon_set_id │ 208709 │
│ n_exons │ 11 │
│ se_i │ 135937364,135937455;135939790,135939941;135940026,135940149;135940426,135940624;135941916,135942047;135942224,135942332;135942474,135942592;135944058,135944245;135944442,135944646;135945847,135946045;135946373,135947250 │
│ starts_i │ {135937364,135939790,135940026,135940426,135941916,135942224,135942474,135944058,135944442,135945847,135946373} │
│ ends_i │ {135937455,135939941,135940149,135940624,135942047,135942332,135942592,135944245,135944646,135946045,135947250} │
│ lengths │ {91,151,123,198,131,108,118,187,204,198,877} │
├─[ RECORD 2 ]───┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ hgnc │ CEL │
│ cds_md5 │ 519aa42c4a2192f9ef015b75224d81f7 │
│ es_fingerprint │ c16d68bfea4aa9c4f7876cff22afd5b9 │
│ tx_ac │ NM_001807.4 │
│ alt_ac │ NC_000009.11 │
│ alt_aln_method │ blat │
│ alt_strand │ 1 │
│ exon_set_id │ 328036 │
│ n_exons │ 11 │
│ se_i │ 135937364,135937455;135939790,135939941;135940026,135940149;135940426,135940624;135941916,135942047;135942224,135942332;135942474,135942592;135944058,135944245;135944442,135944646;135945847,135946045;135946373,135947250 │
│ starts_i │ {135937364,135939790,135940026,135940426,135941916,135942224,135942474,135944058,135944442,135945847,135946373} │
│ ends_i │ {135937455,135939941,135940149,135940624,135942047,135942332,135942592,135944245,135944646,135946045,135947250} │
│ lengths │ {91,151,123,198,131,108,118,187,204,198,877} │
├─[ RECORD 3 ]───┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ hgnc │ CEL │
│ cds_md5 │ 519aa42c4a2192f9ef015b75224d81f7 │
│ es_fingerprint │ c16d68bfea4aa9c4f7876cff22afd5b9 │
│ tx_ac │ NM_001807.4 │
│ alt_ac │ NC_000009.11 │
│ alt_aln_method │ splign-manual │
│ alt_strand │ 1 │
│ exon_set_id │ 849201 │
│ n_exons │ 11 │
│ se_i │ 135937364,135937455;135939790,135939941;135940026,135940149;135940426,135940624;135941916,135942047;135942224,135942332;135942474,135942592;135944058,135944245;135944442,135944646;135945847,135946045;135946373,135947250 │
│ starts_i │ {135937364,135939790,135940026,135940426,135941916,135942224,135942474,135944058,135944442,135945847,135946373} │
│ ends_i │ {135937455,135939941,135940149,135940624,135942047,135942332,135942592,135944245,135944646,135946045,135947250} │
│ lengths │ {91,151,123,198,131,108,118,187,204,198,877} │
└────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
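For anyone who wants to repeat this comparison, here is a minimal sketch of the check in Python. It is not part of the UTA codebase. The function names are made up for illustration, and it assumes the coordinates are interbase (0-based, half-open), which is consistent with each value in `lengths` equalling end minus start in the records above. The `se_i` string is copied from the records; it is identical in all three, so it is written once here. In real use each method's value would come from its own query result.

```python
# se_i value copied from the records above (identical for all three methods).
SE_I = (
    "135937364,135937455;135939790,135939941;135940026,135940149;"
    "135940426,135940624;135941916,135942047;135942224,135942332;"
    "135942474,135942592;135944058,135944245;135944442,135944646;"
    "135945847,135946045;135946373,135947250"
)


def parse_se_i(se_i):
    """Parse 'start,end;start,end;...' into a list of (start, end) int tuples."""
    return [tuple(int(x) for x in pair.split(",")) for pair in se_i.split(";")]


def exon_lengths(exons):
    """Exon lengths, assuming interbase coordinates (length = end - start)."""
    return [end - start for start, end in exons]


def methods_agree(se_i_by_method):
    """True if every alignment method yields exactly the same exon coordinates."""
    distinct = {tuple(parse_se_i(s)) for s in se_i_by_method.values()}
    return len(distinct) == 1


# Hypothetical container: one se_i string per alt_aln_method.
se_i_by_method = {"splign": SE_I, "blat": SE_I, "splign-manual": SE_I}

print(methods_agree(se_i_by_method))
print(exon_lengths(parse_se_i(SE_I)))
```

The second line printed should match the `lengths` column in the records.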
See https://github.com/biocommons/uta/tree/master/loading/data/splign-manual for end result.
It would be useful to enable users to create custom transcripts and custom genome-transcript alignments.
The obvious solution is to load transcripts directly into UTA. This approach is not ideal because it confounds the release process. A more flexible and scalable approach is to "layer" data sets, but this requires an API that federates the layered data.
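To make the "layering" idea concrete, here is a rough sketch in Python. It is purely illustrative and is not an existing UTA API: the class name, method name, layer names, and the `CUSTOM_TX.1` accession with its coordinates are all hypothetical. Each layer is a plain dict keyed by (tx_ac, alt_ac, alt_aln_method), and lookups try the layers in priority order, so a user-supplied layer can add to or override a released UTA layer without touching it.

```python
class LayeredExonSource:
    """Hypothetical federated lookup over an ordered list of (name, layer) pairs.

    Earlier layers take priority over later ones.
    """

    def __init__(self, layers):
        self.layers = list(layers)

    def get_exons(self, tx_ac, alt_ac, method):
        """Return (layer_name, exons) from the first layer holding the key."""
        key = (tx_ac, alt_ac, method)
        for name, layer in self.layers:
            if key in layer:
                return name, layer[key]
        raise KeyError(key)


# Stand-in for released UTA data (first two exons from the records above).
released = {
    ("NM_001807.4", "NC_000009.11", "splign"): [
        (135937364, 135937455),
        (135939790, 135939941),
    ],
}

# Stand-in for a user's custom layer; accession and coordinates are invented.
custom = {
    ("CUSTOM_TX.1", "NC_000009.11", "manual"): [(100, 200), (300, 450)],
}

source = LayeredExonSource([("custom", custom), ("released", released)])

print(source.get_exons("CUSTOM_TX.1", "NC_000009.11", "manual"))
print(source.get_exons("NM_001807.4", "NC_000009.11", "splign"))
```

The point of the design is that the released data set stays immutable, and custom transcripts live in a separate layer with its own lifecycle.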
@andreasprlic and @melissacline