Merge Invitae local changes used to build recent UTA

Overview

The UTA build process has had several long-standing technical issues that needed to be addressed before recurring builds and data releases could resume. The goal of this work was to get the project into a state where they can resume. The requirements for this work are listed below.

Requirements

Ability to run a UTA/SeqRepo build with minimal intervention.
Provide an alternative to retrieve alignments for NCBI RefSeq transcripts.
Provide a way to introduce UTA schema modifications.
Make build process more transparent.

Changes

Docker

Introduce Docker to containerize the UTA build environment.
Use docker compose with entry point scripts to provide more visibility into the build workflow:
1. seqrepo-pull: Pull the latest data version of seqrepo locally.
2. ncbi-download: Download files from NCBI needed by build pipeline.
3. uta-extract: Extract and transform data from downloaded files.
4. seqrepo-load: Load novel sequences into SeqRepo.
5. uta-load: Load genes, associated accessions, transcripts and alignments into UTA.
This allows the build to run on any system that has Docker installed and enough disk space (~35 GB).
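
As a rough sketch of how the five services chain together, the loop below runs them in the order listed above and stops at the first failure. The service names are taken from this description; the `run_build` function and its runner argument are hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical wrapper: service names come from the compose workflow
# described above; run_build and its runner argument are illustrative only.
set -eu

run_build() {
  # $1: command prefix used to launch each service.
  # Pass "echo docker compose run" to preview without starting containers.
  runner="${1:-docker compose run}"
  for step in seqrepo-pull ncbi-download uta-extract seqrepo-load uta-load; do
    echo ">>> ${step}"
    # Intentionally unquoted so the prefix splits into command + arguments.
    ${runner} "${step}"
  done
}
```

Calling `run_build` with no argument would launch each service with `docker compose run`; calling `run_build "echo docker compose run"` only prints the commands, which is a cheap way to check the order before committing to a multi-hour load.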

Alembic

Introduce Alembic to make schema/model changes easy and transparent.
Starting from an initial migration file that matches the current UTA schema, several additional changes were made:
- add model for assocacs table
- add gene_id to gene and transcript tables
- make gene_id the primary key for gene and foreign key for transcript -> gene
- add column to transcript for codon table
- create translation_exception table to hold translation exceptions parsed from RefSeq files at NCBI
- create a materialized view for tx_exon_aln_v that can be used in a future HGVS UTA dataprovider

New NCBI input files

Review and determine the minimum set of NCBI files needed for a UTA/SeqRepo build. (etc/ncbi-files.txt)
Downloading the files is the first step in the build process.
Transcript exon structure is still determined from RefSeq mRNA_Prot GBFF files.
Alignments are parsed from NCBI genome builds annotated with RefSeq transcripts (GFF file format).

One time workflows

This PR includes the code and configuration for several pre-build workflows that were run once to get the UTA database and SeqRepo ready for the latest build:
1. misc/gene-update/docker-compose-gene-update.yml: The entry point script added the initial Alembic migration, added the gene_id columns, performed the data backfill, and applied the rest of the schema changes.
2. misc/mito-transcripts/docker-compose-mito-extract.yml: Extract and transform mitochondrial gene sequences from NC_012920.1 so they could be loaded into UTA.
3. misc/refseq-historical-backfill/docker-compose-backfill.yml: Extract and transform RefSeq transcripts and alignments from "refseq/H_sapiens/historical/GRCh38/GCF_000001405.40-RS_2023_03_historical".

Results of latest build

Historical RefSeq Backfill

Extract intermediate files from NCBI RefSeq backfill (~50 minutes)

docker compose -f docker-compose.yml -f misc/refseq-historical-backfill/docker-compose-backfill.yml run uta-extract-historical

SeqRepo load for historical RefSeq backfill (~ 10 minutes)
```
docker compose run seqrepo-load
```

UTA load for historical RefSeq backfill (~ 4 hrs)

docker compose run uta-load
...
+-----------------------+------+---------+---------+-----+---------+--------+------------------------------------------------+
|         table         |  t   |    n1   |    n2   | nu1 |    nc   |  nu2   |                      cols                      |
+-----------------------+------+---------+---------+-----+---------+--------+------------------------------------------------+
| associated_accessions | 8.8  |  265048 |  274192 |  0  |  265048 |  9144  |              tx_ac,pro_ac,origin               |
|          exon         | 51.9 | 8311010 | 8658305 |  0  | 8311010 | 347295 |                       *                        |
|        exon_aln       | 36.5 | 5604227 | 5810798 |  0  | 5604227 | 206571 | exon_aln_id,tx_exon_id,alt_exon_id,cigar,added |
|        exon_set       | 6.5  |  894156 |  922894 |  45 |  894111 | 28783  |                       *                        |
|          gene         | 0.5  |  64092  |  64643  |  0  |  64092  |  551   |                    gene_id                     |
|          meta         | 0.0  |    5    |    5    |  1  |    4    |   1    |                       *                        |
|         origin        | 0.0  |    6    |    6    |  0  |    6    |   0    |                       *                        |
|          seq          | 27.8 |  340385 |  351449 |  0  |  340385 | 11064  |                       *                        |
|        seq_anno       | 2.8  |  360101 |  371704 |  0  |  360101 | 11603  |     seq_anno_id,seq_id,origin_id,ac,added      |
|       transcript      | 11.1 |  314264 |  325711 |  0  |  314264 | 11447  |                       ac                       |
+-----------------------+------+---------+---------+-----+---------+--------+------------------------------------------------+
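
The column meanings are not spelled out here. Reading n1 and n2 as row counts before and after the load, nc as rows common to both, and nu1/nu2 as rows unique to each side is an inference from the numbers, which satisfy n1 = nc + nu1 and n2 = nc + nu2 in every row. Under that assumption, a small awk filter can recompute the net row change per table and flag rows where the identity does not hold. `check_counts` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads the loader's summary table on stdin.
# Assumes (not confirmed by the source) that n1 = nc + nu1 and n2 = nc + nu2.

check_counts() {
  awk -F'|' '
    # Data rows have at least 9 pipe-separated fields and a numeric n1;
    # border rows contain no pipes and the header row has n1 == 0 here.
    NF >= 9 && $4 + 0 > 0 {
      name = $2
      gsub(/ /, "", name)
      n1 = $4 + 0; n2 = $5 + 0; nu1 = $6 + 0; nc = $7 + 0; nu2 = $8 + 0
      verdict = (n1 == nc + nu1 && n2 == nc + nu2) ? "ok" : "MISMATCH"
      printf "%s added=%d %s\n", name, n2 - n1, verdict
    }'
}
```

Piping the saved table through `check_counts` prints one line per table, for example the net number of transcripts added by the backfill.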

UTA/SeqRepo Build

Run ncbi-download to start standard update (~10 minutes)
```
docker compose run ncbi-download
```
Run uta-extract to generate intermediate files from downloaded files
```
docker compose run uta-extract
```
Run SeqRepo load
```
docker compose run seqrepo-load
```

Run UTA load

UTA_ETL_NEW_UTA_VERSION=uta_20240523 docker compose run uta-load
...
+-----------------------+------+---------+---------+-----+---------+---------+------------------------------------------------+
|         table         |  t   |    n1   |    n2   | nu1 |    nc   |   nu2   |                      cols                      |
+-----------------------+------+---------+---------+-----+---------+---------+------------------------------------------------+
| associated_accessions | 13.9 |  274192 |  405253 |  0  |  274192 |  131061 |              tx_ac,pro_ac,origin               |
|          exon         | 94.9 | 8658305 | 9716651 |  0  | 8658305 | 1058346 |                       *                        |
|        exon_aln       | 75.2 | 5810798 | 6847303 |  0  | 5810798 | 1036505 | exon_aln_id,tx_exon_id,alt_exon_id,cigar,added |
|        exon_set       | 13.6 |  922894 | 1022751 |  30 |  922864 |  99887  |                       *                        |
|          gene         | 3.3  |  64643  |  229123 |  0  |  64643  |  164480 |                    gene_id                     |
|          meta         | 0.0  |    5    |    5    |  1  |    4    |    1    |                       *                        |
|         origin        | 0.0  |    6    |    6    |  0  |    6    |    0    |                       *                        |
|          seq          | 48.3 |  351449 |  354745 |  0  |  351449 |   3296  |                       *                        |
|        seq_anno       | 3.9  |  371704 |  375097 |  0  |  371704 |   3393  |     seq_anno_id,seq_id,origin_id,ac,added      |
|       transcript      | 17.7 |  325711 |  328839 |  0  |  325711 |   3128  |                       ac                       |
+-----------------------+------+---------+---------+-----+---------+---------+------------------------------------------------+

How to test

Running from latest UTA release (uta_20210129b)

You will need to set some local working directories and a variable for the new uta build artifact

Build the UTA image

docker build --target uta -t uta-update .

Set necessary env variables

export UTA_ETL_OLD_UTA_IMAGE_TAG=uta_20210129b
export UTA_ETL_NEW_UTA_VERSION=uta_20240522
export UTA_ETL_NCBI_DIR=./ncbi-data
export UTA_ETL_WORK_DIR=./output/artifacts
export UTA_ETL_LOG_DIR=./output/logs
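
Since a missing variable may only surface partway through a long step, a quick preflight check can help. This is a minimal sketch that only verifies the five variables above are set and non-empty; `check_uta_env` is a hypothetical helper, not something shipped in this PR.

```shell
#!/bin/sh
# Variable names are the ones exported above; check_uta_env itself is a
# hypothetical helper and not part of the UTA repository.

check_uta_env() {
  missing=0
  for name in UTA_ETL_OLD_UTA_IMAGE_TAG UTA_ETL_NEW_UTA_VERSION \
              UTA_ETL_NCBI_DIR UTA_ETL_WORK_DIR UTA_ETL_LOG_DIR; do
    # Indirect lookup of the variable named by $name (POSIX-compatible).
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  # Non-zero when at least one variable is unset or empty.
  return "${missing}"
}
```

For example, `check_uta_env && docker compose run uta-load` would refuse to start the load when any of the variables is missing.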

Run gene id schema and data migration (~10-15 minutes)

docker compose -f docker-compose.yml -f misc/gene-update/docker-compose-gene-update.yml run uta-gene-update

Running the standard UTA build using output artifact from last step

Pull SeqRepo (~30 minutes)
```
docker compose run seqrepo-pull
```
Download files from NCBI (~10 minutes)
```
docker compose run ncbi-download
```
Run uta-extract to generate intermediate files from downloaded files
```
docker compose run uta-extract
```
Run SeqRepo load
```
docker compose run seqrepo-load
```

Run UTA load

UTA_ETL_OLD_UTA_VERSION=uta_20240522 \
UTA_ETL_NEW_UTA_VERSION=uta_20240523 \
docker compose run uta-load

Thank you for this amazing work. What is the best way to review this? I think I will check out this branch and try to build a local UTA with it.

Will there be any changes necessary to the hgvs dataprovider?

FYI for anyone else running this process...

I found that the alembic step in uta-load fails because the genes table already existed.

Found and ran another docker-compose step in the repo called "uta-gene-update".

Here is the step I ended up running:

UTA_ETL_OLD_UTA_VERSION=uta_20210129b UTA_ETL_NEW_UTA_VERSION=uta_20210129c docker compose -f docker-compose.yml -f misc/gene-update/docker-compose-gene-update.yml run uta-gene-update

Then changed my env vars to use the new UTA version (notice "c" instead of "b" due to the above UTA_ETL_NEW_UTA_VERSION).

export UTA_ETL_OLD_UTA_IMAGE_TAG=uta_20210129b
export UTA_ETL_OLD_UTA_VERSION=uta_20210129c
export UTA_ETL_NEW_UTA_VERSION=uta_20240817
export UTA_ETL_NCBI_DIR=./ncbi-data
export UTA_ETL_WORK_DIR=./output/artifacts
export UTA_ETL_LOG_DIR=./output/logs
export UTA_SPLIGN_MANUAL_DIR=$(pwd)/loading/data/splign-manual/

Then this worked for me:

docker compose run uta-load

HTH others facing a similar situation.

Initially we were going to make changes to the HGVS dataprovider, but it turned out to be more difficult than previously thought. I think we will need to have discussions on how best to do this.

Many of the breaking changes were reverted and 'hgnc' was added back to the transcript table. There was one manual step to update the transcript table from the gene table. I should add something about that in the readme. Thanks for asking this question.

The intent was to upload the artifact from our UTA update to biocommons. If you use that database as the starting point, it comes with the alembic table and will not attempt to run completed migrations. If run from the latest biocommons build, then the "gene update" compose statement is the correct place to start from.

I managed to build my own new UTA database, following @imaurer 's suggestions. Next step will be to move this onto an RDS instance and do some content testing.

@bsgiles73 how do you recommend providing the seqrepo update? I think the data right now is in the image, and not mounted to the host file system. Would it be easier to have an external mount? Or is there a trick to get the data out easily? Thanks!

I ran multiple updates over the last few days and overall this works pretty well. My local UTA now has 328,949 transcripts (plus mito). Thank you again for contributing this major improvement!

Some observations: The alignment step is pretty memory-intensive. My initial VM config did not have enough memory, and the build crashed when the alignment process got killed (after a few hours, which was a bit frustrating). After changing the memory setting to allow 13 GB of RAM, things are running smoothly now.
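
To avoid finding this out a few hours in, a memory check before starting the load might look like the sketch below. `has_enough_memory` is a hypothetical helper; it reads `MemTotal` from `/proc/meminfo`, so it only applies on Linux (the host, or inside the VM that runs Docker). The default of 12 is an assumption chosen to sit just under the 13 GB setting, because `MemTotal` typically reports somewhat less than the configured amount. The 13 GB figure is one data point from this run, not a documented requirement.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when total memory meets a threshold in GiB.

has_enough_memory() {
  required_gb="${1:-12}"            # assumed default, see note above
  meminfo="${2:-/proc/meminfo}"     # Linux only; override for testing
  # MemTotal is reported in kB, e.g. "MemTotal:       16384000 kB".
  total_kb="$(awk '/^MemTotal:/ { print $2 }' "${meminfo}")"
  required_kb=$((required_gb * 1024 * 1024))
  [ "${total_kb}" -ge "${required_kb}" ]
}
```

A possible use is `has_enough_memory 12 || echo "uta-load may be killed for lack of memory" >&2` before kicking off the load.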

There are three other minor observations:
1. The anonymous user does not have permissions to SELECT on the views. I needed to fix that by hand.
2. One view needs to be renamed, so it does not break hgvs: alter view tx_def_summary_dv rename to tx_def_summary_v.
3. It is not clear to me how to get seqrepo out for distribution. Can we add some documentation for that?

Thanks!!!
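
For the first two observations, the manual fix could be captured as a small SQL generator like the one below. This is only a sketch: `post_load_sql` is a hypothetical helper that prints statements and runs nothing, the role name `anonymous` and the view names come from the observations above, and treating the UTA version string as the schema name is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: prints the post-load SQL for a given schema.
# Assumes the schema is named after the UTA version (e.g. uta_20240523).

post_load_sql() {
  schema="$1"
  cat <<EOF
GRANT USAGE ON SCHEMA ${schema} TO anonymous;
GRANT SELECT ON ALL TABLES IN SCHEMA ${schema} TO anonymous;
ALTER VIEW ${schema}.tx_def_summary_dv RENAME TO tx_def_summary_v;
EOF
}
```

The output can be reviewed first and then piped into psql, for instance `post_load_sql uta_20240523 | psql -v ON_ERROR_STOP=1 -d uta`, where the database name `uta` is also an assumption. In PostgreSQL, `GRANT ... ON ALL TABLES IN SCHEMA` also applies to views.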

@andreasprlic Thanks for doing the testing. If you have permissions we should be able to rsync the new SeqRepo directory to stuart. Once it is there, building the image should be straightforward. I have not done this yet, as I wanted to give time for this PR to be reviewed and tested. A todo is to provide an updated doc on how to run a new build.

With docker-compose it is more straightforward to work with SeqRepo as a docker container, which is why we built it this way. I don't think copying it out will be an issue.

Thanks for the feedback @andreasprlic.

This is a good point. The workflow ends with the artifact. After that there are steps to install the new schema on a host system. Setting permissions is currently part of that process, and we didn't build a workflow for it.
Let me check the Alembic migrations; I thought we had fixed this. But perhaps I did it by hand as well!
Once the uta_load completes, there are two artifacts that need to be rsync'd to biocommons stuart: one is the new UTA psql dump file and the other is the new SeqRepo directory. Once that is done, we should be able to log in to stuart to build and push the new docker images. I think that is how it could work. I have not done it yet, so I don't have docs for it.
