biocommons / uta

Universal Transcript Archive: comprehensive genome-transcript alignments; multiple transcript sources, versions, and alignment methods; available as a docker image
Apache License 2.0

Merge Invitae local changes used to build recent UTA #261

Open bsgiles73 opened 3 months ago

bsgiles73 commented 3 months ago

Overview

The UTA build process has had several technical issues for some time, and these needed to be addressed before recurring builds and data releases could resume. The goal of this work was to get the project into a state where they can. The requirements for this work are listed below.

Requirements

Changes

Docker

Alembic

New NCBI input files

One time workflows

Results of latest build

Historical RefSeq Backfill

  1. Extract intermediate files from NCBI RefSeq backfill (~50 minutes)
    docker compose -f docker-compose.yml -f misc/refseq-historical-backfill/docker-compose-backfill.yml run uta-extract-historical
  2. SeqRepo load for historical RefSeq backfill (~10 minutes)
    docker compose run seqrepo-load
  3. UTA load for historical RefSeq backfill (~4 hours)
    docker compose run uta-load
    ...
    +-----------------------+------+---------+---------+-----+---------+--------+------------------------------------------------+
    |         table         |  t   |    n1   |    n2   | nu1 |    nc   |  nu2   |                      cols                      |
    +-----------------------+------+---------+---------+-----+---------+--------+------------------------------------------------+
    | associated_accessions | 8.8  |  265048 |  274192 |  0  |  265048 |  9144  |              tx_ac,pro_ac,origin               |
    |          exon         | 51.9 | 8311010 | 8658305 |  0  | 8311010 | 347295 |                       *                        |
    |        exon_aln       | 36.5 | 5604227 | 5810798 |  0  | 5604227 | 206571 | exon_aln_id,tx_exon_id,alt_exon_id,cigar,added |
    |        exon_set       | 6.5  |  894156 |  922894 |  45 |  894111 | 28783  |                       *                        |
    |          gene         | 0.5  |  64092  |  64643  |  0  |  64092  |  551   |                    gene_id                     |
    |          meta         | 0.0  |    5    |    5    |  1  |    4    |   1    |                       *                        |
    |         origin        | 0.0  |    6    |    6    |  0  |    6    |   0    |                       *                        |
    |          seq          | 27.8 |  340385 |  351449 |  0  |  340385 | 11064  |                       *                        |
    |        seq_anno       | 2.8  |  360101 |  371704 |  0  |  360101 | 11603  |     seq_anno_id,seq_id,origin_id,ac,added      |
    |       transcript      | 11.1 |  314264 |  325711 |  0  |  314264 | 11447  |                       ac                       |
    +-----------------------+------+---------+---------+-----+---------+--------+------------------------------------------------+

UTA/SeqRepo Build

  4. Run ncbi-download to start standard update (~10 minutes)
    docker compose run ncbi-download
  5. Run uta-extract to generate intermediate files from downloaded files
    docker compose run uta-extract
  6. Run SeqRepo load
    docker compose run seqrepo-load
  7. Run UTA load
    UTA_ETL_NEW_UTA_VERSION=uta_20240523 docker compose run uta-load
    ...
    +-----------------------+------+---------+---------+-----+---------+---------+------------------------------------------------+
    |         table         |  t   |    n1   |    n2   | nu1 |    nc   |   nu2   |                      cols                      |
    +-----------------------+------+---------+---------+-----+---------+---------+------------------------------------------------+
    | associated_accessions | 13.9 |  274192 |  405253 |  0  |  274192 |  131061 |              tx_ac,pro_ac,origin               |
    |          exon         | 94.9 | 8658305 | 9716651 |  0  | 8658305 | 1058346 |                       *                        |
    |        exon_aln       | 75.2 | 5810798 | 6847303 |  0  | 5810798 | 1036505 | exon_aln_id,tx_exon_id,alt_exon_id,cigar,added |
    |        exon_set       | 13.6 |  922894 | 1022751 |  30 |  922864 |  99887  |                       *                        |
    |          gene         | 3.3  |  64643  |  229123 |  0  |  64643  |  164480 |                    gene_id                     |
    |          meta         | 0.0  |    5    |    5    |  1  |    4    |    1    |                       *                        |
    |         origin        | 0.0  |    6    |    6    |  0  |    6    |    0    |                       *                        |
    |          seq          | 48.3 |  351449 |  354745 |  0  |  351449 |   3296  |                       *                        |
    |        seq_anno       | 3.9  |  371704 |  375097 |  0  |  371704 |   3393  |     seq_anno_id,seq_id,origin_id,ac,added      |
    |       transcript      | 17.7 |  325711 |  328839 |  0  |  325711 |   3128  |                       ac                       |
    +-----------------------+------+---------+---------+-----+---------+---------+------------------------------------------------+
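The column meanings are not spelled out in the output above, but the numbers are consistent with n1 = nu1 + nc and n2 = nc + nu2, i.e. n1 and n2 being row counts in the old and new schema, nc the rows common to both, and nu1/nu2 the rows unique to each. That reading is an assumption drawn from the arithmetic, not from any documentation in this thread. A minimal sketch of the check, with `check_row` as a hypothetical helper name:

```shell
#!/bin/sh
# check_row N1 N2 NU1 NC NU2
# Succeeds when n1 == nu1 + nc and n2 == nc + nu2.
# The interpretation of the columns is inferred from the numbers (assumption).
check_row() {
    n1=$1; n2=$2; nu1=$3; nc=$4; nu2=$5
    [ "$n1" -eq $((nu1 + nc)) ] && [ "$n2" -eq $((nc + nu2)) ]
}

# exon_set row copied from the uta_20240523 load output above
if check_row 922894 1022751 30 922864 99887; then
    echo "exon_set: consistent"
else
    echo "exon_set: inconsistent"
fi
```

The same relation holds for the other rows in both tables, for example gene (64643 + 164480 = 229123).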

How to test

Running from latest UTA release (uta_20210129b)

You will need to set some local working directories and a variable for the new UTA build artifact.

  1. Build the UTA image
    docker build --target uta -t uta-update .
  2. Set necessary env variables
    export UTA_ETL_OLD_UTA_IMAGE_TAG=uta_20210129b
    export UTA_ETL_NEW_UTA_VERSION=uta_20240522
    export UTA_ETL_NCBI_DIR=./ncbi-data
    export UTA_ETL_WORK_DIR=./output/artifacts
    export UTA_ETL_LOG_DIR=./output/logs
  3. Run gene id schema and data migration (~10-15 minutes)
    docker compose -f docker-compose.yml -f misc/gene-update/docker-compose-gene-update.yml run uta-gene-update
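Since the gene update runs for 10-15 minutes, it may be worth confirming that the variables from step 2 are set before starting it. The following is only a sketch: `require_vars` and `UTA_VARS` are hypothetical names and not part of the UTA repository; the variable names themselves are the ones exported above.

```shell
#!/bin/sh
# require_vars NAME...
# Reports every listed variable that is unset or empty on stderr and
# returns non-zero if any were missing.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Variables exported in step 2 above.
UTA_VARS="UTA_ETL_OLD_UTA_IMAGE_TAG UTA_ETL_NEW_UTA_VERSION UTA_ETL_NCBI_DIR UTA_ETL_WORK_DIR UTA_ETL_LOG_DIR"
```

One way to use it is `require_vars $UTA_VARS && docker compose -f docker-compose.yml -f misc/gene-update/docker-compose-gene-update.yml run uta-gene-update`, so the compose command only starts when all five variables are present.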

Running the standard UTA build using output artifact from last step

  1. Pull SeqRepo (~30 minutes)
    docker compose run seqrepo-pull
  2. Download files from NCBI (~10 minutes)
    docker compose run ncbi-download
  3. Run uta-extract to generate intermediate files from downloaded files
    docker compose run uta-extract
  4. Run SeqRepo load
    docker compose run seqrepo-load
  5. Run UTA load
    UTA_ETL_OLD_UTA_VERSION=uta_20240522 \
    UTA_ETL_NEW_UTA_VERSION=uta_20240523 \
    docker compose run uta-load
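
The five steps above could be chained in a small wrapper that stops at the first failing step. This is a sketch under assumptions: `run_build` and `DRY_RUN` are hypothetical names and not part of the UTA repository; the service names are the ones listed above.

```shell
#!/bin/sh
# run_build: run the standard-build services in order, stopping at the
# first failure. With DRY_RUN=1 the commands are only printed.
run_build() {
    for service in seqrepo-pull ncbi-download uta-extract seqrepo-load uta-load; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "docker compose run $service"
        else
            docker compose run "$service" || {
                echo "step failed: $service" >&2
                return 1
            }
        fi
    done
}
```

Export UTA_ETL_OLD_UTA_VERSION and UTA_ETL_NEW_UTA_VERSION first so that uta-load sees them, then call `DRY_RUN=1 run_build` to preview the sequence or `run_build` to execute it.
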
andreasprlic commented 3 months ago

Thank you for this amazing work. What is the best way to review this? I think I will check out this branch and try to build a local UTA with it.

Will there be any changes necessary to the hgvs dataprovider?

imaurer commented 2 months ago

FYI for anyone else running this process...

I found that the alembic step in uta-load fails because the genes table already existed.

Found and ran another docker-compose step in the repo called "uta-gene-update".

Here is the step I ended up running:

UTA_ETL_OLD_UTA_VERSION=uta_20210129b UTA_ETL_NEW_UTA_VERSION=uta_20210129c docker compose -f docker-compose.yml -f misc/gene-update/docker-compose-gene-update.yml run uta-gene-update

Then changed my env vars to use the new UTA version (notice "c" instead of "b" due to the above UTA_ETL_NEW_UTA_VERSION).

export UTA_ETL_OLD_UTA_IMAGE_TAG=uta_20210129b
export UTA_ETL_OLD_UTA_VERSION=uta_20210129c
export UTA_ETL_NEW_UTA_VERSION=uta_20240817
export UTA_ETL_NCBI_DIR=./ncbi-data
export UTA_ETL_WORK_DIR=./output/artifacts
export UTA_ETL_LOG_DIR=./output/logs
export UTA_SPLIGN_MANUAL_DIR=$(pwd)/loading/data/splign-manual/

Then this worked for me:

docker compose run uta-load

HTH others facing a similar situation.

bsgiles73 commented 2 months ago

Initially we were going to make changes to the HGVS dataprovider, but that turned out to be more difficult than expected. I think we will need to discuss how best to do this.

Many of the breaking changes were reverted, and 'hgnc' was added back to the transcript table. There was one manual step to update the transcript table from the gene table. I should add something about that to the README. Thanks for asking this question.

bsgiles73 commented 2 months ago

The intent was to upload the artifact from our UTA update to biocommons. If you use that database as the starting point, it comes with the alembic table and will not attempt to run completed migrations. If run from the latest biocommons build, the "gene update" compose statement is the correct place to start.

andreasprlic commented 1 month ago

I managed to build my own new UTA database, following @imaurer's suggestions. The next step will be to move this onto an RDS instance and do some content testing.

@bsgiles73 how do you recommend providing the seqrepo update? I think the data right now is in the image, and not mounted to the host file system. Would it be easier to have an external mount? Or is there a trick to get the data out easily? Thanks!

andreasprlic commented 1 month ago

I ran multiple updates over the last few days and overall this works pretty well. My local UTA now has 328,949 transcripts (plus mito). Thank you again for contributing this major improvement!

Some observations: The alignment step is pretty memory intensive. My initial VM config did not have enough memory, and the build crashed when the alignment process got killed (after a few hours, which was a bit frustrating). After changing the memory setting to allow 13 GB of RAM, things are running smoothly now.
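
A quick memory check before starting the load can avoid losing hours to a killed alignment process. The sketch below assumes a Linux environment with /proc/meminfo; `enough_memory` and `mem_total_kb` are hypothetical helper names, and 13 GB is only the setting reported to work here, not a documented requirement.

```shell
#!/bin/sh
# enough_memory TOTAL_KB REQUIRED_GB
# Succeeds when TOTAL_KB (kB, as reported in /proc/meminfo) is at least
# REQUIRED_GB GiB.
enough_memory() {
    total_kb=$1
    required_gb=$2
    [ "$total_kb" -ge $((required_gb * 1024 * 1024)) ]
}

# mem_total_kb: print the MemTotal value from /proc/meminfo (Linux only).
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}
```

For example, `enough_memory "$(mem_total_kb)" 12 || echo "probably too little memory for the alignment step" >&2`. MemTotal is usually somewhat lower than the configured size because the kernel reserves part of it, so the example threshold is set a little below 13.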

There are three other minor observations:

  1. The anonymous user does not have permissions to SELECT on the views. I needed to fix that by hand.
  2. One view needs to be renamed so it does not break hgvs: alter view tx_def_summary_dv rename to tx_def_summary_v.
  3. It is not clear to me how to get seqrepo out for distribution. Can we add some documentation for that?
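
The permission and rename fixes described above could be applied with SQL along the following lines. This is a sketch, not the statements that were actually run: `post_install_sql` is a hypothetical helper, and the schema name `uta_20240523` and role name `anonymous` are example values passed in by the caller. In PostgreSQL, GRANT ... ON ALL TABLES IN SCHEMA also covers views.

```shell
#!/bin/sh
# post_install_sql SCHEMA ROLE
# Print SQL that renames the view and grants read access on SCHEMA to ROLE.
# SCHEMA and ROLE are caller-supplied; the values used below are examples.
post_install_sql() {
    schema=$1
    role=$2
    cat <<EOF
ALTER VIEW $schema.tx_def_summary_dv RENAME TO tx_def_summary_v;
GRANT USAGE ON SCHEMA $schema TO $role;
GRANT SELECT ON ALL TABLES IN SCHEMA $schema TO $role;
EOF
}

post_install_sql uta_20240523 anonymous
```

The output can be reviewed and then piped into psql against the target database.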

Thanks!!!

bsgiles73 commented 1 month ago

@andreasprlic Thanks for doing the testing. If you have permissions, we should be able to rsync the new SeqRepo directory to stuart. Once it is there, building the image should be straightforward. I have not done this yet, as I wanted to give time for this PR to be reviewed and tested. A todo is to provide an updated doc on how to run a new build.

With docker-compose, it is more straightforward to work with SeqRepo as a docker container, which is why we built it this way. I don't think copying it out will be an issue.

bsgiles73 commented 1 month ago

Thanks for the feedback @andreasprlic.

  1. This is a good point. The workflow ends with the artifact. After that there are steps to install the new schema on a host system. Setting permissions is currently part of that process, and we didn't build a workflow for it.
  2. Let me check the Alembic migrations. I thought we had fixed this, but perhaps I did it by hand as well!
  3. Once uta-load completes, there are two artifacts that need to be rsync'd to biocommons stuart. One is the new UTA psql dump file and the second is the new SeqRepo directory. Once that is done, we should be able to log in to stuart to build and push the new docker images. I think that is how it could work. I have not done it yet, so I don't have docs for it.