bsgiles73 opened 3 months ago
Thank you for this amazing work. What is the best way to review this? I think I will check out this branch and try to build a local UTA with it.
Will there be any changes necessary to the hgvs dataprovider?
FYI for anyone else running this process...
I found that the alembic step in uta-load fails because the genes table already exists. I found and ran another docker-compose step in the repo called "uta-gene-update".
Here is the step I ended up running:
UTA_ETL_OLD_UTA_VERSION=uta_20210129b UTA_ETL_NEW_UTA_VERSION=uta_20210129c docker compose -f docker-compose.yml -f misc/gene-update/docker-compose-gene-update.yml run uta-gene-update
Then I changed my env vars to use the new UTA version (note the "c" instead of "b", matching UTA_ETL_NEW_UTA_VERSION above):
export UTA_ETL_OLD_UTA_IMAGE_TAG=uta_20210129b
export UTA_ETL_OLD_UTA_VERSION=uta_20210129c
export UTA_ETL_NEW_UTA_VERSION=uta_20240817
export UTA_ETL_NCBI_DIR=./ncbi-data
export UTA_ETL_WORK_DIR=./output/artifacts
export UTA_ETL_LOG_DIR=./output/logs
export UTA_SPLIGN_MANUAL_DIR=$(pwd)/loading/data/splign-manual/
Then this worked for me:
docker compose run uta-load
HTH others facing a similar situation.
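Before kicking off uta-load, a small guard script can catch unset variables early. This is my own addition, not part of the repo; the variable names and values mirror the exports shown above.

```shell
# Set the UTA ETL variables (values from this thread; adjust to your build):
export UTA_ETL_OLD_UTA_IMAGE_TAG=uta_20210129b
export UTA_ETL_OLD_UTA_VERSION=uta_20210129c
export UTA_ETL_NEW_UTA_VERSION=uta_20240817
export UTA_ETL_NCBI_DIR=./ncbi-data
export UTA_ETL_WORK_DIR=./output/artifacts
export UTA_ETL_LOG_DIR=./output/logs
export UTA_SPLIGN_MANUAL_DIR=$(pwd)/loading/data/splign-manual/

# Fail fast if anything is missing before `docker compose run uta-load`:
missing=0
for v in UTA_ETL_OLD_UTA_VERSION UTA_ETL_NEW_UTA_VERSION \
         UTA_ETL_NCBI_DIR UTA_ETL_WORK_DIR UTA_ETL_LOG_DIR \
         UTA_SPLIGN_MANUAL_DIR; do
  if [ -z "$(eval "echo \${$v}")" ]; then
    echo "missing: $v"
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "environment ok"
```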
Initially we were going to make changes to the HGVS dataprovider, but it turned out to be more difficult than previously thought. I think we will need to have discussions on how best to do this.
Many of the breaking changes were reverted and 'hgnc' was added back to the transcript table. There was one manual step to update the transcript table from the gene table. I should add something about that in the README. Thanks for asking this question.
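The "one manual step" mentioned above refreshes the transcript table from the gene table. The exact statement is not given in this thread; a purely hypothetical sketch, with an assumed schema name (uta_20240817) and assumed column names, might look like:

```shell
# HYPOTHETICAL sketch only -- verify against the actual migration before
# running anything like this. Schema, column, and join-key names are
# assumptions, as is the UTA_DB_URL connection variable.
psql "$UTA_DB_URL" <<'SQL'
UPDATE uta_20240817.transcript t
   SET hgnc = g.symbol            -- assumed source column on gene
  FROM uta_20240817.gene g
 WHERE g.gene_id = t.gene_id;     -- assumed join key
SQL
```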
The intent was to upload the artifact from our UTA update to biocommons. If you use that database as the starting point it comes with the alembic table and will not attempt to run completed migrations. If ran from the latest biocommons build then the "gene update" compose statement is the correct place to start from.
I managed to build my own new UTA database, following @imaurer's suggestions. Next step will be to move this onto an RDS instance and do some content testing.
@bsgiles73 how do you recommend providing the seqrepo update? I think the data right now is in the image, and not mounted to the host file system. Would it be easier to have an external mount? Or is there a trick to get the data out easily? Thanks!
I ran multiple updates over the last few days and overall this works pretty well. My local UTA now has 328,949 transcripts (plus mito). Thank you again for contributing this major improvement!
Some observations: the alignment step is pretty memory-intensive. My initial VM config did not have enough memory and it crashed when the alignment process got killed (after a few hours, which was a bit frustrating). After changing the memory setting to allow 13 GB of RAM, things are running smoothly now.
There are three other minor observations:
1) The anonymous user does not have permissions to SELECT on the views. I needed to fix that by hand.
2) One view needs to be renamed so it does not break hgvs: alter view tx_def_summary_dv rename to tx_def_summary_v.
3) It is not clear to me how to get seqrepo out for distribution. Can we add some documentation for that?
Thanks!!!
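For anyone hitting the same first two issues, the by-hand fixes can be expressed as psql one-liners. The schema name uta_20240817 and the UTA_DB_URL variable here are assumptions; substitute your own.

```shell
# Grant read access on the new schema to the anonymous role
# (in PostgreSQL, GRANT ... ON ALL TABLES also covers views):
psql "$UTA_DB_URL" -c 'GRANT SELECT ON ALL TABLES IN SCHEMA uta_20240817 TO anonymous;'
# Rename the view so hgvs finds the name it expects:
psql "$UTA_DB_URL" -c 'ALTER VIEW uta_20240817.tx_def_summary_dv RENAME TO tx_def_summary_v;'
```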
@andreasprlic Thanks for doing the testing. If you have permissions we should be able to rsync the new SeqRepo directory to stuart. Once it is there, building the image should be straightforward. I have not done this yet, as I wanted to give time for this PR to be reviewed and tested. A todo is to provide an updated doc on how to run a new build.
Using docker-compose it is more straightforward to work with SeqRepo as a docker container, which is why we built it this way. I don't think copying it out will be an issue.
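If you do want the data on the host, one possible approach (an untested sketch; the service name seqrepo and the in-container path are assumptions) is to copy it out of the compose service:

```shell
# Copy the SeqRepo directory from the compose service to the host
# (requires Docker Compose v2, which provides `docker compose cp`):
docker compose cp seqrepo:/usr/local/share/seqrepo ./seqrepo-export
# Alternative: add a bind mount for the data path in docker-compose.yml
# so the files are written to the host during the build instead.
```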
Thanks for the feedback @andreasprlic.
Overview
The build process for UTA has had several technical issues for some time that have prevented recurring builds and data releases. The goal of this work was to get the project into a state where they can resume. Listed below are the requirements for this work.
Requirements
Changes
Docker
- docker compose with entry point scripts to provide more visibility into the build workflow:
  1. seqrepo-pull: Pull the latest data version of seqrepo locally.
  2. ncbi-download: Download files from NCBI needed by build pipeline.
  3. uta-extract: Extract and transform data from downloaded files.
  4. seqrepo-load: Load novel sequences into SeqRepo.
  5. uta-load: Load genes, associated accessions, transcripts and alignments into UTA.

Alembic
- [...] tx_exon_aln_v that can be used in a future HGVS UTA dataprovider

New NCBI input files
- (etc/ncbi-files.txt)

One time workflows
- [...] "refseq/H_sapiens/historical/GRCh38/GCF_000001405.40-RS_2023_03_historical".

Results of latest build
Historical RefSeq Backfill
UTA/SeqRepo Build
How to test
Running from latest UTA release (uta_20210129b)
You will need to set some local working directories and a variable for the new UTA build artifact.
Running the standard UTA build using output artifact from last step
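Pulling together the commands shared earlier in this thread, the end-to-end test sequence looks roughly like the following. This is an informal sketch assembled from the thread, not an official script; version strings are the ones used above, so adjust them to your release.

```shell
# 1. One-time gene-table migration on top of the last release:
UTA_ETL_OLD_UTA_VERSION=uta_20210129b UTA_ETL_NEW_UTA_VERSION=uta_20210129c \
  docker compose -f docker-compose.yml \
  -f misc/gene-update/docker-compose-gene-update.yml run uta-gene-update

# 2. Point the variables at the migrated database ("c") and the new version:
export UTA_ETL_OLD_UTA_IMAGE_TAG=uta_20210129b
export UTA_ETL_OLD_UTA_VERSION=uta_20210129c
export UTA_ETL_NEW_UTA_VERSION=uta_20240817
export UTA_ETL_NCBI_DIR=./ncbi-data
export UTA_ETL_WORK_DIR=./output/artifacts
export UTA_ETL_LOG_DIR=./output/logs
export UTA_SPLIGN_MANUAL_DIR=$(pwd)/loading/data/splign-manual/

# 3. Run the standard build:
docker compose run uta-load
```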