Set up production workflow

mpoelchau commented 4 years ago

I think we are ready to develop a new (separate) workflow: setting up production. The current workflow handles the initial processing and setup of the data. Once this initial setup has been reviewed, we'll need a workflow to move everything to our production servers.

Steps (this will probably change):

Apollo
- on apollo server (hmm not sure how to do this on separate servers...)
- mkdir -p /app/data/other_species/[gggsss]/[assemblyname]
- on apollo-stage (should use an appropriate rsync command instead, e.g. rsync -avP):
- scp -r /app/data/other_species/[gggsss]/[assemblyname]/jbrowse apollo-node1:/app/data/other_species/[gggsss]/[assemblyname]
- scp -r /usr/local/blat/db/[gggsss]/ apollo-node1:/usr/local/blat/db
- run createorganism.cwl to add organism and admin to Apollo2 (will need separate host name, admin username, admin password)
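
A rough sketch of the Apollo transfer steps above, written as a dry run that only prints the commands. It assumes the production directory can be created over ssh (the steps above leave that open), and `GGGSSS`, `ASSEMBLY`, `PROD_HOST`, and both function names are placeholders made up for illustration:

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them.
# GGGSSS, ASSEMBLY and PROD_HOST are placeholders, not real values.
GGGSSS="${GGGSSS:-gggsss}"
ASSEMBLY="${ASSEMBLY:-assemblyname}"
PROD_HOST="${PROD_HOST:-apollo-node1}"

# Directory that holds the Apollo data for one assembly.
apollo_dir() {
    printf '/app/data/other_species/%s/%s' "$1" "$2"
}

# Print the commands that would move the JBrowse and BLAT data.
# Arguments: gggsss, assembly name, production host.
apollo_transfer_plan() {
    dir=$(apollo_dir "$1" "$2")
    # Assumption: the target directory is created over ssh.
    printf 'ssh %s mkdir -p %s\n' "$3" "$dir"
    # No trailing slash on the source, so rsync copies the directory
    # itself into the destination, as "scp -r" did.
    printf 'rsync -avP %s/jbrowse %s:%s\n' "$dir" "$3" "$dir"
    printf 'rsync -avP /usr/local/blat/db/%s %s:/usr/local/blat/db\n' "$1" "$3"
}

apollo_transfer_plan "$GGGSSS" "$ASSEMBLY" "$PROD_HOST"
```

One thing to watch when switching from scp to rsync: a trailing slash on the source (as in the blat scp line) would make rsync copy the directory's contents rather than the directory, so the sketch drops it.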
Genomics-workspace:
- UPDATE: this needs to be in 3 separate workflows. The first 2 should be run from the gmod-stage server. The last workflow needs to be run on the production server.
- First workflow:
- create folders on gmod-stage: the same as setup_folder.cwl, except without the jbrowse and bigwig directories
- download necessary files to gmod-stage: Genome fasta, protein fasta, RNA/transcript fasta, CDS fasta. Could re-use flow_download/workflow.cwl except for the gff3 download
- then put these files in the correct directories (could use flow_dispatch/2other_species/workflow.cwl up to L82 - don't need apollo files)
- reformat IDs? let's not worry about this for now.
- run flow_genomics-workspace on gmod-stage server
- Second workflow (after the admin user checks whether or not everything is set up properly on gmod-stage):
- generate yml file for third workflow (running genomics-workspace on prod server)
- create folders on gmod-node1 (FOR TESTING USE GMOD-DEV): the same as setup_folder.cwl, except without the jbrowse directories
- rsync fasta files required by genomics-workspace workflow to gmod-node1 (FOR TESTING USE GMOD-DEV)
- rsync yml file to production server (to /app/data/working_files/gggsss/assemblyname)

- Third workflow:
- run flow_genomics-workspace on gmod-node1 (production) server
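
To make the "FOR TESTING USE GMOD-DEV" note in the second workflow concrete, here is a small dry-run sketch of the yml transfer with a switch between the test and production hosts. The `MODE` switch, the function names, and the `genomics-workspace.yml` file name are all hypothetical; only the host names and the destination path come from the steps above:

```shell
#!/bin/sh
# Pick the destination host: gmod-dev while testing, gmod-node1 for
# production. The "prod"/"test" mode names are made up for this sketch.
target_host() {
    case "$1" in
        prod) printf 'gmod-node1' ;;
        *)    printf 'gmod-dev' ;;
    esac
}

# Print (not run) the rsync command that ships the yml file for the
# third workflow. Arguments: mode, gggsss, assembly name, yml file.
yml_sync_plan() {
    host=$(target_host "$1")
    dest="/app/data/working_files/$2/$3"
    printf 'rsync -avP %s %s:%s/\n' "$4" "$host" "$dest"
}

# genomics-workspace.yml is a placeholder file name.
yml_sync_plan "${MODE:-test}" gggsss assemblyname genomics-workspace.yml
```

Defaulting to gmod-dev means production is only touched when someone asks for it explicitly, which seems like the safer default while the workflow is still being tested.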

Data downloads (on gmod-stage or gmod-node1? not sure yet)
- rsync genome fasta, protein fasta, RNA/transcript fasta, CDS fasta, genome gff from apollo-stage1 to gmod-node1 (put in same directories as they are on stage). (Alternatively, you could reuse flow_download/workflow.cwl)
- create readme files (Monica needs to give Kelly templates - see also https://gitlab.com/i5k_Workspace/organism-setup-templates)
- refactor this messy script for cwl: reorganize_symlinks_v2.sh
- create a symlink to the data downloads directory 'Current Genome Assembly'
- we may have to scp symbolic links to another server. Let's worry about that later
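
For the 'Current Genome Assembly' symlink step, a minimal sketch that runs in a throwaway directory. It assumes the link is named 'Current Genome Assembly' and points at an assembly directory next to it; the function name and the `asm_v1`/`asm_v2` directories are invented for the demo and are not what reorganize_symlinks_v2.sh does:

```shell
#!/bin/sh
# Sketch only: works in a temporary directory, not the real data
# downloads tree.
# Usage: link_current_assembly <organism_dir> <assembly_name>
link_current_assembly() {
    # -s: symbolic link; -f: replace an existing link;
    # -n: do not follow an existing link to a directory, so re-running
    #     repoints the link instead of creating one inside the old target.
    ( cd "$1" && ln -sfn "$2" "Current Genome Assembly" )
}

demo_root=$(mktemp -d)
mkdir -p "$demo_root/asm_v1" "$demo_root/asm_v2"

link_current_assembly "$demo_root" asm_v1
link_current_assembly "$demo_root" asm_v2   # repoint to a newer assembly

# Show where the link points now.
readlink "$demo_root/Current Genome Assembly"
```

The link target is relative on purpose: if the tree is later copied to another server with rsync -a, which preserves symlinks, a relative link should still resolve on the other side.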

mpoelchau commented 4 years ago

@HsiuKangHuang I missed a step for 1. Apollo: we'll also need to rsync the scaffold/bigwig/ folder to production. Can you add that to the apolloServer-workflow.cwl workflow? Thanks!

mpoelchau commented 4 years ago

@HsiuKangHuang - Here's an update on how the new 'production' workflows should be organized. Sorry about the changes - I didn't realize earlier what steps would work well together in terms of the cwl setup.

Workflow changes:

Create a 'Move the data' workflow - move data from apollo-stage to apollo-node1, gmod-stage, gmod-node1
- apollo-node1: Use steps 1-4 from apolloServer-workflow.cwl (createFolder, dataTransfer-bigwig, dataTransfer-jbrowse, dataTransfer-blat)
- gmod-stage: copy the entire scaffold/ folder from apollo-stage to gmod-stage
- gmod-node1: copy the entire scaffold/ folder from apollo-stage to gmod-node1
Create a new createorganism workflow
- combines createOrganism.cwl and cat_createOrganismLog
- Will be run separately, after the 'move the data' workflow
Keep the Genomics-workspace workflow
- Data wrangler will perform this separately on both stage and prod, after the 'move the data' workflow
- How and when to run will be included in separate documentation (e.g. run 'move the data' workflow, run genomics-workspace on stage, check that it's good, then run the same workflow on prod)
- so I don't think we'll need to do anything new for this yet
Create a new create-symlinks workflow
- Data wrangler will perform this separately on both stage and prod, after the 'move the data' workflow
- includes "reorganize_symlinks_v2.sh" and "create a symlink to the data downloads directory 'Current Genome Assembly'"
Functional annotation workflow
- let's keep this separate for now, will include in documentation later
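
The intended run order could be sketched roughly as below. This is a dry run that only prints `cwltool <workflow> <job file>` lines. The .cwl file names are hypothetical placeholders for the workflows described above, and it assumes one shared job file, which may not turn out to be practical:

```shell
#!/bin/sh
# Dry run: print the planned cwltool invocations in order.
# All *.cwl names here are hypothetical placeholders.
JOB_FILE="${JOB_FILE:-final-workflow.yml}"

# Print one planned step: where it runs, then the command.
step() {
    printf '%s: cwltool %s %s\n' "$1" "$2" "$JOB_FILE"
}

run_order() {
    step once  move-data.cwl
    step once  createorganism.cwl
    step stage genomics-workspace.cwl
    # The data wrangler reviews stage before anything runs on prod.
    printf 'manual: check the stage results before continuing\n'
    step prod  genomics-workspace.cwl
    step stage create-symlinks.cwl
    step prod  create-symlinks.cwl
}

run_order
```

The functional annotation workflow is left out here, since it is being kept separate for now.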

yml file(s):

It would be nice if all the info for all these workflows could be in the original final-workflow.yml file, but that's not necessary if it's too complicated.

HsiuKangHuang commented 4 years ago

Okay! Thank you Monica for organizing these steps. I will modify the workflows.

mpoelchau commented 4 years ago

I think this is completed via #94.

NAL-i5K / Organism_Onboarding

Set up production workflow #87