NAL-i5K / Organism_Onboarding

A workflow to make organism onboarding pipeline easy to handle as an I/O pipeline

Set up production workflow #87

Closed: mpoelchau closed this issue 4 years ago

mpoelchau commented 4 years ago

I think we are ready to develop a new (separate) workflow - setup production. The current workflow is for initially processing and setting up the data. Once this initial setup has been reviewed, we'll need a workflow to move everything to our production servers.

Steps (this will probably change):

  1. Apollo

    • on apollo server (hmm not sure how to do this on separate servers...)
    • mkdir -p /app/data/other_species/[gggsss]/[assemblyname]
    • on apollo-stage (should use an appropriate rsync command instead, e.g. rsync -avP):
    • scp -r /app/data/other_species/[gggsss]/[assemblyname]/jbrowse apollo-node1:/app/data/other_species/[gggsss]/[assemblyname]
    • scp -r /usr/local/blat/db/[gggsss]/ apollo-node1:/usr/local/blat/db
    • run createorganism.cwl to add organism and admin to Apollo2 (will need separate host name, admin username, admin password)
  2. Genomics-workspace:

    • UPDATE: this needs to be in 3 separate workflows. The first 2 should be run from the gmod-stage server. The last workflow needs to be run on the production server.

    • First workflow:

    • create folders on gmod-stage: the same as setup_folder.cwl, except without the jbrowse and bigwig directories

    • download necessary files to gmod-stage: Genome fasta, protein fasta, RNA/transcript fasta, CDS fasta. Could re-use flow_download/workflow.cwl except for the gff3 download

    • then put these files in the correct directories (could use flow_dispatch/2other_species/workflow.cwl up to L82 - don't need apollo files)

    • reformat IDs? let's not worry about this for now.

    • run flow_genomics-workspace on gmod-stage server

    • Second workflow (after the admin user checks whether or not everything is set up properly on gmod-stage):

    • generate yml file for third workflow (running genomics-workspace on prod server)

    • create folders on gmod-node1 (FOR TESTING USE GMOD-DEV): the same as setup_folder.cwl, except without the jbrowse directories

    • rsync fasta files required by genomics-workspace workflow to gmod-node1 (FOR TESTING USE GMOD-DEV)

    • rsync yml file to production server (to /app/data/working_files/gggsss/assemblyname)

  3. Data downloads (on gmod-stage or gmod-node1? not sure yet)
    • rsync genome fasta, protein fasta, RNA/transcript fasta, CDS fasta, genome gff from apollo-stage1 to gmod-node1 (put in same directories as they are on stage). (Alternatively, you could reuse flow_download/workflow.cwl)
    • create readme files (Monica needs to give Kelly templates - see also https://gitlab.com/i5k_Workspace/organism-setup-templates)
    • refactor this messy script for cwl: reorganize_symlinks_v2.sh
    • create a symlink to the data downloads directory 'Current Genome Assembly'
    • we may have to scp symbolic links to another server. Let's worry about that later
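
For the open question under 1. Apollo (how to create the directory when the servers are separate), one possible approach is to run mkdir over ssh from the staging server. The sketch below only prints the command instead of executing it. The helper name remote_mkdir_command is hypothetical, and the organism code and assembly name are placeholders; only the host name and base path come from the steps above.

```shell
# Sketch only: builds the remote mkdir command and prints it rather than running it.
# remote_mkdir_command is a hypothetical helper, not part of this repository.
#   $1 = host, $2 = organism code ([gggsss]), $3 = assembly name
remote_mkdir_command() {
  local host="$1" org="$2" asm="$3"
  local dir="/app/data/other_species/${org}/${asm}"
  # ssh would run mkdir -p on the remote host, so this could be issued from apollo-stage
  printf "ssh %s 'mkdir -p %s'\n" "$host" "$dir"
}

# Placeholder values; substitute the real organism code and assembly name.
remote_mkdir_command apollo-node1 gggsss assemblyname
```

Printing the command first makes it easy to review before anything touches production. Whether the cwl workflow should shell out to ssh at all is still an open design choice.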

mpoelchau commented 4 years ago

@HsiuKangHuang I missed a step for 1. Apollo: we'll also need to rsync the scaffold/bigwig/ folder to production. Can you add that to the apolloServer-workflow.cwl workflow? Thanks!
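
As a rough illustration of what the Apollo copy step might look like with rsync in place of scp, here is a sketch that prints the three commands rather than running them. The helper name apollo_sync_commands is hypothetical, and the location of scaffold/bigwig/ under the assembly directory is an assumption, since the full path is not stated in this issue.

```shell
# Sketch only: prints the rsync commands instead of executing them.
# apollo_sync_commands is a hypothetical helper, not part of this repository.
#   $1 = organism code ([gggsss]), $2 = assembly name, $3 = production host
apollo_sync_commands() {
  local org="$1" asm="$2" host="$3"
  local base="/app/data/other_species/${org}/${asm}"
  # jbrowse data and BLAT database, mirroring the scp steps listed under 1. Apollo
  echo "rsync -avP ${base}/jbrowse ${host}:${base}/"
  echo "rsync -avP /usr/local/blat/db/${org}/ ${host}:/usr/local/blat/db/${org}/"
  # ASSUMPTION: scaffold/bigwig/ sits directly under the assembly directory
  echo "rsync -avP ${base}/scaffold/bigwig/ ${host}:${base}/scaffold/bigwig/"
}

# Placeholder values; substitute the real organism code and assembly name.
apollo_sync_commands gggsss assemblyname apollo-node1
```

The trailing slashes on the second and third commands make rsync copy directory contents into the matching directory on the target, which should give the same layout as the original scp -r commands.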

mpoelchau commented 4 years ago

@HsiuKangHuang - Here's an update on how the new 'production' workflows should be organized. Sorry about the changes - I didn't realize earlier what steps would work well together in terms of the cwl setup.

Workflow changes:

yml file(s):

HsiuKangHuang commented 4 years ago

Okay! Thank you Monica for organizing these steps. I will modify the workflows.

mpoelchau commented 4 years ago

I think this is completed via #94.