Reason for this pull request
The existing NCI workflow depended on PBS normal queue jobs completing execution within a day. When PBS jobs stayed queued for longer (sometimes a couple of days), the NCI jobs would simply fail or do nothing.
Also, the ingest, fractional cover, and wofs jobs are each scheduled differently using qsub, which makes it difficult to use the PBS job "depends after" feature.
A better workflow was required to manage job dependencies, so that the next job is submitted only after the current PBS job has completed successfully.
To achieve this, AWS Step Functions, AWS DynamoDB, AWS Lambda functions, and AWS serverless features were used to manage job dependencies.
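The dependency rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in this pull request: `run_chain`, `submit`, and `wait_for` are hypothetical names, the stage order in the usage note is illustrative, and in the real workflow the waiting is handled by the step function rather than a blocking call.

```python
from typing import Callable, Iterable, List


def run_chain(stages: Iterable[str],
              submit: Callable[[str], str],
              wait_for: Callable[[str], str]) -> List[str]:
    """Submit each stage only after the previous one completed successfully.

    `submit` takes a stage name and returns a job id (hypothetical stand-in
    for a qsub wrapper). `wait_for` takes a job id and returns "COMPLETED"
    or "FAILED" once the job has ended (hypothetical stand-in for polling).
    Returns the list of stages that completed successfully.
    """
    completed: List[str] = []
    for stage in stages:
        job_id = submit(stage)
        if wait_for(job_id) != "COMPLETED":
            # Later stages depend on this one, so stop the chain here.
            break
        completed.append(stage)
    return completed
```

For example, calling `run_chain(["ingest", "fractional_cover", "wofs"], submit, wait_for)` with a `wait_for` that reports a failure for the second stage would never submit the third.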
Proposed solution
1) Orchestrate with AWS Step Functions and Lambdas, and update the job status in a DynamoDB database.
2) Develop four separate Lambda function handlers to manage job submission, fetch job IDs, check job status, and update the job-failed status.
3) Use the serverless step function plugin to invoke AWS Step Functions.
4) Add the following new raijin scripts:
raijin_scripts/execute_qstat/run file
raijin_scripts/execute_qdel/run file
raijin_scripts/execute_dam_scripts/run file
raijin_scripts/execute_fetch_job_ids/run file
5) Reduce the walltime and requested memory size in raijin_scripts/execute_clean/run file.
6) Refactor the following files to sync with the updated workflow:
raijin_scripts/execute_cog_conversion/run file
raijin_scripts/execute_fractional_cover/run file
raijin_scripts/execute_ingest/run file
raijin_scripts/execute_sync/run file
raijin_scripts/execute_wofs/run file
7) Reduce the walltime and requested memory size in raijin_scripts/execute_coherence/run file.
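One way the four handlers from item 2 could be wired together is sketched below as an Amazon States Language definition built in Python. Everything specific here is an assumption made for illustration: the state names, the Lambda function names, the placeholder ARNs, the `$.status` field and its values, and the 300-second wait are not taken from this pull request.

```python
import json


def lambda_arn(name: str) -> str:
    # Placeholder ARN; region, account id, and function names are hypothetical.
    return f"arn:aws:lambda:REGION:ACCOUNT_ID:function:{name}"


# Submit a job, fetch its PBS job ids, then poll until it completes or fails.
STATE_MACHINE = {
    "StartAt": "SubmitJob",
    "States": {
        "SubmitJob": {"Type": "Task", "Resource": lambda_arn("submit_job"),
                      "Next": "FetchJobIds"},
        "FetchJobIds": {"Type": "Task", "Resource": lambda_arn("fetch_job_ids"),
                        "Next": "WaitBeforeCheck"},
        # Assumed polling interval; the real value is not stated in the PR.
        "WaitBeforeCheck": {"Type": "Wait", "Seconds": 300,
                            "Next": "CheckJobStatus"},
        "CheckJobStatus": {"Type": "Task", "Resource": lambda_arn("check_job_status"),
                           "Next": "IsJobDone"},
        "IsJobDone": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.status", "StringEquals": "COMPLETED",
                 "Next": "JobSucceeded"},
                {"Variable": "$.status", "StringEquals": "FAILED",
                 "Next": "UpdateJobFailed"},
            ],
            # Any other status (still queued or running): keep polling.
            "Default": "WaitBeforeCheck",
        },
        "UpdateJobFailed": {"Type": "Task", "Resource": lambda_arn("update_job_failed"),
                            "Next": "JobFailed"},
        "JobSucceeded": {"Type": "Succeed"},
        "JobFailed": {"Type": "Fail", "Error": "PBSJobFailed",
                      "Cause": "PBS job did not complete successfully"},
    },
}


def transition_targets(definition: dict) -> set:
    """Collect every state name that the definition can transition to."""
    targets = {definition["StartAt"]}
    for state in definition["States"].values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for choice in state.get("Choices", []):
            targets.add(choice["Next"])
    return targets


if __name__ == "__main__":
    print(json.dumps(STATE_MACHINE, indent=2))
```

Because a job that is still queued falls through to the `Default` branch and waits again, a long PBS queue time only means more polling iterations rather than a failed workflow.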