A new Fonda scripting/launching approach

kamyshova commented 4 years ago

Motivation

There are two crucial drawbacks in the current Fonda implementation:

deadlocks
idle resources

Both problems stem from the way scripts are launched. Scripts are executed in two ways:

The first group of scripts (e.g. alignment/post-alignment scripts) is launched by Fonda, all at the same time.
The second group of scripts (e.g. featureCount, cufflinks) is launched from the alignment/post-alignment scripts.

All scripts are coordinated by checking the log files that the scripts produce. But a log file is never created if its script was never invoked (e.g. a script that is launched from the alignment scripts). In this case, the post-process scripts will wait forever.

Picture 1. The current Fonda launching approach

For example, the post-process scripts (qcsummary.sh, cufflinks_cohort.sh - see Picture 1) are launched simultaneously with the alignment scripts. The post-process cufflinks_cohort.sh script waits for the result of cufflinks.sh by checking the cufflinks.log file. But the alignment script can fail before it invokes cufflinks.sh. In that case cufflinks_cohort.sh will not know about the failure and will run indefinitely.
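The hazard can be illustrated with a minimal Python sketch of log-file polling. This is not Fonda's actual generated code: the function name `wait_for_log`, the marker string and the `max_polls` guard are all assumptions made for illustration. With `max_polls=None` the loop mirrors an unbounded wait, which never ends if the upstream script was never invoked.

```python
import time
from pathlib import Path


def wait_for_log(log_path, marker, poll_interval=0.01, max_polls=None):
    """Poll a log file until `marker` appears in it.

    With max_polls=None this mirrors an unbounded wait: if the upstream
    script never starts, the log is never created and the loop never ends.
    Returns True once the marker is found, False if max_polls is exhausted.
    """
    path = Path(log_path)
    polls = 0
    while max_polls is None or polls < max_polls:
        # The file may not exist at all if the producing script never ran.
        if path.exists() and marker in path.read_text():
            return True
        polls += 1
        time.sleep(poll_interval)
    return False
```

Bounding the wait only turns the hang into a detectable failure; it does not tell the waiting script why cufflinks.log never appeared.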

Deadlocks are specific to launching on an SGE cluster. Each script is submitted as an SGE cluster job. A job has specific resource requirements: the number of slots defined by the user in the Fonda global config file (the NUMTHREADS parameter in the Queue_Parameters section). The number of slots available is equal to the number of processors in the cluster. The user can request so many slots per job that the cluster is too small to run all the jobs. In this case, a job hangs in the pending state (qw).

For example, suppose the cluster has 8 CPUs and a user sets NUMTHREADS=4. First, Fonda launches 3 scripts: alignment.sh, qcsummary.sh and cufflinks_cohort.sh. Two of them (alignment.sh, qcsummary.sh) will be running. The cufflinks_cohort.sh job is in the qw state, which stands for queued and waiting. In turn, alignment.sh invokes cufflinks.sh and featureCount.sh and waits for their results. But the cluster has no available slots, so cufflinks.sh and featureCount.sh hang in the pending state, and the alignment.sh job waits for their results endlessly.
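The slot arithmetic in this example can be checked with a small Python model. It is a simplified sketch, not a model of the real SGE scheduler: `schedule` and `is_deadlocked` are hypothetical helpers, jobs are placed strictly in submission order, and the assumption that qcsummary.sh waits on the alignment results is mine rather than something stated above.

```python
def schedule(total_slots, slots_per_job, submitted):
    """Assign states in submission order: 'r' while slots remain, else 'qw'."""
    free = total_slots
    states = {}
    for job in submitted:
        if free >= slots_per_job:
            states[job] = "r"
            free -= slots_per_job
        else:
            states[job] = "qw"
    return states


def is_deadlocked(states, waits_on):
    """True if jobs are running and each of them waits on another job.

    Nothing has finished in this snapshot, so a running job with any
    dependency cannot complete and will never release its slots.
    """
    running = [job for job, state in states.items() if state == "r"]
    return bool(running) and all(waits_on.get(job) for job in running)


# Submission order from the example: three jobs from Fonda, then the two
# jobs that alignment.sh submits itself.
submitted = ["alignment.sh", "qcsummary.sh", "cufflinks_cohort.sh",
             "cufflinks.sh", "featureCount.sh"]
waits_on = {
    "alignment.sh": ["cufflinks.sh", "featureCount.sh"],
    "qcsummary.sh": ["alignment.sh"],  # assumed dependency
    "cufflinks_cohort.sh": ["cufflinks.sh"],
}

states = schedule(total_slots=8, slots_per_job=4, submitted=submitted)
deadlocked = is_deadlocked(states, waits_on)
```

With 8 slots and 4 slots per job, only two jobs fit, both of them are waiting on other jobs, and every job they depend on directly or indirectly is stuck in qw.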

Thus, at the beginning, post-process jobs take up resources without performing useful work. Conversely, idle resources are possible in the case of an autoscaling cluster.

Approach

We propose a new approach to launching scripts.

Picture 2. The new proposed approach

As shown in the picture above, we create an additional orchestrator script, master.sh, to manage all scripts. Fonda will run only master.sh directly. Initially, the master script starts all alignment.sh scripts and waits for their results. After the alignment step completes successfully, the cufflinks.sh, featureCount.sh, etc. scripts are launched if they are needed. Please note that we intend to remove the launching of scripts from the alignment/post-alignment scripts. After the per-sample scripts have executed successfully, master.sh launches the post-process scripts.

To sum up the proposed changes:

create a new master.sh script to manage all scripts
remove the launching of scripts from the alignment/post-alignment scripts
launch pipeline stages sequentially

This approach is intended to eliminate the problems described above and to make the process of launching scripts more transparent. At the same time, it preserves parallelization wherever possible.
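The control flow described above (sequential stages, parallel scripts within a stage, stop on failure) can be sketched as follows. The proposal itself calls for a shell script, master.sh, submitting jobs to SGE. The sketch uses Python purely for illustration and runs scripts locally. The names `run_pipeline`, `run_script` and `STAGES`, and the exact grouping of scripts into stages, are assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_script(script):
    """Run one shell script locally; True if it exited with status 0."""
    return subprocess.run(["bash", script]).returncode == 0


def run_pipeline(stages, run_job=run_script):
    """Run stages in order; scripts inside one stage run in parallel.

    Returns the names of the stages that completed successfully. The
    pipeline stops at the first stage in which any script fails, so later
    stages are never left waiting for results that will not arrive.
    """
    completed = []
    for name, scripts in stages:
        if scripts:
            # One worker per script keeps the whole stage parallel.
            with ThreadPoolExecutor(max_workers=len(scripts)) as pool:
                results = list(pool.map(run_job, scripts))
            if not all(results):
                break
        completed.append(name)
    return completed


# Hypothetical stage layout based on the scripts named in this issue.
STAGES = [
    ("alignment", ["alignment.sh"]),
    ("per-sample", ["cufflinks.sh", "featureCount.sh"]),
    ("post-process", ["qcsummary.sh", "cufflinks_cohort.sh"]),
]
```

Calling `run_pipeline(STAGES)` would run the three stages in order. Because a stage starts only after the previous one has finished, post-process scripts hold no resources while alignment is still running, and a failed alignment ends the run instead of leaving cufflinks_cohort.sh polling for a log that will never appear.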

mgmagi-sanofi commented 4 years ago

I agree with your assessment and am using the idea of a “master controller” in my own dealings with SGE-type environments.

I would, though, question implementing the functionality of “master.sh” as a Linux shell script.

Higher-level programming languages were invented for more than one reason, and what you describe is a good fit for implementation in Java, C++, Python, or others.

I put Java first only because it would be my personal choice and because Fonda essentially already uses (or at least used in the past) the same “master controller” idea, in the sense that “central” code written in Java generates and executes the per-sample and accumulating shell scripts.

I know the history behind why Java was chosen for that role in Fonda and can assure you it was rather incidental, so my arguments are not “pro” Java, but rather “against” shell script, in favor of a more scalable, higher-level software development platform.

Mark G Magid, Data Management / Software Solutions, Precision Oncology

From: Yulia Kamyshova notifications@github.com Sent: Friday, July 3, 2020 7:07 AM To: epam/fonda fonda@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [EXTERNAL] [epam/fonda] A new Fonda scripting/launching approach (#162)


syansanofi commented 4 years ago

I am in favor of this plan. It will be extremely useful for two cases: logging and collaboration.

Currently, FONDA does not have master log functionality, which makes it difficult to decipher the pipeline logic. Even if orchestration were implemented in Java, I still do not see how logging could be achieved with ease. The master.sh file should contain an implicit roadmap of the execution logic within it. This change would bring FONDA more in line with other pipelining libraries and languages while preserving the OOP advantages of FONDA.

In addition, it will be much easier to share pipeline designs and results with those unfamiliar with Java since shell scripting is more widespread.
