DiSSCo / SDR

Specimen Data Refinery
Apache License 2.0

Paper: 10 Simple Rules for making a software tool workflow-ready #19

llivermore closed this issue 2 years ago

llivermore commented 3 years ago

Paper providing guidance for writing software intended to slot in as a tool within a workflow, or for converting an existing standalone research-quality software tool into a reusable, composable, well-behaved citizen within a larger workflow.

Original audience was SYNTHESYS+/DiSSCo development teams but scope has now widened.

Target journal: PLOS Ten Simple Rules
Target publication date: TBC
Lead: @PaulBrack

PaulBrack commented 3 years ago

https://docs.google.com/document/d/1-ZJ9IFtba5UPnKSm6N8oRS3gtDb2vZ6PbUMtc4QcH9g/edit?usp=sharing

PaulBrack commented 3 years ago

Paper now in late stages of writing - up for discussion in Workflows@Manchester meeting on Monday

PaulBrack commented 2 years ago

First draft complete - needs referencing and general preening but I don't think we're far off having something you could submit to a journal

llivermore commented 2 years ago

Submitted to PLOS Ten Simple Rules - waiting on reviewer comments.

PaulBrack commented 2 years ago

Pre-print hasn't been published - have contacted PLOS support

PaulBrack commented 2 years ago

Feedback from reviewers

Reviewer #1: This is an interesting and well written editorial highlighting some of the key considerations scientists should think about when trying to make software tools 'workflow-ready' - a key step in getting their tool used by a wider community and/or used in automation.

The 'rules' proposed are logical and, mostly, well explained. Some of these topics are more complex than the scope of an editorial allows, so they may not be trivial for a self-taught coder-scientist to follow or abide by.

For example, Rule 6 (making software parallelizable) skips over some of the genuine difficulties that can be faced when trying to modify existing code to run in parallel, especially if the code was not written with that structure up front. This is especially true when using a language such as Python (with its Global Interpreter Lock, GIL) or when working cross-platform (where implementing multithreading consistently and stably across different operating systems can be challenging).
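The GIL point the reviewer raises is commonly worked around with process-based rather than thread-based parallelism. A minimal sketch (the names `worker` and `parallel_map_squares` are hypothetical, purely for illustration):

```python
# Sketch: sidestepping Python's GIL with process-based parallelism.
from multiprocessing import Pool

def worker(x):
    # CPU-bound work runs in a child process, so the parent
    # interpreter's GIL does not serialise it.
    return x * x

def parallel_map_squares(values):
    # Distribute the inputs over a pool of worker processes.
    with Pool(processes=4) as pool:
        return pool.map(worker, list(values))

if __name__ == "__main__":
    # The __main__ guard is required on platforms that spawn
    # processes (e.g. Windows), one of the cross-platform pitfalls
    # the reviewer alludes to.
    print(parallel_map_squares(range(8)))
```

This only works when `worker` is a picklable, module-level function, which is itself an example of how code not structured for parallelism up front may need restructuring.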

That being said, the overall editorial is scoped well and contains a reasonable number of references to additional material for the interested reader.

A few minor thoughts:

The documentation rule skips over the (perhaps obvious) need to comment the code itself.

I would also explicitly tell the reader that they should not be afraid to rewrite their code, even substantially. It may be necessary, and I think that should be acknowledged.

The article just ends - there is no concluding thought or statement.

The article goes to great lengths to make very few specific technical recommendations (i.e. language, platform, engine choice), which is good and sensible for longevity and maximum usefulness. However, this abstraction could leave an inexperienced reader slightly lost. No action is necessarily required here, but it should be recognized.

Overall, I think this editorial will be well received by the community.

Reviewer #2: I would like to first commend the authors on an excellent, well-written, informative article. At several points I found myself nodding along and thinking "that's certainly something I should do more often". I look forward to sharing the published version with colleagues and students. However, I do have some thoughts on improving the article still further.

It seems to me that the title is too specific. I'm not sure that these rules pertain to just workflow software? I generally don't write software with workflows in mind, but I found myself thinking that virtually every point in this article is still applicable to my work. It seems like many of these rules apply to scientific software in general. "10 Simple Rules for making a reusable software tool" for example?

Following on from the above, I might be inclined to reorder the rules as follows:

Rule 1: Make your tool simple to install
Rule 2: Document your tool
Rule 3: Make your tool maintainable
Rule 4: Make sure a workflow engine can talk to your software easily
Rule 5: Follow the principle of least surprise
Rule 6: A software tool should just do one thing
Rule 7: Make output reproducible
Rule 8: Make your tool parallelisable
Rule 9: Make your workflow tool a good citizen
Rule 10: Carefully consider human interaction

Furthermore, I'm not sure there's quite 10 distinct rules here. There's a good deal of overlap between some - I've given more details below.

If I might make a suggestion for an alternative rule: "Make your software open source". The authors do make occasional reference to leaving code open, ensuring that code can be downloaded and the use of open source licenses, but I think the issue of open source is worthy of its own rule.

Finally, there's no conclusion? The article ends abruptly after Rule 10.

Specific comments on rules:

Reviewer #3: The authors submitted a "10 Rules" article intended to guide developers of applications that can be easily included in fully automated workflows, making them convenient to (re-)use, reproducible, and maintainable. The brief introduction is followed by general rules for interaction with the workflow engine (i.e. the software that runs the individual applications in a pipeline). Then, the authors make suggestions on the distribution and documentation of a software tool and its code. Concrete suggestions regarding the design and behavior of the software that further its integration into automated and (semi-)manual pipelines are given afterwards. Finally, the authors urge readers to design their software as small reusable units.

The rules given in the article are useful for developers of workflow tools and point out many pitfalls that can impact (re-)usability, maintainability, and ease of integration. There are only a few minor points that require additional clarification, or where readers who are not fully versed in the topic would benefit from additional context.

In the following paragraphs, these minor comments will be explained in detail:

The introduction gives a very brief and general explanation of the topic and succeeds in establishing the problem. However, it assumes that readers are familiar with the terminology and have their own experience with workflows and related software. To lower this bar of entry, the term "workflow" should be defined and briefly introduced, as it is central to the article. Readers would also benefit from concrete examples of workflow engines and of software developed to run in such an engine.

Rule 1 explains how workflow-ready software should behave and how parameters should be provided. In the last paragraph, the authors explain the dangers of relying blindly on well-known software. The example given is Microsoft Office's unattended RPA feature. A brief explanation of this feature would be helpful for readers who are only familiar with standard MS Office.

Rule 2 suggests that readers make their software simple to install, for example by utilizing standardized package managers (concrete examples are given in the text). The authors correctly point out the dangers of not carefully managing dependencies, which can lead to "dependency hell". While the authors point to resources that go into the details of this issue, a brief explanation and a small example would emphasize its importance.
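The kind of small example the reviewer asks for might look like this (package names and versions below are purely illustrative, not recommendations): if one dependency requires `libfoo>=2` while another pins `libfoo==1.x`, the two cannot be installed together. Recording exact, known-good versions in a lock file such as a Python `requirements.txt` at least makes the environment reproducible and any conflict visible at install time:

```
# requirements.txt -- illustrative pins only
numpy==1.21.4
pandas==1.3.5   # itself requires numpy>=1.17.3, satisfied by the pin above
```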

Rule 3 provides information on how developers should document their tools and what kinds of documentation should be available. This includes a change log in which breaking changes in particular should be listed. Readers could also be made aware that deprecating interfaces eases the transition before a breaking change (this is entirely optional, as it is not fully related to documentation).
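The deprecation step the reviewer suggests can be sketched in Python with the standard `warnings` module (the two entry-point names are hypothetical):

```python
import warnings

def new_entry_point(data):
    # Hypothetical replacement interface.
    return sorted(data)

def old_entry_point(data):
    # Warn at least one release before removing the old interface,
    # so downstream workflows get time to migrate.
    warnings.warn(
        "old_entry_point() is deprecated; use new_entry_point() instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return new_entry_point(data)
```

Callers keep working unchanged, but see a `DeprecationWarning` that can also be noted in the change log ahead of the breaking release.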

Rule 4 gives tips on making a tool maintainable and long-lived (i.e. actively used after its publication). The authors correctly point out that an appropriate versioning scheme and version control software should be used, and that the project should be hosted on reliable platforms. Here, the authors should clarify that self-hosted version control software is adequate during [the continued] development of a tool. The authors also recommend utilizing OSI-approved open source licenses. Here, readers could be made aware that some institutes have policies regarding licensing and copyright, and that they should inform themselves before publishing code (this is optional).

Rule 5 suggests designing software so that it behaves as expected. One such behavior is the usage of exit codes, where the authors suggest utilizing the code "0" for a successful run and "1" for errors. This suggestion is not portable, as different operating systems have different standards for exit codes. Software developers should prefer functions or compile-time constants (usually provided by the programming language) that ensure the appropriate exit codes are returned.
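In C the reviewer's advice means using `EXIT_SUCCESS`/`EXIT_FAILURE` from `<stdlib.h>`. A Python sketch of the same idea, keeping the exit-code policy in one named place rather than scattering bare literals (the `main` body is a hypothetical placeholder):

```python
import sys

# Named constants keep the tool's exit-code policy in one place.
EXIT_SUCCESS = 0
EXIT_FAILURE = 1

def main(argv):
    # Hypothetical tool body: insist on exactly one input argument.
    if len(argv) != 2:
        print("usage: tool INPUT", file=sys.stderr)
        return EXIT_FAILURE
    return EXIT_SUCCESS

if __name__ == "__main__":
    # sys.exit() propagates the code to the workflow engine.
    sys.exit(main(sys.argv))
```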

Rules 6 and 7 warn developers to consider parallelization methods utilized by a workflow engine, which can lead to multiple runs of the software at the same time. One point is the creation of unique temporary directories to prevent multiple instances from interfering with each other; these directories should then be cleaned up after the run. Here, the authors could suggest that developers include an option to keep temporary files for debugging purposes (this is optional).
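Both halves of this point, unique per-run scratch space and an opt-in to keep it for debugging, fit in a few lines of Python (the function name and file contents are hypothetical):

```python
import shutil
import tempfile
from pathlib import Path

def run_with_scratch(keep_temp=False):
    # mkdtemp() creates a uniquely named directory, so concurrent
    # instances launched by a workflow engine cannot collide.
    workdir = Path(tempfile.mkdtemp(prefix="mytool-"))
    try:
        (workdir / "intermediate.txt").write_text("scratch data")
        # ... real processing would go here ...
    finally:
        if not keep_temp:
            shutil.rmtree(workdir)  # clean up after the run
    # With keep_temp=True the directory is left in place for debugging.
    return workdir if keep_temp else None
```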

Rule 8 is about the reproducibility of results: ideally, results should be equal on a byte level across two different runs. The authors point to file hashing algorithms that can be utilized to verify whether this is the case. The usage of such a tool is deemed "trivial". It must be pointed out that, for general (non-programmer, i.e. someone who will utilize the workflow) users, file verification is not trivial at all, though it can be easily learned. The authors also make readers aware of pitfalls in serializing outputs as text. Here, readers could be advised to utilize standard data exchange formats such as JSON or XML where applicable.
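The byte-level check the reviewer describes amounts to hashing each output file and comparing digests across runs. A minimal sketch using Python's standard library (the function name is hypothetical):

```python
import hashlib

def sha256_of(path):
    # Stream the file in fixed-size chunks so large outputs
    # do not have to fit in memory at once.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two runs are byte-identical exactly when `sha256_of()` returns the same hex string for their outputs.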

Rule 9 talks about how developing an interactive software tool can be a benefit or a disadvantage. It could be pointed out to readers that there is the possibility of running an automated workflow first and having the result reviewed by a human, the benefit being that the human can catch mistakes without the whole task relying on them (this is optional).

Rule 10 discusses the composition of tools, each of which should ideally do only one task. The authors could point to the "Unix philosophy" as a more concrete example (this is optional).

Reviewers who wish to reveal their identities to the authors and other reviewers should include their name here (optional). These names will not be published with the manuscript, should it be accepted.

Reviewer #1: (No Response)

Reviewer #2: David J Barry

Reviewer #3: (No Response)

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

stain commented 2 years ago

I've taken over as corresponding author.

20th December 2021 we sent in the response to authors and edited version. See https://zenodo.org/record/5901220

Paper now accepted and at proof stage - I'm checking the proof and responding by Wednesday 2022-03-09.

Publication info:

Paul Brack, Peter Crowther, Stian Soiland-Reyes, Stuart Owen, Douglas Lowe, Alan R Williams, Quentin Groom, Mathias Dillen, Frederik Coppens, Björn Grüning, Ignacio Eguinoa, Philip Ewels, Carole Goble (2022):
10 Simple Rules for making a software tool workflow-ready. PLOS Computational Biology 18(3):e1009823 (in press)
[preprint], to appear as https://doi.org/10.1371/journal.pcbi.1009823

stain commented 2 years ago

Corrected proof sent in, including a fixed figure (they broke the resolution)

Will close issue as we know what the DOI will be. Will supposedly be published by 31st March 2022.