dib-lab / 2020-workflows-paper

Strategies for leveraging workflow systems to streamline large-scale biological analyses
https://dib-lab.github.io/2020-workflows-paper

Round 2 Comments #29

Closed shannonekj closed 4 years ago

shannonekj commented 4 years ago

I made these suggestions separate from the material in my PR (#30) because they seemed more discussion based or bigger than anything I modified in the PR.

GENERAL COMMENTS

A general note is sometimes you refer to "you" and other times you refer to "scientists" when talking about the options readers have.

When referring to software or websites, should there be a link to them??

Some sentences have "blah, blah and blah" and others have "blah, blah, and blah". I saw that it was the former initially and started to change them, but then I noticed the sentence structure was the latter as the document went on. I'm happy to go through and standardize to whichever you both prefer.

SPECIFIC COMMENTS

04.workflows-and-software.md

Getting started with workflows

Choosing a workflow system

Table 1

This table is great. A couple things I would want to know if reading this are: Which can be run on a cluster? and what underlying languages do they use (if any)

Wrangling Scientific Software

05.project-management.md

Workflow-Based Project Management

Visualize your workflow

Version control your project

06.data-resource-management.md

Data and resource management for workflow-enabled biology

Table 2

Getting Started with sequencing data

Protect valuable data

Perform quality control at every step

Securing and managing appropriate computational resources

Table 3

Getting started with resource management

Gain quick insights using sketching algorithms

Use the right tools for your question

This section would benefit from a wrap-up sentence explaining that more is not necessarily better; it is just more. In data-intensive biology we have A LOT, so simplicity is often better.

I think two more things to think about in this section are:

  1. the sentiment that all software has biases too, and anyone who uses a piece of software should engage in understanding its limitations.
  2. there is A LOT of software out there, and one can spend endless amounts of time trying to use the newest tech or understand it all. It is often better to pick the software that is currently best for the research question and get a pipeline/workflow up and running to make biological inferences, rather than get bogged down in the vast number of options.

FIGURE SUGGESTIONS

Figure 1

workflow_figure_reorder.pdf

Figure 2

conda_figure_ABC.pdf

This figure does a good job representing Conda channels and YAML files! I think it could be useful to visualize how workflow systems integrate with Conda (see C.). It may also help readers understand that small packages with only a few pieces of software installed are useful! The caption could be something along the lines of "Conda integration with workflow systems. YAML files containing only the software necessary for a specific step in the analysis can be called on by workflow systems." Additionally, I've added an "r" channel, since we talk about it in the text.

Figure 6

version_control_diff_lines.pdf

Figure 8

checksum_file.pdf

bluegenes commented 4 years ago

Thanks @shannonekj! Some comments on your points below:

A general note is sometimes you refer to "you" and other times you refer to "scientists" when talking about the options readers have.

  • good point. The goal was to use "researchers" whenever possible, but we also wanted to be accessible/not too formal, and there were some cases where the sentences didn't seem to work without "you." Are there any sentences that really stick out and need changing? Do you find the "you" too colloquial? Important to strike the right tone!

When referring to software or websites, should there be a link to them??

  • ideally upon first mention, yes. I am a bit worried about citation limits, but that's a "for later" problem. Could you point out any you notice so we can help add?

Some sentences have "blah, blah and blah" and others have "blah, blah, and blah". I saw that it was the former initially and started to change them, but then I noticed the sentence structure was the latter as the document went on. I'm happy to go through and standardize to whichever you both prefer.

I'm an Oxford comma aficionado, changed most back to #2 :).

04.workflows-and-software.md

  • ROpenSci's Drake is not in Table 1

We wanted to make sure we mentioned Drake (probably quite useful/friendly for R folks!), but I don't actually know anyone using it, so was hesitant to put it in the table, which contains the four workflow systems we think are the most widely used right now. Maybe we just need to make the point that the four in the table are the most widely used? Thoughts @taylorreiter?

Table 1

This table is great. A couple things I would want to know if reading this are: Which can be run on a cluster? and what underlying languages do they use (if any)

hmm. I think CWL/WDL are their own languages? All can be run on clusters! I'll think about how to add this to the caption

Wrangling Scientific Software

  • ppg 1 = Should we add a sentence about how software management systems can be used without a workflow system?

Yes! Can you add one, please?

05.project-management.md

Workflow-Based Project Management

Visualize your workflow

  • The sentence that refers to Figure 5 may benefit from stating what the DAG is depicting (e.g. Plass assembly of a query neighborhood[?]) "For example, Figure {@fig:sgc_workflow} exhibits a modified Snakemake workflow visualization of a Plass assembly of a query neighborhood from a recent publication [@doi:10.1101/462788]."

@taylorreiter maybe you can add a sentence to figure 5 caption?

Version control your project

  • suggest linking or citing Zenodo

yes! I added this as a "suggestion" to your PR

06.data-resource-management.md

Data and resource management for workflow-enabled biology

Table 2

  • There isn't one paper that stands out as RAD-seq best practices, but here are a few that discuss various uses, things to be aware of, and pitfalls of RAD-seq: doi:10.1111/2041-210X.12700, doi:10.3389/fgene.2019.00533, doi:10.1038/nrg.2015.28, doi:10.1111/1755-0998.12669, doi:10.1111/1755-0998.12677

  • other useful columns may be relative costs, biases, and benefits & limitations of the sequencing type (for example, RAD-seq is fairly low cost and good for non-model organisms, but is limited in that it provides a non-random sampling of a genome)... biases could be lumped into benefits & limitations

Definitely -- We decided to take a minimal approach bc it would be very hard to produce a comprehensive table for even just the most common of sequencing approaches. But we'd be happy to have a more detailed table if you can think of a way to discuss each sequencing type + considerations for a common set of applications....

Getting Started with sequencing data

Protect valuable data

  • ppg 1 = the first sentence may benefit from mentioning that the metadata is important (& should be backed up) too! Or at least mentioning that one could carry out an analysis and not know how to interpret results w/o metadata!

good point - think we thought we discussed it earlier, but worth reiterating! Can you add pls?

Perform quality control at every step

  • the "Look for common biases in sequencing data" section could benefit from Table 2 having a column that lists "Biases" for each sequencing type
  • the "Check for contamination" sections could benefit from adding a method for how to detect each.

@taylorreiter I'll leave response to this section to you

  • the "Consider the costs and benefits of stringent quality control for your data" section -- the doi: 10.3389/fgene.2019.00533 that I listed above is a good pop gen example of how upstream filtering affects downstream conclusions. They looked at how filtration affected three different RADseq datasets. They found that filtering PCR duplicates and SNP filtering parameters affected the number of polymorphic loci retrieved and the degree of genetic differentiation differently in each dataset. That, at least to me, motivates manual curation of SNP filtration parameters based on the research question being addressed.

Great - can you write a phrase to link this in please? Either as the second clause of the for example ... isoform discovery sentence or as its own sentence, I think.

Securing and managing appropriate computational resources

  • ppg 1 = the word "cluster" is introduced here but hasn't been defined/explained before. I think the readers will likely know what a cluster is but it could be worth explicitly stating cluster essentially = HPC

good point - can you add a phrase or sentence introducing, pls?

Table 3

  • I know this is a table of the "cloud providers" but it might benefit from restructuring to "Compute resources" and listing HPCs for comparison. Then there could be a "host" column that denotes cloud vs ...not-cloud?

Hmm... I think in my head, the "get an account at your local university hpc" was in the text, then the cloud resources (where to look if you can't do that) is in the table. Maybe worth adding a single line to the table for "locally-managed HPC"? Need to get across the idea that those are restricted-access though, while the other resources are open to most for grant applications or payments. Table title of "Compute resources" is fine and we could change "Cloud Provider" --> "Provider" (column header). Thoughts @taylorreiter? It could also be good to demystify "cloud" by defining it as "any computer you log into over a network" -- maybe in caption?

Getting started with resource management

Gain quick insights using sketching algorithms

  • Coming from a place of knowing nothing about sketching algorithms and reading this section, I understand their usefulness from this paragraph, but I don't feel more equipped to go out and use them. Could you provide more detail or cite tools/examples of sketching algorithms? Do I have to create these algorithms or do they already exist in tools??

thanks - I'll attempt to bring it down into the practical realm
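To make the idea concrete, here is a minimal sketch of one sketching algorithm, a bottom-s MinHash over k-mers, written in plain Python. It is only an illustration under stated assumptions: the function names, the k-mer size, the sketch size, and the choice of SHA-1 as the hash are all hypothetical and are not taken from the paper. In practice you would not need to write this yourself, since existing tools such as Mash and sourmash implement MinHash-style sketches.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer with a stable hash (SHA-1 is an illustrative choice)."""
    digest = hashlib.sha1(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(seq, k=5, size=20):
    """Bottom-s sketch: keep only the `size` smallest k-mer hash values."""
    hashes = sorted(hash_kmer(km) for km in kmers(seq, k))
    return set(hashes[:size])


def jaccard_estimate(sketch_a, sketch_b, size=20):
    """Estimate Jaccard similarity of two sequences from their bottom-s sketches."""
    # The `size` smallest hashes of the union form a sketch of the combined k-mer set.
    merged = set(sorted(sketch_a | sketch_b)[:size])
    if not merged:
        return 0.0
    # Fraction of those hashes that are present in both input sketches.
    shared = merged & sketch_a & sketch_b
    return len(shared) / len(merged)


if __name__ == "__main__":
    # Hypothetical toy sequences, for illustration only.
    seq_a = "ACGTACGTTGCAACGTTAGCTAGCTAGGATCC"
    seq_b = "ACGTACGTTGCAACGTTAGCTAGCTAGGTTCC"
    estimate = jaccard_estimate(minhash_sketch(seq_a), minhash_sketch(seq_b))
    print(f"Estimated Jaccard similarity: {estimate:.2f}")
```

The point of the sketch is that each sequence is reduced to a small, fixed number of hash values, so comparisons stay cheap even when the underlying datasets are very large.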

Use the right tools for your question

This section would benefit from a wrap-up sentence explaining that more is not necessarily better; it is just more. In data-intensive biology we have A LOT, so simplicity is often better.

That would be great - can you try adding a first-pass version and I will edit if needed?

I think two more things to think about in this section are:

  1. the sentiment that all software has biases too, and anyone who uses a piece of software should engage in understanding its limitations.
  2. there is A LOT of software out there, and one can spend endless amounts of time trying to use the newest tech or understand it all. It is often better to pick the software that is currently best for the research question and get a pipeline/workflow up and running to make biological inferences, rather than get bogged down in the vast number of options.

Same comment - can you try adding a first-pass version of this (I'm guessing one sentence of each point) and I will edit as needed?

bluegenes commented 4 years ago

re: figures

Fig 1 - agree! I'll reorder the figure so the letters remain in order

Fig 2 - I agree with your point, but think this figure might be a bit complex. Let's incorporate for now and discuss ways to simplify if possible?

Fig 6 - good point, love the changes

Fig 8 - i like it -- definitely gets the point across better. Thoughts @taylorreiter?

taylorreiter commented 4 years ago

I love figure 8! Thanks @shannonekj. @bluegenes, I think I address the things I'm tagged in below!

RE table 3,

Hmm... I think in my head, the "get an account at your local university hpc" was in the text, then the cloud resources (where to look if you can't do that) is in the table. Maybe worth adding a single line to the table for "locally-managed HPC"? Need to get across the idea that those are restricted-access though, while the other resources are open to most for grant applications or payments. Table title of "Compute resources" is fine and we could change "Cloud Provider" --> "Provider" (column header). Thoughts @taylorreiter? It could also be good to demystify "cloud" by defining it as "any computer you log into over a network" -- maybe in caption?

I like cloud -> any computer you log into over a network, and cloud provider -> provider.

RE:

the "Consider the costs and benefits of stringent quality control for your data" section -- the doi: 10.3389/fgene.2019.00533 that I listed above is a good pop gen example of how upstream filtering affects downstream conclusions. They looked at how filtration affected three different RADseq datasets. They found that filtering PCR duplicates and SNP filtering parameters affected the number of polymorphic loci retrieved and the degree of genetic differentiation differently in each dataset. That, at least to me, motivates manual curation of SNP filtration parameters based on the research question being addressed.

That citation should definitely be added! Does it need a lot of motivation, or can we just add it as a citation? I prefer the latter, but am fine if it does need motivation.

RE drake:

We wanted to make sure we mentioned Drake (probably quite useful/friendly for R folks!), but I don't actually know anyone using it, so was hesitant to put it in the table, which contains the four workflow systems we think are the most widely used right now. Maybe we just need to make the point that the four in the table are the most widely used? Thoughts @taylorreiter?

I also don't know anyone using it, but it was hyped a lot at RStudio Conf. I like the idea of adding a sentence of "most popular", but how do we support that? Citation count?

RE sgc fig:

The sentence that refers to Figure 5 may benefit from stating what the DAG is depicting (e.g. Plass assembly of a query neighborhood[?]) "For example, Figure {@fig:sgc_workflow} exhibits a modified Snakemake workflow visualization of a Plass assembly of a query neighborhood from a recent publication [@doi:10.1101/462788]."

Are we replacing this with a different workflow fig? If not, should the description be in the caption or in the main text?

shannonekj commented 4 years ago

Replying to @bluegenes

A general note is sometimes you refer to "you" and other times you refer to "scientists" when talking about the options readers have.

  • good point. The goal was to use "researchers" whenever possible, but we also wanted to be accessible/not too formal, and there were some cases where the sentences didn't seem to work without "you." Are there any sentences that really stick out and need changing? Do you find the "you" too colloquial? Important to strike the right tone!
  • Upon rereading, I think "you" really does make sense for giving instructions, and (as a reader) I feel like I will be able to take the actions recommended by the paper. The researcher/scientist word choice seems to refer to the broader picture, which makes sense. All good!

When referring to software or websites, should there be a link to them??

  • ideally upon first mention, yes. I am a bit worried about citation limits, but that's a "for later" problem. Could you point out any you notice so we can help add?
  • adds
  • Workflow Description Language (WDL) -- a link for this one is located in the table but CWL had a citation as well
  • Rabix
  • Terra
  • Conda
  • PyCharm
  • RStudio
  • singularity
  • docker
  • GitHub
  • Git
  • Mercurial
  • GitLab
  • Bitbucket
  • Binder
  • Reprozip
  • WholeTale
  • Shiny
  • Plotly
  • Pavian Shiny App
  • vega-lite
  • INSDC
  • SRA
  • ENA
  • DDBJ
  • Google Drive
  • Backblaze
  • Box
  • Dropbox
  • Amazon Web Services
  • Galaxy
  • NIH Common Fund Data Ecosystem
  • each in Table 4 (Research cloud resources)
  • GenBank
  • R-Ladies
  • Gitter
  • Slack
  • Google Groups
  • Stack Overflow
  • Biostars
  • SEQanswers

This list is quite long... if references are a concern, depending on the journal we could provide some kind of box or table with the URLs for all of the software/websites/databases/apps and denote that an item is in the box with a superscript character? Maybe there is a better way. Just thinking on the fly.

Table 3

  • I know this is a table of the "cloud providers" but it might benefit from restructuring to "Compute resources" and listing HPCs for comparison. Then there could be a "host" column that denotes cloud vs ...not-cloud?

Hmm... I think in my head, the "get an account at your local university hpc" was in the text, then the cloud resources (where to look if you can't do that) is in the table. Maybe worth adding a single line to the table for "locally-managed HPC"? Need to get across the idea that those are restricted-access though, while the other resources are open to most for grant applications or payments. Table title of "Compute resources" is fine and we could change "Cloud Provider" --> "Provider" (column header). Thoughts @taylorreiter? It could also be good to demystify "cloud" by defining it as "any computer you log into over a network" -- maybe in caption?

+1 for "Cloud Provider" --> "Provider" if you think it's worth adding in locally-managed HPC. I like that the table would give all the options if added, but also understand if you wanted to keep it focused on the cloud. And absolutely +1 for giving a few words of detail to describe the "cloud".

I will address the following replies in my next PR:

04.workflows-and-software.md

Wrangling Scientific Software

  • ppg 1 = Should we add a sentence about how software management systems can be used without a workflow system?

Yes! Can you add one, please?

06.data-resource-management.md

Data and resource management for workflow-enabled biology

Table 2

  • There isn't one paper that stands out as RAD-seq best practices, but here are a few that discuss various uses, things to be aware of, and pitfalls of RAD-seq: doi:10.1111/2041-210X.12700, doi:10.3389/fgene.2019.00533, doi:10.1038/nrg.2015.28, doi:10.1111/1755-0998.12669, doi:10.1111/1755-0998.12677

  • other useful columns may be relative costs, biases, and benefits & limitations of the sequencing type (for example, RAD-seq is fairly low cost and good for non-model organisms, but is limited in that it provides a non-random sampling of a genome)... biases could be lumped into benefits & limitations

Definitely -- We decided to take a minimal approach bc it would be very hard to produce a comprehensive table for even just the most common of sequencing approaches. But we'd be happy to have a more detailed table if you can think of a way to discuss each sequencing type + considerations for a common set of applications....

Getting Started with sequencing data

Protect valuable data

  • ppg 1 = the first sentence may benefit from mentioning that the metadata is important (& should be backed up) too! Or at least mentioning that one could carry out an analysis and not know how to interpret results w/o metadata!

good point - think we thought we discussed it earlier, but worth reiterating! Can you add pls?

  • the "Consider the costs and benefits of stringent quality control for your data" section -- the doi: 10.3389/fgene.2019.00533 that I listed above is a good pop gen example of how upstream filtering affects downstream conclusions. They looked at how filtration affected three different RADseq datasets. They found that filtering PCR duplicates and SNP filtering parameters affected the number of polymorphic loci retrieved and the degree of genetic differentiation differently in each dataset. That, at least to me, motivates manual curation of SNP filtration parameters based on the research question being addressed.

Great - can you write a phrase to link this in please? Either as the second clause of the for example ... isoform discovery sentence or as its own sentence, I think.

Securing and managing appropriate computational resources

  • ppg 1 = the word "cluster" is introduced here but hasn't been defined/explained before. I think the readers will likely know what a cluster is but it could be worth explicitly stating cluster essentially = HPC

good point - can you add a phrase or sentence introducing, pls?

Use the right tools for your question

This section would benefit from a wrap-up sentence explaining that more is not necessarily better; it is just more. In data-intensive biology we have A LOT, so simplicity is often better.

That would be great - can you try adding a first-pass version and I will edit if needed?

I think two more things to think about in this section are:

  1. the sentiment that all software has biases too, and anyone who uses a piece of software should engage in understanding its limitations.
  2. there is A LOT of software out there, and one can spend endless amounts of time trying to use the newest tech or understand it all. It is often better to pick the software that is currently best for the research question and get a pipeline/workflow up and running to make biological inferences, rather than get bogged down in the vast number of options.

Same comment - can you try adding a first-pass version of this (I'm guessing one sentence of each point) and I will edit as needed?

shannonekj commented 4 years ago

Re: Re: figures from @bluegenes

Re: figures

Fig 1 - agree! I'll reorder the figure so the letters remain in order

The reorder looks great!

Fig 2 - I agree with your point, but think this figure might be a bit complex. Let's incorporate for now and discuss ways to simplify if possible?

Absolutely.

Fig 6 - good point, love the changes

Yay, thanks!

Fig 8 - i like it -- definitely gets the point across better. Thoughts @taylorreiter?

woot!

shannonekj commented 4 years ago

Replying to @taylorreiter

RE:

the "Consider the costs and benefits of stringent quality control for your data" section -- the doi: 10.3389/fgene.2019.00533 that I listed above is a good pop gen example of how upstream filtering affects downstream conclusions. They looked at how filtration affected three different RADseq datasets. They found that filtering PCR duplicates and SNP filtering parameters affected the number of polymorphic loci retrieved and the degree of genetic differentiation differently in each dataset. That, at least to me, motivates manual curation of SNP filtration parameters based on the research question being addressed.

That citation should definitely be added! Does it need a lot of motivation, or can we just add it as a citation? I prefer the latter, but am fine if it does need motivation.

I think just adding the citation after sentence 4 or 5 would be fine but perhaps we'd want a similar paper but from another sequencing type?

RE drake:

We wanted to make sure we mentioned Drake (probably quite useful/friendly for R folks!), but I don't actually know anyone using it, so was hesitant to put it in the table, which contains the four workflow systems we think are the most widely used right now. Maybe we just need to make the point that the four in the table are the most widely used? Thoughts @taylorreiter?

I also don't know anyone using it, but it was hyped a lot at RStudio Conf. I like the idea of adding a sentence of "most popular", but how do we support that? Citation count?

Hmm then keeping this section as is in relation to my comment seems fine. Though, now I'm thinking the "(see Table 1)" should be after sentence 1 as the table contains the workflow systems with some reference docs but does not contain the strengths or how they meet computing goals differently. Thoughts on moving?

RE sgc fig:

The sentence that refers to Figure 5 may benefit from stating what the DAG is depicting (e.g. Plass assembly of a query neighborhood[?]) "For example, Figure {@fig:sgc_workflow} exhibits a modified Snakemake workflow visualization of a Plass assembly of a query neighborhood from a recent publication [@doi:10.1101/462788]."

Are we replacing this with a different workflow fig? If not, should the description be in the caption or in the main text?

To me it makes sense to be in the caption. The main text is speaking in more broad terms when it refers to the figure.

From @bluegenes comment

Perform quality control at every step

  • the "Look for common biases in sequencing data" section could benefit from Table 2 having a column that lists "Biases" for each sequencing type
  • the "Check for contamination" sections could benefit from adding a method for how to detect each.

@taylorreiter I'll leave response to this section to you

Any thoughts @taylorreiter?

taylorreiter commented 4 years ago

#43 adds radseq refs.

taylorreiter commented 4 years ago

Hi @shannonekj , I think we addressed the majority of these points in #43, #40, #39, #37, #30. If you think we missed anything important, please open a new issue, or one issue per item that we missed. Thank you for all of your feedback/contributions!!