dib-lab / 2020-workflows-paper

Strategies for leveraging workflow systems to streamline large-scale biological analyses
https://dib-lab.github.io/2020-workflows-paper

Comments from Shannon -- April 20, 2020 #17

Closed shannonekj closed 4 years ago

shannonekj commented 4 years ago

I'm leaving these comments as an 'issue' in response to the PDF sent out on 17/04/2020 (commit 3704689a2b294ee9b2ac67c441362d6820092cf0); it's a more detailed response than what we talked about at lab meeting.

My biggest takeaways:

The paper has most of the skeleton of what you originally wanted to convey (of course this may have changed since the lab meeting on Monday) and is a good 30,000-foot overview, but it glosses over a few of the most basic concepts while providing detail on some of the more specialized scenarios. As it is, the paper is a lot of text and not a lot of visuals; I fear it may not be that useful as a piece of literature to learn from, and I think it would benefit from a few more figures and a bit more detail in some of the tables (see suggestions below).

If this paper is to be a roadmap, I think it would benefit from expanding the early sections to reference useful tutorials and provide specific examples. To me this paper has a similar goal to GGG298, as well as the underlying motivation of the GGG201b computational material (which was to set up anyone who consumes the material with more resources and a better understanding of conducting research in data-intensive biology), but it is a more concise guide (i.e. not learning modules) to what someone in the field should do, or at least know about.

I think a lot of people could benefit from a roadmap-type paper that is more of a guide for how to streamline data-intensive biology workflows. That way the paper isn't trying to be a 'how-to' (which would require hundreds of pages) so much as a guide.

Page by page details:

Page 3

- Table 1: Add Iso-Seq, linked reads (10X and Loop Genomics), and long reads (ONT, PacBio)? Maybe they fall under eukaryotic sequencing, but this table could also hold more information about some of the main sequencing technologies (I am happy to provide anything for eukaryotic genome assembly and pop gen if you'd like).
- Table 1: "rad seq" should be RAD sequencing. There are many flavors of RADseq, such as ddRAD, 2bRAD, and hyRAD (this blog goes through the RAD dramas: https://www.molecularecologist.com/2017/04/to-radseq-or-not-to-radseq/).
- The Vertebrate Genomes Project has a database of phased chromosome-level assemblies called the GenomeArk, where they host all of their data (raw sequence files, intermediate scaffolded files, and the final reference assembly). Could be good to add!
- The text of the first paragraph is a lot of lists of things. I think it might be easier to digest as a table with a bit of detail on each database/repository.
- Paragraph 2: Other factors to consider are which individuals to include (this is crucial in pop gen when different individuals belong to different populations) or which individual to use (in eukaryotic genome assembly you'd want the heterogametic sex).

Page 4

- Paragraph 1: Another thing to record is the machine that sequenced the data, the sequencing well number, and the sequencing plate number (we've seen batch effects for both plate and well in my lab). Ecology-focused experiments may also want to record site location, specimen age, and the sex of the specimen.
- Table 2: I like this list, but if I read it as a consumer I'd probably want to know which are the best or easiest to use, or most capable/popular (i.e. which I'd be able to find assistance for from the scientific community/forums).
- Paragraph 2: I think this is a really valuable paragraph and it might benefit from a figure or table to show exactly how a project's data can expand in size and what one might expect in a few sample scenarios (see a really rough table here: https://github.com/ngs-docs/2020-GGG298/tree/master/Week5-project_organization_and_UNIX_shell#storing-data). Also, it may be useful to touch on the amount of time one expects to be interacting with data: weeks? months? years?
- Data Management and Organization: I think this section could benefit from some examples. It is one thing to say "think about organization" and a whole other thing to say "here are some ways to think about it" and then give a few methods for organizing data (one possible layout is sketched just below this list).
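To make that last suggestion concrete, here's a minimal shell sketch of one possible project layout (directory and file names are just an example, not a prescription):

```bash
# One possible layout: raw data kept separate (and read-only) from
# scripts, intermediate results, and notes
mkdir -p myproject/{raw_data,scripts,results,docs}

# Once the raw files have been downloaded into raw_data/,
# protect them from accidental edits
chmod a-w myproject/raw_data/*.fastq.gz
```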

Page 5

- Data storage and transfer: If you're going to introduce md5sums to readers, it may be useful to include an example md5sum checksum file as a figure, or go into a bit more detail on how an md5sum file is a list of unique file IDs (checksums) paired with file names, so one can verify the contents of each file. Right now it is unclear how an md5sum is different from, or more reliable than, looking at the size column of ls -l. You might also want to explain that someone generating novel sequencing data can specifically ask for an md5sum file from the sequencing center they used.
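Something like this is what I had in mind: a minimal sketch of what a checksum file looks like and how it is checked (file names are made up):

```bash
# Record a checksum for every raw read file; each output line is
# "<32-character checksum>  <file name>"
md5sum raw_data/*.fastq.gz > raw_data/checksums.md5

# Example contents of checksums.md5 (checksums truncated for display):
#   a1b2c3d4...  raw_data/Ht_1998_A01_BMAG043_ATGATA.fastq.gz
#   e5f6a7b8...  raw_data/St_2000_B04_BMAG061_CTGGAA.fastq.gz

# After a transfer, verify that file *contents* (not just sizes) match
md5sum -c raw_data/checksums.md5
```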

Page 6

- Use consistent and descriptive names: I think this section could use a figure or an example of what a file name might look like. For example, I work with RAD sequencing data from two different smelt species, and after I split my data based on barcode the file names always look like this:

Ht_1998_A01_BMAG043_ATGATA.fastq.gz
St_2000_B04_BMAG061_CTGGAA.fastq.gz

Where (splitting fields on underscores, e.g. with cut -d_): f1 = species ID (Hypomesus transpacificus or Spirinchus thaleichthys), f2 = birth year, f3 = well number, f4 = plate number, and f5 = individual barcode. That way I can easily grab or group individuals by any of those fields to run analyses like testing for batch effects. I know all of you know these things, but I think it might be useful to show an example/have a visual to contextualize all of the words.
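A small sketch of how this naming scheme can be used in practice (with the example file names above):

```bash
# Fields are underscore-delimited:
#   f1 = species ID, f2 = birth year, f3 = well, f4 = plate, f5 = barcode
echo "Ht_1998_A01_BMAG043_ATGATA.fastq.gz" | cut -d_ -f4   # prints BMAG043 (plate)

# Count samples per plate, e.g. as a first check for batch effects
ls *.fastq.gz | cut -d_ -f4 | sort | uniq -c

# Or grab all files from one plate
ls *_BMAG043_*.fastq.gz
```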

Page 8

- Figure 1: I really like this! What do the colors mean though? A part of me thinks this may benefit from a sliiiightly simpler analysis for someone who isn't familiar with DAGs, but also it is awesome as is.

Page 9

- Figure 2: Also a great figure!

Page 13

- Quality control your data: The FastQC step won't work for all data types, so it may be worth mentioning that FastQC covers a lot of data but that other data types (such as RAD sequencing data or 10X data) need other evaluation methods (I look for batch effects with PCAs for my RADseq/popgen data, and look at k-mer spectra for signs of contamination in my 10X/genome assembly data; still figuring out the best method for PacBio).
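For the data types where FastQC does apply, a minimal sketch of that step (the directory names are arbitrary):

```bash
# Run FastQC on all gzipped FASTQ files and write the HTML reports to qc/
mkdir -p qc
fastqc --outdir qc raw_data/*.fastq.gz

# Optionally aggregate the per-sample reports into one summary with MultiQC
multiqc qc/ --outdir qc
```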

Page 17

- Paragraph 1: It may be useful to specifically note that genome assemblies won't work with subsampled data. Maybe other analyses as well; I'm not sure which specifically, but at least genome assemblies!
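For context, a hedged sketch of the kind of subsampling I mean, using seqtk (one common tool; sample names are made up). A subset like this is fine for testing read-level steps, but it won't produce a meaningful genome assembly:

```bash
# Randomly subsample 100,000 read pairs for a quick workflow test run;
# use the same seed (-s) for R1 and R2 so mates stay paired
seqtk sample -s42 sample_R1.fastq.gz 100000 > test_R1.fastq
seqtk sample -s42 sample_R2.fastq.gz 100000 > test_R2.fastq
```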

Page 18

- Troubleshooting: how to help yourself and when to get help: People wanting help from Stack Overflow may also look at SeqAnswers or Biostars.
- Same section: People looking for help would also want to write out the exact command they ran that produced the error.
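A minimal sketch of one way to capture the exact command and its error output for a help request (the command itself is a placeholder):

```bash
# Re-run the failing step while saving everything it prints (stdout and
# stderr) to a log file that can be pasted verbatim into a forum post
some_failing_command --threads 8 input.fastq.gz 2>&1 | tee run_error.log
```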

A few more things...

Second to last, I think the early section of the text could briefly mention that there are a lot of options for carrying out streamlined workflows, and that a person doesn't have to implement everything all at once. I've found that learning one thing at a time and diving into it deeply, then moving on to the next new thing, lets me accumulate new skills while still making progress on my project. Individuals should also PICK one piece of software and go with it, rather than trying to learn all of them at once.

Lastly, I think most if not all sections could refer to documents/tutorials/blogs with more information on any of the topics.

taylorreiter commented 4 years ago

@shannonekj would you be willing to give the substantially updated text another read-through and see if we addressed the things you brought up previously? There are a few things you brought up before that I'm not sure fit within the reorganization/workflow emphasis we brought in, and we'd love to hear your opinion given the updated text!

I think we addressed the following comments:

Page 3, first paragraph: The text is a lot of lists of things. I think it might be easier to digest as a table with a bit of detail on each database/repository.

Page 4, Data Management and Organization: I think this section could benefit from some examples. It is one thing to say "think about organization" and a whole other thing to say "here are some ways to think about it" and then give a few methods for organizing data.

Page 5, Data storage and transfer: If you're going to introduce md5sums to readers, it may be useful to include an example md5sum checksum file as a figure, or go into a bit more detail on how an md5sum file is a list of unique file IDs (checksums) paired with file names, so one can verify the contents of each file. Right now it is unclear how an md5sum is different from, or more reliable than, looking at the size column of ls -l. You might also want to explain that someone generating novel sequencing data can specifically ask for an md5sum file from the sequencing center they used.

Page 6, Use consistent and descriptive names: I think this section could use a figure or an example of what a file name might look like. For example, I work with RAD sequencing data from two different smelt species, and after I split my data based on barcode the file names always look like this: Ht_1998_A01_BMAG043_ATGATA.fastq.gz, St_2000_B04_BMAG061_CTGGAA.fastq.gz, where (splitting fields on underscores) f1 = species ID (Hypomesus transpacificus or Spirinchus thaleichthys), f2 = birth year, f3 = well number, f4 = plate number, and f5 = individual barcode. That way I can easily grab or group individuals by any of those fields to run analyses like testing for batch effects. I know all of you know these things, but I think it might be useful to show an example/have a visual to contextualize all of the words.

Page 8, Figure 1: I really like this! What do the colors mean though? A part of me thinks this may benefit from a sliiiightly simpler analysis for someone who isn't familiar with DAGs, but also it is awesome as is.

Page 9, Figure 2: Also a great figure!

Page 13, Quality control your data: The FastQC step won't work for all data types, so it may be worth mentioning that FastQC covers a lot of data but that other data types (such as RAD sequencing data or 10X data) need other evaluation methods (I look for batch effects with PCAs for my RADseq/popgen data, and look at k-mer spectra for signs of contamination in my 10X/genome assembly data; still figuring out the best method for PacBio).

Page 18, Troubleshooting: how to help yourself and when to get help: People wanting help from Stack Overflow may also look at SeqAnswers or Biostars. Same section: People looking for help would also want to write out the exact command they ran that produced the error.

Second to last, I think the early section of the text could briefly mention that there are a lot of options for carrying out streamlined workflows, and that a person doesn't have to implement everything all at once. I've found that learning one thing at a time and diving into it deeply, then moving on to the next new thing, lets me accumulate new skills while still making progress on my project. Individuals should also PICK one piece of software and go with it, rather than trying to learn all of them at once.

Things I don't think we addressed, am not sure we addressed, or am not sure we still need to address:

Page 3

- Table 1: Add Iso-Seq, linked reads (10X and Loop Genomics), and long reads (ONT, PacBio)? Maybe they fall under eukaryotic sequencing, but this table could also hold more information about some of the main sequencing technologies (I am happy to provide anything for eukaryotic genome assembly and pop gen if you'd like).
- Table 1: "rad seq" should be RAD sequencing. There are many flavors of RADseq, such as ddRAD, 2bRAD, and hyRAD (this blog goes through the RAD dramas: https://www.molecularecologist.com/2017/04/to-radseq-or-not-to-radseq/).
- The Vertebrate Genomes Project has a database of phased chromosome-level assemblies called the GenomeArk, where they host all of their data (raw sequence files, intermediate scaffolded files, and the final reference assembly). Could be good to add!
- Paragraph 2: Other factors to consider are which individuals to include (this is crucial in pop gen when different individuals belong to different populations) or which individual to use (in eukaryotic genome assembly you'd want the heterogametic sex).

Page 4

- Paragraph 1: Another thing to record is the machine that sequenced the data, the sequencing well number, and the sequencing plate number (we've seen batch effects for both plate and well in my lab). Ecology-focused experiments may also want to record site location, specimen age, and the sex of the specimen.
- Table 2: I like this list, but if I read it as a consumer I'd probably want to know which are the best or easiest to use, or most capable/popular (i.e. which I'd be able to find assistance for from the scientific community/forums).
- Paragraph 2: I think this is a really valuable paragraph and it might benefit from a figure or table to show exactly how a project's data can expand in size and what one might expect in a few sample scenarios (see a really rough table here: https://github.com/ngs-docs/2020-GGG298/tree/master/Week5-project_organization_and_UNIX_shell#storing-data). Also, it may be useful to touch on the amount of time one expects to be interacting with data: weeks? months? years?

Page 17, Paragraph 1: It may be useful to specifically note that genome assemblies won't work with subsampled data. Maybe other analyses as well; I'm not sure which specifically, but at least genome assemblies!

shannonekj commented 4 years ago

Yes @taylorreiter I would be happy to!

bluegenes commented 4 years ago

FYI @shannonekj the paper has been moved over to the dib-lab organization. New manuscript html link: https://dib-lab.github.io/2020-workflows-paper

shannonekj commented 4 years ago

Comments round 1 have been updated or resolved. Closing issue!