Workshop overhaul - Githubissues

taylorreiter commented 6 years ago

We propose a change to the current lesson. The changes were born out of a recent DC Genomics workshop at UC Davis, and conversations and brainstorming sessions that occurred at CarpentryConnect West. These changes reflect conversations with @fpsom @crazyhottommy @ryanpeek @shannonekj @raynamharris @AstrobioMike @abostroem @perisateesh @tomsing1 @jthmiller @reedacartwright @tracykteal @Joiry and @adamjorr. It also reflects some changes suggested by @bluegenes (#41) and @standage (#42).

We welcome more community input as we move forward! We have forked DC Genomics repos to github.com/data-lessons, and will be developing there.

https://github.com/data-lessons/shell-genomics
https://github.com/data-lessons/cloud-genomics
https://github.com/data-lessons/organization-genomics
https://github.com/data-lessons/wrangling-genomics

We propose:

Day 1: Introduction to command line for bioinformatics

Why shell? (use tools, automate)
Why cloud computing? (more space; also note that you need the shell to compute on the cloud)
- remove "choosing a cloud section"
- NB this section will be quite short.
Cloud Genomics, Episode 2: Logging onto cloud
- Talk about command structure when sshing
Shell Genomics - on cloud, written around a text file. This could be the metadata file, that we reveal later. It could include all 2,443 Lenski samples. Meta-data here. Include fasta file as well.
- Episode 3 needs a rewrite. We think we need to cover cd, rm, head, tail, cat, print, mv, cp, grep, wc, less, man, scp (teach with cp), curl
  - Show grep by grepping for our 6 samples.
- We think this could be named "Exploring the Shell"
- @tomsing1 pirate treasure hunt to demonstrate folder structure in a rewarding way
- Add optional episode that includes cut, paste, sort, uniq, awk
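
A minimal sketch of the grep demonstration proposed above. The file name `metadata.tsv` and its three-column layout are assumptions made for illustration, not the real Lenski metadata; the accession numbers are ones mentioned later in this thread.

```shell
# Work in a scratch directory so nothing in the learner's home is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Toy stand-in for the metadata file (hypothetical layout: run, population, propagation).
printf 'run\tpopulation\tpropagation\n' > metadata.tsv
printf 'SRR2584668\tARA+3\t500\n' >> metadata.tsv
printf 'SRR2584669\tARA+3\t1000\n' >> metadata.tsv
printf 'SRR2584683\tARA-3\t20000\n' >> metadata.tsv
printf 'SRR2584685\tARA-3\t30000\n' >> metadata.tsv

# Peek at the top of the file, then count its lines (header included).
head -n 2 metadata.tsv
wc -l metadata.tsv

# Pull out one sample by accession, then every ARA-3 row.
# -F treats the pattern as a fixed string rather than a regular expression.
grep 'SRR2584683' metadata.tsv
grep -F 'ARA-3' metadata.tsv

# -c prints the number of matching lines instead of the lines themselves.
ara_minus_count=$(grep -c -F 'ARA-3' metadata.tsv)
echo "ARA-3 rows: $ara_minus_count"
```

With the real metadata file, the same pattern would let learners find the workshop samples among all 2,443 rows.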
Shell Genomics, Episode 4: Pipes & Redirection
- right now includes >, |, sort, wc, and uniq, cut, paste
- We think it should only include > and |
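
A short sketch of an episode limited to `>` and `|`. The file `toy.fasta` and its sequences are made up for illustration; the lesson itself would use the real fasta file.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Toy FASTA file with three records (sequence content is invented).
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > toy.fasta

# ">" sends a command's output to a file instead of the screen.
grep '>' toy.fasta > headers.txt

# "|" feeds one command's output to the next command as its input.
n_records=$(grep '>' toy.fasta | wc -l)
echo "records: $n_records"
```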
Shell Genomics, Episode 05: Writing Scripts.
- Change name to "Writing For loops & scripts"
- Don't write a script using history.
- Write the script in nano
- Modify to include for loop, addressing variables ($) and arguments ($1)
- Also use print in the for loop, like @ctb's Beginner Unix lesson.
- For non-novice learners optional: Introduce tmux/screen, perhaps with for loops.
- Consider making two episodes
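
A rough sketch of the kind of script this episode could build toward, combining a for loop, a variable (`$filename`), and an argument (`$1`). The script name `list_files.sh` and the sample file names are hypothetical. In the lesson the script would be typed into nano; a here-document is used here only so the example is self-contained, and `echo` does the printing.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Three empty placeholder files to loop over (names are made up).
touch sampleA.fastq sampleB.fastq sampleC.fastq

# Create the script. The quoted 'EOF' keeps $1 and $filename from being
# expanded now, so they are written into the script literally.
cat > list_files.sh <<'EOF'
#!/usr/bin/env bash
# $1 is the first argument given on the command line: a file extension.
extension=$1
for filename in *."$extension"
do
    # $filename takes the name of each matching file in turn.
    echo "found: $filename"
done
EOF

# Run the script, passing "fastq" as the argument.
loop_output=$(bash list_files.sh fastq)
echo "$loop_output"
```

Printing inside the loop lets learners see what the variable holds on each pass before they swap in a real command.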

Day 02: Genomics Workflow

Move Shell Genomics, Episode 6: Project Organization to Wrangling Genomics: Variant Calling Workflow
Project organization and management
- Data Tidiness
- nix formal Genomics Organization, Episode 02: Planning for NGS Projects, roll this info into data tidiness where we will have a more relevant spreadsheet to work with
Download data instead of moving from hidden files. Download a subsampled dataset that we post on figshare. Note that figshare is acting as our backup.
- 90% of the chromosome can be thrown out, keeping high coverage of the remaining 10%
Wrangling Genomics, Episode 01: Assessing Read Quality
- Back up plan with cyberduck/filezilla
Wrangling Genomics, Episode 02: Trimming and Filtering
- Add a section to show the Trimmomatic manual.
Wrangling Genomics, Episode 03: Variant Calling Workflow
- add information on all of the flags used in the different commands
Wrangling Genomics, Episode 04: Automating a Variant Calling Workflow
- Only live code the "Automating QCing" section.
- Allow the learners to download a full automated script to look at
Move "Genomics Organization, Episode 04: Examining Data on the NCBI SRA Database" to the end of Day 2, and include other resources. Demonstrate finding the SRR accession number in the paper, searching for it in the ENA, and downloading a fastq file with curl.
- This is also nice bc people are tired at the end of day, and we can give them goodies here :)

Additional suggestions

Use Git Bash instead of PuTTY. Include pasting instructions for Git Bash, and note that open and man don't work in Git Bash. Relates to https://github.com/datacarpentry/genomics-workshop/issues/41
Change the dataset to longer reads (~150bp) from the Lenski lab, as suggested in #42 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4988878/).
We would like to add a project narrative that includes details of the Lenski experiment.

JasonJWilliamsNY commented 6 years ago

How will these changes reflect the forthcoming R lessons? The R maintainers have been condensing everything into a one-day workshop that is paired with a one-day Unix workshop. I have taught 10+ Genomics workshops, always in the one-day R/one-day Unix format.

taylorreiter commented 6 years ago

@JasonJWilliamsNY I don't think there will be any direct conflicts with the forthcoming R lessons. One potential conflict is that we are proposing to move the discussion of data tidiness to the beginning of the second day of this workshop instead of at the beginning of the first day.

The re-write of the unix lesson will use a metadata file derived from this supplement to demonstrate some of the commands, as well as the REL606 fasta file. Although the metadata file I am proposing we use is different than the one that is used in the R lessons, they are both relevant to REL606 and the E. coli story. Do you foresee the use of this other metadata file creating issues or confusion?

Instead of telling the cit+/- phenotype story, I am planning to interweave the story of hypermutability during the variant calling workflow. This is an especially rewarding biological story for variant calling: the hypermutable strains accumulate mutations much more quickly than the other strains, which can be observed in the number of variants called in the vcf file using the commands learned during the shell lesson.
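
As a sketch of what that comparison could look like with shell commands alone: the two VCF files below are tiny invented stand-ins (file names and records are assumptions, not pipeline output), and the count relies only on the fact that VCF header lines start with `#`.

```shell
# Work in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir"

# Two toy VCF files. Lines starting with "#" are header lines;
# every other line is one called variant. All records are invented.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > non_mutator.vcf
printf 'chr\t100\t.\tA\tG\t50\tPASS\t.\n' >> non_mutator.vcf

printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > mutator.vcf
printf 'chr\t100\t.\tA\tG\t50\tPASS\t.\n' >> mutator.vcf
printf 'chr\t250\t.\tC\tT\t50\tPASS\t.\n' >> mutator.vcf
printf 'chr\t900\t.\tG\tA\t50\tPASS\t.\n' >> mutator.vcf

# Count the non-header lines in each file: -v inverts the match,
# so grep keeps only lines that do NOT start with "#".
for vcf in non_mutator.vcf mutator.vcf
do
    n_variants=$(grep -v '^#' "$vcf" | wc -l)
    echo "$vcf: $n_variants variants"
done

# The same count without a pipe, using grep's -c option.
mutator_count=$(grep -c -v '^#' mutator.vcf)
```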

naupaka commented 6 years ago

@taylorreiter @JasonJWilliamsNY One of the starting points for our lessons is the set of VCF files from the Lenski data. Is the pipeline script that produces those files going to change substantially, or just the interpretation of the calls?

taylorreiter commented 6 years ago

@JasonJWilliamsNY @naupaka The pipeline is being updated to more recent versions of the tools, and the input files are changing to longer reads (~150bp).

So far we have selected these SRR files. Propagation numbers are in parentheses:

ARA+3 (non-mutator): SRR2588658 (500), SRR2584668 (500), SRR2584669 (1000), SRR2591034 (1000)

ARA-3 (mutator): SRR2584683 (20000), SRR2584684 (20000), SRR2584685 (30000), SRR2588848 (30000)

I see that your lessons rely on designated Ara-3, so hopefully even though the calls from the vcf file will likely be different, it will not impact the narrative and code for your lessons. I have not produced the new vcf file yet, but it is on my list of things to do in the next day or two. I can attach it here if that will be helpful!

We had talked about subsampling our trimmed reads to one gene instead of to 3x coverage, as an alternate way of making the pipeline run faster that would also render better in samtools tview. Would this impact your lesson? We have not implemented this yet, so it would be easy not to.
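
For scale, a rough sketch of how many reads a 3x subsample implies. The genome size (about 4.6 Mbp for REL606) is an approximation supplied here, the ~150bp read length comes from this thread, and the arithmetic counts single reads. Taking reads from the top of a file with `head` is only one simple way to subsample, not necessarily what the pipeline does; `toy.fastq` is an invented file.

```shell
genome_size=4600000   # approximate REL606 genome size in bp (assumption)
read_length=150       # read length in bp, as discussed in this thread
target_coverage=3

# coverage = reads * read_length / genome_size, solved for reads
reads_needed=$(( target_coverage * genome_size / read_length ))
echo "reads needed: $reads_needed"

# Each FASTQ record is 4 lines, so the first N reads are the first 4*N lines.
workdir=$(mktemp -d)
cd "$workdir"
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n' > toy.fastq
keep=2
head -n $(( keep * 4 )) toy.fastq > subsample.fastq
```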

naupaka commented 6 years ago

It does sound like some of these changes will alter the lessons we are developing in terms of specifics, even if not in overall structure and flow. When will those VCF decisions get finalized? We should be able to work with whatever you all decide on, but we can't move forward until then on the parts of our lessons that are based on analyzing and visualizing those VCF results.

JasonJWilliamsNY commented 6 years ago

@naupaka @taylorreiter my overall concern is that if we are working towards a two-day genomics workshop with one full-day of R, are you working the Unix lessons into a one day format? This is actually a big decision and maybe we need to check with curriculum committee.

ErinBecker commented 6 years ago

@taylorreiter @naupaka @JasonJWilliamsNY - I'm working on organizing the agenda for the CAC meetings on the 24th and 25th and wanted to try to get some clarification on this proposed reorganization of the workshop.

The currently published Genomics workshop includes project organization and management, intro to the command line, data wrangling and processing, and intro to cloud computing. It is two days long and includes NO R.

From my understanding, the curriculum that @taylorreiter proposed above rearranges and makes significant changes to the existing workshop materials, but does NOT add any R content. It would stay a two-day long workshop.

The Genomics R Maintainers (including @JasonJWilliamsNY and @naupaka) have been working on putting together a curriculum for Genomics work with R. I was under the impression that this was meant to be a two day workshop, which included no (or very little) Unix, and was completely independent of the existing curriculum (in the sense of being able to be offered as a stand-alone workshop).

If I'm misunderstanding anything, please let me know. I'd like to make sure the agenda I put in front of the CAC is accurate.

naupaka commented 6 years ago

My understanding was that our plan was to have the R materials be the second day of a two day workshop - the idea was that we would start with the VCF file that was produced at the end of the first day. That way the learners attending the workshop get to go through the whole process from raw data to report, instead of stopping part way through the process.

taylorreiter commented 6 years ago

@ErinBecker your understanding of the proposed changes is correct. We do not propose to add any R material, and we do propose to make significant changes to existing material. As @naupaka points out, the conflict arises because we have proposed to change the dataset we are using to one with longer, paired-end reads, which would change the VCF file that is the output of the genomics lesson and acts as input to any subsequent R lessons. We have a beta version of the lesson that would create this new vcf file here: https://github.com/data-lessons/wrangling-genomics

We plan to update the shell/cloud/organization lessons soon.

mdehollander commented 5 years ago

I noticed the Curriculum Advisory Committee discussed this topic (https://github.com/datacarpentry/curriculum-advisors/blob/master/genomics/september-2018-genomics-minutes.md). Since we are planning to organise a Genomics Carpentry event in the Netherlands, I am interested to know the current status. Especially about this part in the minutes:

Consensus to move forward with new proposed dataset and tools, provided support from community members who proposed and/or Maintainers for Shell/Wrangling lessons.

Is there any current activity? Where can I follow the progress? Here? If I want to contribute, what is the best way?

naupaka commented 5 years ago

Current R lessons are in progress at https://github.com/carpentrieslab/genomics-r-intro, but are not yet even ready for an alpha release. I believe our target is to have the parts at least drafted by ~January or so. There is the current release version here, but that does not include R at the moment.

ErinBecker commented 5 years ago

These changes have been implemented and are now live!

datacarpentry / genomics-workshop

Workshop overhaul #53
