datacarpentry / genomics-workshop

Genomics Workshop Overview
https://datacarpentry.org/genomics-workshop

Workshop overhaul #53

Closed taylorreiter closed 5 years ago

taylorreiter commented 6 years ago

We propose a change to the current lesson. The changes were born out of a recent DC Genomics workshop at UC Davis, and conversations and brainstorming sessions that occurred at CarpentryConnect West. These changes reflect conversations with @fpsom @crazyhottommy @ryanpeek @shannonekj @raynamharris @AstrobioMike @abostroem @perisateesh @tomsing1 @jthmiller @reedacartwright @tracykteal @Joiry and @adamjorr. They also reflect some changes suggested by @bluegenes (#41) and @standage (#42).

We welcome more community input as we move forward! We have forked DC Genomics repos to github.com/data-lessons, and will be developing there.

https://github.com/data-lessons/shell-genomics https://github.com/data-lessons/cloud-genomics https://github.com/data-lessons/organization-genomics https://github.com/data-lessons/wrangling-genomics

We propose:

Day 1: Introduction to command line for bioinformatics

Day 2: Genomics Workflow

Additional suggestions

JasonJWilliamsNY commented 6 years ago

How will these changes reflect the forthcoming R lessons? The R maintainers have been condensing everything into a one-day workshop that is paired with a one-day Unix workshop. I have taught 10+ Genomics workshops, always in the one-day R/one-day Unix format.

taylorreiter commented 6 years ago

@JasonJWilliamsNY I don't think there will be any direct conflicts with the forthcoming R lessons. One potential conflict is that we are proposing to move the discussion of data tidiness to the beginning of the second day of this workshop instead of at the beginning of the first day.

The re-write of the unix lesson will use a metadata file derived from this supplement to demonstrate some of the commands, as well as the REL606 fasta file. Although the metadata file I am proposing we use is different than the one that is used in the R lessons, they are both relevant to REL606 and the E. coli story. Do you foresee the use of this other metadata file creating issues or confusion?

Instead of telling the cit+/- phenotype story, I am planning to interweave the story of hypermutability into the variant calling workflow. This is an especially rewarding biological story for variant calling: the hypermutable strains accumulate mutations much more quickly than the other strains, and this can be observed in the number of variants called in the VCF file, using the commands learned during the shell lesson.
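As a rough illustration of the comparison described above, here is a minimal Python sketch that counts variant records in a VCF file. It is not part of the lesson, which uses shell commands for this step. The function name `count_variants` and the file names in the usage note are hypothetical. The sketch assumes only that VCF header lines begin with `#` and that every other non-blank line is one variant record.

```python
import gzip
from pathlib import Path


def count_variants(vcf_path):
    """Count variant records (non-header, non-blank lines) in a VCF file.

    Hypothetical helper for illustration; accepts plain or gzip-compressed VCF.
    """
    path = Path(vcf_path)
    # Choose the opener from the file extension (assumption: ".gz" means gzip).
    opener = gzip.open if path.suffix == ".gz" else open
    count = 0
    with opener(path, "rt") as handle:
        for line in handle:
            # Header and metadata lines start with '#'; skip them and blank lines.
            if line.strip() and not line.startswith("#"):
                count += 1
    return count
```

Calling `count_variants("mutator.vcf")` and `count_variants("non_mutator.vcf")` on two hypothetical output files would give the two counts to compare. If the hypermutability story holds, the mutator sample would be expected to show the larger count.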

naupaka commented 6 years ago

@taylorreiter @JasonJWilliamsNY One of the starting points for our lessons is the set of VCF files from the Lenski data. Is the pipeline script that produces those files going to change substantially, or just the interpretation of the calls?

taylorreiter commented 6 years ago

@JasonJWilliamsNY @naupaka The pipeline is being updated to more recent versions of the tools, and the input files are changing to longer reads (~150bp).

So far we have selected these SRR files. The number of propagations is in parentheses:

ARA+3 (non mutator)
SRR2588658 (500)
SRR2584668 (500)
SRR2584669 (1000)
SRR2591034 (1000)

ARA-3 (mutator)
SRR2584683 (20000)
SRR2584684 (20000)
SRR2584685 (30000)
SRR2588848 (30000)

I see that your lessons rely on the population designated Ara-3, so hopefully, even though the calls in the VCF file will likely be different, this will not impact the narrative and code for your lessons. I have not produced the new VCF file yet, but it is on my list of things to do in the next day or two. I can attach it here if that will be helpful!

We had talked about subsampling our trimmed reads to one gene instead of to 3x coverage, as an alternate way of making the pipeline run faster that would also render better in samtools tview. Would this impact your lesson? We have not implemented this yet, so it would be easy to leave out.

naupaka commented 6 years ago

It does sound like some of these changes will alter the lessons we are developing in terms of specifics, even if not in overall structure and flow. When will those VCF decisions get finalized? We should be able to work with whatever you all decide on, but we can't move forward until then on the parts of our lessons that are based on analyzing and visualizing those VCF results.

JasonJWilliamsNY commented 6 years ago

@naupaka @taylorreiter my overall concern is this: if we are working towards a two-day genomics workshop with one full day of R, are you working the Unix lessons into a one-day format? This is actually a big decision, and maybe we need to check with the curriculum committee.

ErinBecker commented 6 years ago

@taylorreiter @naupaka @JasonJWilliamsNY - I'm working on organizing the agenda for the CAC meetings on the 24th and 25th and wanted to try to get some clarification on this proposed reorganization of the workshop.

The currently published Genomics workshop includes project organization and management, intro to the command line, data wrangling and processing, and intro to cloud computing. It is two days long and includes NO R.

From my understanding, the curriculum that @taylorreiter proposed above rearranges and makes significant changes to the existing workshop materials, but does NOT add any R content. It would stay a two-day long workshop.

The Genomics R Maintainers (including @JasonJWilliamsNY and @naupaka) have been working on putting together a curriculum for Genomics work with R. I was under the impression that this was meant to be a two day workshop, which included no (or very little) Unix, and was completely independent of the existing curriculum (in the sense of being able to be offered as a stand-alone workshop).

If I'm misunderstanding anything, please let me know. I'd like to make sure the agenda I put in front of the CAC is accurate.

naupaka commented 6 years ago

My understanding was that our plan was to have the R materials be the second day of a two day workshop - the idea was that we would start with the VCF file that was produced at the end of the first day. That way the learners attending the workshop get to go through the whole process from raw data to report, instead of stopping part way through the process.

taylorreiter commented 6 years ago

@ErinBecker your understanding of the proposed changes is correct. We do not propose to add any R material, and propose to make significant changes to existing material. As @naupaka points out, the conflict arises in that we have proposed to change the dataset we are using to one with longer, paired-end reads, which would change the VCF file that is the output of the genomics lesson and acts as input to any subsequent R lessons. We have a beta version of the lesson that would create this new VCF file here: https://github.com/data-lessons/wrangling-genomics

We plan to update the shell/cloud/organization lessons soon.

mdehollander commented 5 years ago

I noticed the Curriculum Advisory Committee discussed this topic (https://github.com/datacarpentry/curriculum-advisors/blob/master/genomics/september-2018-genomics-minutes.md). Since we are planning to organise a Genomics Carpentry event in the Netherlands, I am interested to know the current status. Especially about this part in the minutes:

Consensus to move forward with new proposed dataset and tools, provided support from community members who proposed and/or Maintainers for Shell/Wrangling lessons.

Is there any current activity? Where can I follow the progress? Here? If I want to contribute, what is the best way?

naupaka commented 5 years ago

Current R lessons are in progress at https://github.com/carpentrieslab/genomics-r-intro, but are not yet even ready for an alpha release. I believe our target is to have the parts at least drafted by ~January or so. There is the current release version here, but that does not include R at the moment.

ErinBecker commented 5 years ago

These changes have been implemented and are now live!