Standard workflow for the "Introduction to RNA-seq" episode

almeidasilvaf commented 1 year ago

Hello, everyone.

The Introduction to RNA-seq episode is currently empty, so I would like to contribute to it.

As far as I understand, this episode should contain instructions on how to go from raw FASTQ files to a matrix of transcript abundances, including pre-processing steps (e.g., sequence QC and trimming of adapters and low-quality sequences).

However, as there are several software options for each step of the pipeline, I think we should first agree on a workflow. I think we can build on the Bioc workflow package rnaseqGene. My suggested workflow would be:

QC, trimming and filtering with fastp
Quantification of transcript abundances with salmon
Data import with tximport - maybe this could go to the beginning of the Importing and annotating quantified data into R episode?
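
To make the first two steps concrete, here is a minimal sketch of the command lines they might involve, assuming paired-end reads. The sample name, file paths, and thread count are hypothetical placeholders, and the helper functions are only for illustration; the fastp and salmon options used (`-i/-I/-o/-O`, `--json`, `--html`, `--thread`, and `salmon index -t/-i`, `salmon quant -i/-l/-1/-2/-p/-o`) should be checked against the installed versions. The script only builds and prints the commands, it does not run them.

```python
import shlex
from pathlib import PurePosixPath


def fastp_command(sample, r1, r2, out_dir, threads=4):
    """Build a fastp call for paired-end QC, adapter trimming and filtering."""
    out = PurePosixPath(out_dir)
    return [
        "fastp",
        "-i", str(r1), "-I", str(r2),                      # raw read pairs
        "-o", str(out / f"{sample}_1.trimmed.fastq.gz"),   # trimmed read 1
        "-O", str(out / f"{sample}_2.trimmed.fastq.gz"),   # trimmed read 2
        "--json", str(out / f"{sample}.fastp.json"),       # QC report (JSON)
        "--html", str(out / f"{sample}.fastp.html"),       # QC report (HTML)
        "--thread", str(threads),
    ]


def salmon_index_command(transcripts_fasta, index_dir):
    """Build the salmon call that indexes a reference transcriptome."""
    return ["salmon", "index", "-t", str(transcripts_fasta), "-i", str(index_dir)]


def salmon_quant_command(index_dir, r1, r2, out_dir, threads=4):
    """Build the salmon call that quantifies one paired-end sample."""
    return [
        "salmon", "quant",
        "-i", str(index_dir),
        "-l", "A",                   # let salmon infer the library type
        "-1", str(r1), "-2", str(r2),
        "-p", str(threads),
        "-o", str(out_dir),
    ]


if __name__ == "__main__":
    # Hypothetical sample and paths, for illustration only.
    sample = "sampleA"
    commands = [
        fastp_command(sample, "reads/sampleA_1.fastq.gz",
                      "reads/sampleA_2.fastq.gz", "trimmed"),
        salmon_index_command("reference/transcripts.fa", "salmon_index"),
        salmon_quant_command("salmon_index",
                             "trimmed/sampleA_1.trimmed.fastq.gz",
                             "trimmed/sampleA_2.trimmed.fastq.gz",
                             "quants/sampleA"),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
```

To actually execute a command, each list could be passed to `subprocess.run(cmd, check=True)`. The import step with tximport or tximeta happens in R and is not covered by this sketch.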

I'd love to hear what you all think.

Best, Fabricio

csoneson commented 1 year ago

Hi Fabricio,

thanks a lot - that would be great! On my side, I agree with your suggested workflow (my preference though would be to directly use tximeta rather than tximport as it involves less downstream fiddling to get the object into a suitable shape for further analysis). It would probably also be good to have a short introduction on the biological/technological side - describing what we are actually measuring with RNA-seq.

almeidasilvaf commented 1 year ago

Thank you for your feedback, Charlotte.

Indeed, tximeta would be better. I will see what I can do to avoid having to download huge FASTQ files from ENA to use in the episode. I think there might be some nice FASTQ files on ExperimentHub that I can use.

I will also try to write a short intro on what RNA-seq is.

jdrnevich commented 1 year ago

Hi Fabricio,

Thank you for your interest in helping to develop the rnaseq workshop materials. They definitely need it!

One issue almost everyone struggles with in RNA-seq analysis is that the first part (trimming, alignment and count generation) is typically done outside of R (although this isn't absolute) and requires more computing resources than a laptop has. The second part (QC, statistical analyses and data mining) can very easily be done in R on a laptop.

So how can we teach the first part within the scope of a 2-day Carpentries workshop? I am not sure it is possible to do so fully in the time frame provided. Many workshops/workflows/vignettes (including rnaseqGene, https://bioconductor.org/packages/release/workflows/html/rnaseqGene.html, that you cited) only give examples of code that can be used on a cluster, but do not go through how to actually use it. There is also the issue of cluster resources: some people, like me, have access to institutional resources, but each institution will have very specific ways to access them, different schedulers, and may or may not already have the necessary software installed. There is also the possibility of doing it all on AWS like the Data Carpentry Genomics Curriculum does. However, a scheduled workshop will already have the AWS instance set up with all software installed, so learners do not learn how to do it on their own. The setup instructions (https://datacarpentry.org/genomics-workshop/setup.html) do go through how to set up your own AWS instance, with very detailed instructions here: https://datacarpentry.org/genomics-workshop/AMI-setup/. They also cover ways to run things on your own local machine (macOS or Linux only), but say nothing about memory/processing requirements. And both of these, IMO, would be very difficult for a beginner to actually do on their own.

How were you thinking of incorporating fastp and salmon into the current workshop? We could just do what others do: talk about the issues/things to be aware of and give examples of code that could be run, but not actually try to run it in the workshop. This is still valuable knowledge and maybe all we can do in this context.

I have long had the idea to develop a practical workshop that goes through actually setting up an AWS instance, getting the required software, uploading FASTQ files, running trimming + quantification, and downloading counts. This would fit in the spirit of democratization of bioinformatics, although in practice it would require people to have access to a credit card. But I have very little experience with AWS myself, because I have access to a cluster where someone else installs all the software I need, and that is what I teach others at my institution to use.

I tagged you on a Slack thread we had yesterday discussing this very issue. I would love to continue this discussion there, over email and/or in the bioc-teaching monthly calls (although I will not be able to attend next Monday due to the holiday here). We should also get the new Carpentries instructors involved, especially those not in Westernized countries, on how these skills can be taught locally.

I look forward to future discussions on this! Jenny

In reply to: Fabrício Almeida-Silva, Friday, January 13, 2023 7:15 AM. Subject: Re: [carpentries-incubator/bioc-rnaseq] Standard workflow for the "Introduction to RNA-seq" episode (Issue #16)

almeidasilvaf commented 1 year ago

Thank you for bringing these points up, Jenny.

fastp and salmon can be run on a laptop without problems (I myself have done it on an Ubuntu laptop with 8 GB RAM). The issue here might be compatibility with multiple platforms. I have not tried installing fastp and salmon on Windows and macOS, so I'm not sure if that would be an issue. I will try that asap and let you know.
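
As a rough way to check a given machine before a workshop, the sketch below reports the operating system and whether the two tools are discoverable on the PATH. It assumes the executables are named `fastp` and `salmon`, and `find_tools` is a hypothetical helper written for this illustration. It only tests whether the tools are installed and visible, not whether they run correctly on that platform.

```python
import platform
import shutil


def find_tools(tools=("fastp", "salmon")):
    """Map each tool name to its executable path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    # Report the platform so results from different laptops can be compared.
    print(f"Platform: {platform.system()} {platform.machine()}")
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```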

I saw the discussion on Slack, and I will try to think of solutions to this issue, from using Orchestra to Desktop.

carpentries-incubator / bioc-rnaseq

Standard workflow for the "Introduction to RNA-seq" episode #16