DimmestP / chimera_project_manuscript

1 stars 2 forks source link

Draft plan for QuantSeq analysis #127

Closed ewallace closed 2 years ago

ewallace commented 3 years ago

Draft plan for QuantSeq analysis, as very briefly discussed with @daneckaw and @DimmestP on 20th May meeting.

Here is my brain dump draft of what I would like to see. Let's discuss, adjust, agree on a better plan.

Goals:

  1. Check that the RNA-seq actually worked, i.e. all necessary quality controls.
  2. Quantify the impact of promoters and motifs on polyadenylation site. (major goal)
  3. Use RNA-seq as independent readout of construct expression levels. (minor goal)

Ideally, we would also develop a Quant-Seq rev pipeline that can be re-used for future work, but it's more important to have one that works and is useable for multiple people involved in this present paper.

Tasks

-1. Download data, verify checksums, write protect, backup

  1. Make a git repository to contain just the QS analysis.
  2. Decide on alignment (fasta file + gff file) strategy.
  3. Process reads as much as possible in a single workflow script, through bam files to read counts and maybe even 3' ends?
  4. QC of gross features: assess total reads, aligned reads, adapter and overrepresented reads per sample.
  5. QC of read counts: are RNA-seq reads for genomic reads reproducible per sample?
  6. Plot normalized read counts on reporter constructs across samples. Are we detecting the right constructs in (only) the right samples?
  7. Plot polyadenylation sites on reporter constructs and some control genes.
  8. Submit to gene expression omnibus.

-1. Download data, verify checksums, write protect, backup

i. Download data as zip file to datastore/wallace_rna/bigdata/fastq folder. ii. Keep this zip file. iii. Unzip a copy of the zip file, and md5 checksum the fastq contents. iv. Verify that the files have downloaded correctly by comparing the new md5 sums with the copy originally unzipped. If they don't match, delete the zip file and download again (step i.). When the checksums match for all fastq files, move to next step. v. Make the zip file read-only with chmod. vi. Add zip file name and metadata to datastore/wallace_rna/bigdata/fastq/README.md vii. Make a backup to another location, just in extra case. These data are expensive and irrepleaceable. Note down where.

These instructions need to be added to hopefully agree with those in the manual about where data belongs. Let's double-check.

0. Make a git repo to contain just the QS analysis

It should be clearly organised and accessible to all of us working on it. We can figure out how/if to blend into this repo (chimera_project_manuscript) afterwards.

1. Decide on alignment (fasta file + gff file) strategy.

The point is that we have different sequences for each construct, so either we have to construct one fasta file that we align all the samples to, or we have to align (almost) every sample to a different fasta file, which sounds worse. What did @daneckaw do for the 5P-seq? I don't think documentation was shared, that is issue #66.

I would try first doing a genomic-augmented alignment with a fasta file containing:

That requires a matched gff file containing:

Alternatively it might be a good idea to align reads just to a transcriptome fasta constructed similarly; then it's easier to count ambiguous reads with salmon after alignment. I don't know. But it would still be important to align to the genome at some point for QC.

A reminder that my "best I could come up with" version of the yeast transcriptome is at yeastutrgff.

2. Process reads as much as possible in a single workflow script, through bam files to read counts and maybe even 3' ends?

Life will be so much easier if we can put this in a workflow script early. I'm a fan of nextflow. My recent "best attempt" for single-end quant-seq fwd is here: CryptoGat201RNASeq_draft/src/quantseqfwd.nf. (That means that it should run on the read 2-only for the QSrev files.) This does a number of nice things including automatically generating a multiQC report for fastqc reads and (hisat2) alignment statistics.

Note: the parameters params.featuretype, params.featurename may need to be adjusted to refer to the feature names and types actually used in the gff file.

The nextflow nf-core rnaseq pipeline covers paired-end reads, but is written in nextflow DSL 2.0 which last time I looked was badly documented or maybe I just didn't understand it. Version 1.4.2 of the nf-core rnaseq pipeline is in the previous version of nextflow.

In short, it would be extremely useful to have a pipeline script, ideally in nextflow, that:

3. QC of gross features: assess total reads, aligned reads, adapter and overrepresented reads per sample.

If we include multiQC in the pipeline then it's only a short step to having a supplementary table of read statistics per sample. And the multiQC report to check overrep seqs etc. Let's try that.

4. QC of read counts: are RNA-seq reads for genomic reads reproducible per sample?

A heatmap of correlations between all the samples would be great. We could also try checking in more detail with GGally::ggpairs(). At least a few log-log scatter plots would be great.

5. Plot normalized read counts on reporter constructs across samples. Are we detecting the right constructs in (only) the right samples?

Ideally our alignment strategy and pipeline produces a matrix with the reads on every native yeast gene, and URA3, and every construct. And the scatter plot/ heatmap shows that overall gene-by-gene counts are conserved. Then we should show the normalized counts for each reporter construct, hoping that we don't detect any of them in the negative control (POT1-ccdb) and we detect the variants only in the sample they come from.

For normalizing the reads, I would try in this order:

Either DESeq2 or salmon could also give us a significance value for fold-changes on the constructs, but that might require some re-naming of the construct column as for DE purposes we want to treat them as the same, but for alignment and QC purposes we want to treat (some of) them as different.

Output should be a plot with x-axis sample ID and y-axis the normalized counts. Probably different shapes for the bioreps? And we should make a couple of comparison plots on control genes to check their stability with the same metric.

We need to take extra care to find out if reads on native RPS3 (in POT1-ccdb negative control sample) align to reporter constructs, in addition to looking for cross-alignments.

At some point here or earlier in the QC, we are likely to find out that the alignment parameters need to be changed for one reason or another. We might even have to change the alignment strategy. So we will be extra-grateful that we can rerun the whole pipeline in a single script.

6. Plot polyadenylation sites on reporter constructs and some control genes.

First check this visually by genome browser plot, probably also with the .bam file reads, to see in all its gory glory.

For the paper, I love the cumulative distribution plots that @daneckaw made for the 5Pseq with x-axis the position of the polyadenylation site near the stop codon, and y-axis the cumulative proportion of 3' ends of reads. Was that made from the bamfile in R or did you do something like a bedtools genomecov -3 first? See genomecov documentation.

Ideally, code to produce the data frame for the cumulative distribution plot would be wrapped in some nice reusable function.

So we would like those plots for:

7. Submit to gene expression omnibus

Submit to GEO as soon as possible before we forget anything.

Next steps

DimmestP commented 3 years ago

Initial work plan: Weronika will download the reads when they are ready and do the checksum. The data will be on datastore and will be mountable on bifx.

Sam will then create the initial nextflow pipeline and run the alignment and quality control.

Weronika and Sam then will complete the final QC and produce the final graphs comparing polyA sites across constructs.