Draft plan for QuantSeq analysis, as very briefly discussed with @daneckaw and @DimmestP on 20th May meeting.

Here is my brain dump draft of what I would like to see. Let's discuss, adjust, agree on a better plan.

Goals:

Check that the RNA-seq actually worked, i.e. all necessary quality controls.
Quantify the impact of promoters and motifs on polyadenylation site. (major goal)
Use RNA-seq as independent readout of construct expression levels. (minor goal)

Ideally, we would also develop a Quant-Seq rev pipeline that can be re-used for future work, but it's more important to have one that works and is useable for multiple people involved in this present paper.

Tasks

-1. Download data, verify checksums, write protect, backup

Make a git repository to contain just the QS analysis.
Decide on alignment (fasta file + gff file) strategy.
Process reads as much as possible in a single workflow script, through bam files to read counts and maybe even 3' ends?
QC of gross features: assess total reads, aligned reads, adapter and overrepresented reads per sample.
QC of read counts: are RNA-seq reads for genomic reads reproducible per sample?
Plot normalized read counts on reporter constructs across samples. Are we detecting the right constructs in (only) the right samples?
Plot polyadenylation sites on reporter constructs and some control genes.
Submit to gene expression omnibus.

-1. Download data, verify checksums, write protect, backup

i. Download data as zip file to datastore/wallace_rna/bigdata/fastq folder. ii. Keep this zip file. iii. Unzip a copy of the zip file, and md5 checksum the fastq contents. iv. Verify that the files have downloaded correctly by comparing the new md5 sums with the copy originally unzipped. If they don't match, delete the zip file and download again (step i.). When the checksums match for all fastq files, move to next step. v. Make the zip file read-only with chmod. vi. Add zip file name and metadata to datastore/wallace_rna/bigdata/fastq/README.md vii. Make a backup to another location, just in extra case. These data are expensive and irrepleaceable. Note down where.

These instructions ~~need to be added to~~ hopefully agree with those in the manual about where data belongs. Let's double-check.

0. Make a git repo to contain just the QS analysis

It should be clearly organised and accessible to all of us working on it. We can figure out how/if to blend into this repo (chimera_project_manuscript) afterwards.

1. Decide on alignment (fasta file + gff file) strategy.

The point is that we have different sequences for each construct, so either we have to construct one fasta file that we align all the samples to, or we have to align (almost) every sample to a different fasta file, which sounds worse. What did @daneckaw do for the 5P-seq? I don't think documentation was shared, that is issue #66.

I would try first doing a genomic-augmented alignment with a fasta file containing:

all the yeast chromosomes
POT1-ccdb negative control plasmid (gives URA3 - oh hold on a minute, URA3 is in the genome fasta too, need to check that and the 3'UTR on it)
A segment for each alternative reporter gene, containing just the CDS and Xnt downstream of the stop codon. X might be 200nt or 400nt or something like that?

That requires a matched gff file containing:

yeast mRNAs, including (approximate) 3'UTRs
URA3 on POT1-ccdb, including an approximate 3'UTR
an entry for each reporter gene
consistent and helpful feature names.

Alternatively it might be a good idea to align reads just to a transcriptome fasta constructed similarly; then it's easier to count ambiguous reads with salmon after alignment. I don't know. But it would still be important to align to the genome at some point for QC.

A reminder that my "best I could come up with" version of the yeast transcriptome is at yeastutrgff.

2. Process reads as much as possible in a single workflow script, through bam files to read counts and maybe even 3' ends?

Life will be so much easier if we can put this in a workflow script early. I'm a fan of nextflow. My recent "best attempt" for single-end quant-seq fwd is here: CryptoGat201RNASeq_draft/src/quantseqfwd.nf. (That means that it should run on the read 2-only for the QSrev files.) This does a number of nice things including automatically generating a multiQC report for fastqc reads and (hisat2) alignment statistics.

Note: the parameters params.featuretype, params.featurename may need to be adjusted to refer to the feature names and types actually used in the gff file.

The nextflow nf-core rnaseq pipeline covers paired-end reads, but is written in nextflow DSL 2.0 which last time I looked was badly documented or maybe I just didn't understand it. Version 1.4.2 of the nf-core rnaseq pipeline is in the previous version of nextflow.

In short, it would be extremely useful to have a pipeline script, ideally in nextflow, that:

FastQCs
trims adaptors
aligns
counts reads to features. (featureCounts OR salmon OR ??)

3. QC of gross features: assess total reads, aligned reads, adapter and overrepresented reads per sample.

If we include multiQC in the pipeline then it's only a short step to having a supplementary table of read statistics per sample. And the multiQC report to check overrep seqs etc. Let's try that.

4. QC of read counts: are RNA-seq reads for genomic reads reproducible per sample?

A heatmap of correlations between all the samples would be great. We could also try checking in more detail with GGally::ggpairs(). At least a few log-log scatter plots would be great.

5. Plot normalized read counts on reporter constructs across samples. Are we detecting the right constructs in (only) the right samples?

Ideally our alignment strategy and pipeline produces a matrix with the reads on every native yeast gene, and URA3, and every construct. And the scatter plot/ heatmap shows that overall gene-by-gene counts are conserved. Then we should show the normalized counts for each reporter construct, hoping that we don't detect any of them in the negative control (POT1-ccdb) and we detect the variants only in the sample they come from.

For normalizing the reads, I would try in this order:

counts per million aligned reads
ratio of counts to a basket of reference RNAs similar to the qPCR (I think RPS3, PGK1, URA3)?
ratio of counts to the median mRNA
a normalization suite, e.g. DEseq2 or salmon.

Either DESeq2 or salmon could also give us a significance value for fold-changes on the constructs, but that might require some re-naming of the construct column as for DE purposes we want to treat them as the same, but for alignment and QC purposes we want to treat (some of) them as different.

Output should be a plot with x-axis sample ID and y-axis the normalized counts. Probably different shapes for the bioreps? And we should make a couple of comparison plots on control genes to check their stability with the same metric.

We need to take extra care to find out if reads on native RPS3 (in POT1-ccdb negative control sample) align to reporter constructs, in addition to looking for cross-alignments.

At some point here or earlier in the QC, we are likely to find out that the alignment parameters need to be changed for one reason or another. We might even have to change the alignment strategy. So we will be extra-grateful that we can rerun the whole pipeline in a single script.

6. Plot polyadenylation sites on reporter constructs and some control genes.

First check this visually by genome browser plot, probably also with the .bam file reads, to see in all its gory glory.

For the paper, I love the cumulative distribution plots that @daneckaw made for the 5Pseq with x-axis the position of the polyadenylation site near the stop codon, and y-axis the cumulative proportion of 3' ends of reads. Was that made from the bamfile in R or did you do something like a bedtools genomecov -3 first? See genomecov documentation.

Ideally, code to produce the data frame for the cumulative distribution plot would be wrapped in some nice reusable function.

So we would like those plots for:

constructs. Separately I think for the "native RPS3" ones and the "mod" constructs - again, @daneckaw, can you remind us what you did with the 5PSeq.
control genes (again, RPS3, PGK1, URA3 are good examples)

7. Submit to gene expression omnibus

Submit to GEO as soon as possible before we forget anything.

Next steps

Does this make sense?
What needs changing?
Who is doing which part of it?
What timeline can we do it in?

DimmestP / chimera_project_manuscript

Draft plan for QuantSeq analysis #127

Goals:

Tasks

-1. Download data, verify checksums, write protect, backup

0. Make a git repo to contain just the QS analysis

1. Decide on alignment (fasta file + gff file) strategy.

2. Process reads as much as possible in a single workflow script, through bam files to read counts and maybe even 3' ends?

3. QC of gross features: assess total reads, aligned reads, adapter and overrepresented reads per sample.

4. QC of read counts: are RNA-seq reads for genomic reads reproducible per sample?

5. Plot normalized read counts on reporter constructs across samples. Are we detecting the right constructs in (only) the right samples?

6. Plot polyadenylation sites on reporter constructs and some control genes.

7. Submit to gene expression omnibus

Next steps