The observed CRISPR/Cas9 twist knock-out phenotype (Bubble, failed tentacle formation) is novel in Nematostella vectensis.
Two general hypotheses:
NV-twist is involved in tentacle formation (broader context; muscle cell differentiation)
NV-twist as a master TF regulator is involved in initiating a cancer-like phenotype (first described in Nematostella vectensis)
Bubble (twist mutant with phenotype)
TwiHead (twist mutant without phenotype)
WTHead (as the name says)
Twi4d (twist mutant 4dpf)
WT4d (WT 4dpf)
All were done as triplicates and have the suffix 1,2,3 added to their name depending on the replicate
For the resequenced libraries the have the RS suffix added to them.
Bubble 1,2 & 3
Twihead 2
WTHead 1 & 2
Compared were:
Bubble vs TwiHead
Bubble vs WT
Twihead vs WT
Twi4d vs WT
From the bulk-RNA Seq we observed 4 variants which are mutant-specific. We exclusively concentrated on INDELs (because they are most likely to produce truncating proteins). We run the pipeline chromosome-wise, meaning that we get population estimates from the input samples (sum 15). After annotation and filtering, we have a list of 4 INDEL variants which are of particular interest (present in all 9 mutant bulk-RNA samples, and missing in the remaining 6 wildtype samples).
Moreover, I discovered 5 variants which are unique to the wild-type samples (e.g. TwiHead, WT4d, etc.) and are missing in the CRISPR-mutants. Meaning that those variants acquired the mutation 'naturally' (de novo), and CRISPR doesn't effect those genes in the mutants. As we are not particularly interested in those, we are not concentrating on this oberservation for this project.
From this point of view, we will consider NV2.10979 and obviously twist in the scRNA data. The questions we want to address are:
Before, I move on to scRNA libraries (data), I will define a GOI - differentially expressed in the mutants and wild-types. This set of genes should serve as a sanity check, whether we see the same pattern in scRNA data.
We have used cellranger to map the RNA reads to the reference genome and obtained a count-matrix per cell. Individual cells are captures in the cellranger output, through unique Barcode sequences (UMI's). This means, Seurat uses a count-matrix, where columns represent unique DNA-Sequences (UMI's = cells) and rows are features (in the provided annotation table we have roughly 20,000 features for Nematostella). For every cell and feature we have a count (produced by cellranger; and will be loaded in Seurat.
First we look into features (per cell), RNA counts and percentage MT. These metrics are used as a quality measure. Cells where the the feature count is below 250 are discarded (not informative) and cells with RNA-counts >20,000 are equally discarded - as this may be indicative of a duplicate cell sequenced (two cells in 10X rather than a single Transcriptome). We filter out those cells, apply a normalization and scaling approach scale_factor = 5,000
[Alison's suggestion].
We then merge the two libraries (Mutant and Control) and define a list of highly variable genes (by default, Seurat uses vst
AND nfeatures = 2000
).
We then run a PCA runPCA
(first dimision reduction). Every dot basically represents a cell with it's transcriptome. This means, with runPCA
we take the 2,000 variable features (among cells) and cluster cells (dimension reduced) with similar transcriptomic profiles runPCA example.
What we can clearly see from the plot:
quantitative (normalized) NvTwist expression among cell clusters. You can see that NvTwist is expressed in both libraries (at a varying degree). However, consider the different density. It seems like that NvTwist is more expressed in controls and in mutant libraries (especially, the left corner).
Constructs a Shared Nearest Neighbor (SNN) Graph for a given dataset
.
How to select the resolution. You see we start off with 14 clusters; Looking into the top-row, we see that at low resolution we have direct trajectories. However, there is one cell-identity lineage which does not change at any resolution, meaning that 9 stays very much the same and show distinct features. Whereas cluster 0 gets a little bit blurry and splits in many smaller clusters (at higher resolution). Looking at low-level plots (getting sense of data); a low resolution is good to go. However, we are interested in getting the most discinct clusters - just looking @this plot, we are choosing resolution = 0.8
, because there we are keeping the most sensitiviy and structure.
When we look into the reduced and clustered ID's Figure we can clearly see that there are two features which distiguish the mutants from the controls. Firstly, we have a cell cluster 20(turquois) which is apparent in the mutants and missing in the the controls.
Secondly, we can also see that cell cluster 13 may be of interest, as this cluster is enriched for NvTwist in the control and largely missing in the mutants. At least, NvTwist is lower expressed in the mutant library at this cluster. We can also see that the overall clustering pattern looks 'clean' in the sense, that there is little dispersion of different cell-identities with other clusters. Apart from 23 AND 24. Look also at this Figure. There you can clearly see that cluster 20 seems to exclusively be just available at the mutant library where it is missing in the control.
RNA-Seq (bulk) were qc'ed, trimmed and aligned to the nemVec2 genome using STAR. STAR is currently the best INTRON-aware aligner;
We are specifically interested in variants in the bulk-RNA data, as single-cell data can't be used to find off-targets. We are using the RNA-bulk data as a sanity check for the single-cell data. Biological material for both sequencing strategies (bulk, single-cell) comes from the same animals!
We are running bcftools mpileup population-wise (meaning all 15 samples together) and loop through chromosomes. Because we are running bcftools population-wise we can calculate proper AF values. The amount of alternative alleles (within 15 samples) divided by reference (amount). AF < 0.2 are discared.
We then move on to find intersects (bedtools); we concentrate just on tcs_exon_gtf - as we only want to see variants on exons (likely functional effect); biological effects of intronic variants are hard to predict!
twist; Nv2.10864; chr2: 2,358,626 - 2,359,015 (gene coordinates and location: 389 bp long); mRNA
bcftools mpileup -f reference.fa alignments.bam | bcftools call -mv -Ob -o calls.bcf
[/scratch/kreitzer]: bcftools mpileup -f /scratch/jmontenegro/nvectensis/data/refs/nv_dovetail_4_gapped_chroms.final.fasta /proj/ferrer/rna_seq_twist/X204SC20120808-Z01-F001/raw_data/results/map/[...Aligned.bam] | bcftools call -mv -Ob -o calls.test.bcf
one working example on /scratch/
https://wiki.csb.univie.ac.at/doku.php?id=content:working_environment:services:slurm_workload_manager
Working is intented to take place on /scratch
Processing and analysis should be conducted there. Once pipeline is through (/scratch/) ~ move data to /proj/