Progress Report of Bloodies

fangwuwang commented 7 years ago

Progress Report of Bloodies

What has changed based on the final proposal

We are using the same datasets as proposed but has eliminated a few samples from the analysis. For example, we removed one outlier sample from the normal cell RNA-seq data and randomly chose 7 out of 11 samples from the AML dataset to keep the number of two groups (AML and CLL) equal to each other. Also, we noticed that the MEP population in the RNA-seq data we planed to include in our analysis is missing. The major conceptual change of this project is that we will conduct the transcription factor binding analysis based more on the differential DNA methylation results instead of the RNA-seq data, since there is no replicate in the RNA-seq data as mentioned in our proposal and we suspect there is high variability in this data that we cannot effectively measure. The task assignment of group members remains the same except that Fangwu has taken the data preparation part of the DNA methylation and RNA-seq data.

What is the progress of the analyses

I. DNA methylation analysis of the seven normal progenitor populations

Our DNA methylation (bisulfite-seq) and RNA-seq data were from the published BluePrint Epigenome project [1] which were downloaded from GEO (GSE 87197). The RNA-seq data was deposited in our Data folder and the code for getting the methylation data was provided here using GEOquery. There are three biological replicates for each of the seven populations. The data were post-aligned, unquantified in Bed format and we first performed merging of technical replicates to increase the overall coverage. We used bedtools and R to add up the reads from technical replicates (aliquots from the same donor condition) which resulted in three merged data (three biological replicates) for each cell type (details and codes for merging was shown here). Then we used the RnBeads software to conduct the DNA methylation analysis, including data import, QC, methylation quantification, region annotation and differential methylation analysis [2].

The datasets from the samples, that were combined into bed files based on the cell type were then subjected to the RnBeads package for analyzing the different methylation profiles. The code for the RnBeads analysis can be found here and the annotation file here. The data files cannot be shared, however, due to size restriction on GitHub. Prior to running the analysis, I have set certain memory control arguments in the rnboptions() to utilize less resources on the system.

The bed files were imported for analysis using bismarkCov as a bed style as they were combined using BisMark's coverage file output. The particular style was chosen, as the files contain methylation site information on the chromosomes, including methylation signal intensity, and un-methylation intensity which were used to calculate density plot of beta values. We also disabled greedy cut as it consumes a lot of memory for the analysis and thereby slowing the process down. Furthermore, just prior to running the analysis, we set the columns for differential methylation comparison to the cell type as we wanted to look into methylation profiles across the different types of cells.

II. RNA-seq analysis - normal progenitors

The RNA-seq data contains 62,589 rows (transcripts) and 14 columns (samples). The values in the data represent estimated reads from the Bitseq software. We first inspected the data by visualizing the sample-sample correlation in the heatmap and we saw an outlier sample “RNA_D2_GMP_100” that showed relatively poor correlations with other samples, even with the sample of the same cell type “RNA_D1_GMP_100”. The Rscript for this part is here. Then we took out this outlier sample and also removed several populations that we are not looking at (HSCbm, MLP0, MLP2, MLP3). There is no replicate available for the HSC, CMP, GMP, CLP populations and there are two replicates for MPP and MLP. We filtered out the transcripts with the total reads from all samples of less than 50. We found there was a big differences between the mean of each sample the effective library sizes might be different among samples, so we used DESeq sizefactor function to adjust for the discrepancy of library sizes. After this adjustment, the column means look closer to each other. Then we did a clustering in the heatmap of the 2000 most abundantly expressed transcripts to see if the cell types clustered based on the differentiation hierarchy. Unexpected, the clustering pattern was not fully correlated with the differentiation hierarchy. MPP and HSC were closely grouped together indicating similar stem cell expression profile. The GMP population was clustered with the lymphoid progenitors MLP and CLP, which is consistent with recent findings that these progenitors are not completely separated in their functions and there might be “cross-activities” of their lineage output. Other explanations could be that the noise in the samples is pretty high and cannot be taken into account due to the lack of replicate. Also, the most abundant genes may not necessarily reflect the “molecular signatures” of these samples. Without biological replicates, we only looked at the fold-change of expression values between samples only for the exploratory purpose. The R codes for this part is here.

III. RNA-seq analysis - leukemia samples

We obtained two types of leukemia samples (AML and CLL) from the BLUEPRINT consortium as a compensation analysis and validation data for the normal RNA-seq data. There are seven patient samples in each group. A list of differentially expressed genes in the two groups were generated using limma and edgeR Bioconductor packages. We first normalized the data and fitted into a linear model. We created a mean difference plot displaying the log-fold-changes and average log expression values for each gene. We also inspected the P-value distribution. With this linear model, the differentially expressed genes were detected.

Results

I. DNA methylation analysis of the seven normal progenitor populations

The pre-processing of the samples produced a beta (β) value density plot for identifying the criteria, for removing samples from analysis which do not have a good enough coverage. The plot can be found here. We also performed Quality Control on the data for a summary on the coverage in each sample cell type, and a resulting spreadsheet file can be found at this link. The coverage distribution in each of the cell types can also be visualized using the the following violin density plots - A and B. The analysis also produced differential methylation results, and PCA clustering but we are yet to analyze the outputs.

II. RNA-seq analysis - normal progenitors

We calculated the mean value of the two replicates for the two populations (MPP, MLP) and transformed the data to log2 scale and calculated the “log2FC” for each pair of comparison. The lists of genes with over 2-fold changes were generated for the pairwise comparisons of HSC vs. MPP, MPP vs. CMP, MPP vs. CLP, CLP vs. MLP, CMP vs. GMP. These pairs were chosen because they were directly related in the differentiation hierarchy.

III. RNA-seq analysis - leukemia samples

Using edgeR and limma, a mean difference plot was created displaying the log-fold-changes and average log expression values for each gene. From this plot, we found that the genes with high fold-changes between the two groups tend to be moderately or lowly expressed. We generated a distribution histogram for the adjusted-p value (shown as P. Value in the pot). The values are not uniformly distributed which indicates the null hypothesis is rejected or there are significant differences between the two leukemia groups(AML and CLL). With a cutoff of adjusted p-value of 0.05 and logFC of 1, around 9000 genes were evaluated as differentially expressed.

Challenges

A major problem with the RNA-seq analysis is that there is no replicate for many samples so any variation-based statistical analyses are not applicable. So we decided to look at the fold change of the expression value as an exploratory process without drawing any definitive conclusion based on this data. Another challenge with the DNA methylation data is that there are many aliquots consisting of small number of cells (1, 10, 50, 1000 cells) for each biological sample, however the coverage is very low (around 1 per CpG site) due to the low cell input. So we merged the data from samples with 50 and 1000 cells to increase the overall coverage. However, a drawback of this manipulation is that there might be technical variation within the replicates, although they are from the same batch and the same flowcell. And in some cases the intersection between replicates is only a small proportion of the total reads and will not substantially increase the coverage.

References:

Farlik M, Halbritter F, Müller F, et al. DNA Methylation Dynamics of Human Hematopoietic Stem Cell Differentiation. Cell Stem Cell. 2016 Dec 1;19(6):808-822. PMID: 27867036
Assenov Y, Müller F, Lutsik P, et al. Comprehensive analysis of DNA methylation data with RnBeads. Nat Methods. 2014 Nov;11(11):1138-40. PMID: 25262207

psomdeb25 commented 7 years ago

@santina @singha53 Following is our progress report for the assignment, compiled based on each of our contribution.

@fangwuwang @acavalla @rawnakhoque

santina commented 7 years ago

Hi @STAT540-UBC/team-bloodies

Below are my comments. Note that I'm not an expert in this subject. @singha53 feel free to jump in if there's anything to add or if I'm misunderstanding anything.

About your progress report

we removed one outlier sample from the normal cell RNA-seq data and randomly chose 7 out of 11 samples from the AML dataset to keep the number of two groups (AML and CLL) equal to each other.

Why do you need to make sure that sample sizes of two different groups are equal? I don't feel that it's necessary, unless there's a specific reason?

We also disabled greedy cut as it consumes a lot of memory for the analysis and thereby slowing the process down.

I don't know what greedy cut is but usually slow running time is not a justification for cutting short of an analysis, unless we are talking about significant decrease of running time (say 1 week down to a few hours) or that you simply don't have access to enough RAM to run it. If greedy cut is an important configuration that will affect the quality of your analysis, then it's better to run it than to save a few hours.

We first inspected the data by visualizing the sample-sample correlation in the heatmap and we saw an outlier sample “RNA_D2_GMP_100” that showed relatively poor correlations ...

I don't see outlier “RNA_D2_GMP_100” on your heatmap so make sure to fix it in your analysis report :)

We filtered out the transcripts with the total reads from all samples of less than 50.

Are you looking at the gene levels? Why looking at transcripts reads instead of genes reads? Why 50?

Then we did a clustering in the heatmap of the 2000 most abundantly expressed transcripts to see if the cell types clustered based on the differentiation hierarchy.

Why "most abundantly expressed" transcripts? Why not the most variable transcripts or genes? In terms of biology, which ones do you think would be more imformative? (Also are we talking about genes or transcripts?) You could also consider using PCA instead cluster samples based on expressions of all genes.

A list of differentially expressed genes in the two groups were generated using limma and edgeR Bioconductor packages.

edgeR and limma are two separate packages. Both can be used to do RNA-seq analysis but not together. Are you doing two separate RNA-seq analyses and comparing the results? I recently made this RNASeq practice that shows how to use EdgeR, DESeq2, and limma, in addition to seminar 7.

From this plot, we found that the genes with high fold-changes between the two groups tend to be moderately or lowly expressed.

Your plot on log-fold-changes does show that, but it's kind of given since lowly expressed genes naturally can easily have higher log fold changes. For example, going from gene expression of 1 to 3 is easily a 3 fold (not log), versus 100 to 150.

We generated a distribution histogram for the adjusted-p value ...

Usually you'd expect the distribution of unadjusted p value rather than adjusted p values. Read this blog, especially the second paragraph. Also make sure to state your adjustment methods (BH?)

there are many aliquots consisting of small number of cells (1, 10, 50, 1000 cells) ... So we merged the data from samples with 50 and 1000 cells to increase the overall coverage.

Did you do some quality check, like correlation, to make sure that those datasets look similar before you merge them together?

Other comment:

It was hard to figure out which ones you don't have replicates and which ones you do. Making a table somewhere in your Repo and linking it there would help us understand it better when we grade your repo
Maybe I'm just confused, but sometimes I feel like the word "transcript" and "gene" are used interchangeably, but they're not the same things.
It'd help to include a flowchart in your repo and poster as you walk through the steps of your analysis.
Think of each analysis step as making a rmarkdown that's knit into a markdown, where explanation, code, and charts are all in the same place, making it easier for others and your future selves to follow the logic of your study.

Overall, I think you put a lot of thoughts into this study and your project is quite unique. I learned a few things while reading it. Looking forward to see and hear more about your project!

fangwuwang commented 7 years ago

Thank you for the comments and suggestions Santina! These are very helpful. We are going to re-run the RNA-seq analysis on leukaemia samples with all of the samples (11 vs 7) which might increase the power. Sorry about the confusion about the normal RNA-seq data, we have the following 6 populations with replicates shown in the brackets: HSC (1), MPP (2), CMP (1), GMP (1), CLP (2), MLP (2). The data we got from the public source is annotated in transcript (ENST ID) and we planned to first get the list of interesting hits and then convert transcripts into genes. We are currently looking into the ensembldb package, please let us know if you have any experience/advice with this kind of conversion. For the DNA methylation analysis, we opted out greedycut filtering since it is mainly for the array data and is not recommended for bi-seq from the instruction, probably because it is a memory-consuming process ( the running time for the analysis we did last time was 18 hours). @santina @psomdeb25 @rawnakhoque @acavalla

fangwuwang / team_Bloodies