phateR for two more datasets

MichaelPeibo commented 5 years ago

Hi, phateR team

I have been trying phateR with our datasets.

With two or more datasets, after running phate(), I always see the datasets relatively separated from each other when plotting.

One possible reason is that my two datasets are intrinsically separated in some way. I wonder if you have ever encountered a problem like this? E.g., when using two datasets with a large time interval (day 0 and day 18 for the EB data).

Thanks!

dburkhardt commented 5 years ago

Hello Michael - this problem is often called batch effect (for lack of a better phrase) and means that there is some large gap separating the pairwise distances between samples. If you visualize the diffusion operator as a heatmap, you should see much higher affinities within batch than between batch.
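As a rough illustration of that heatmap check, here is a minimal base-R sketch on simulated data. It does not use PHATE's own diffusion operator: the Gaussian kernel, the median-distance bandwidth, and the simulated shift between batches are all assumptions made for illustration.

```r
set.seed(1)

# Two simulated batches of 30 cells x 5 features; batch 2 is shifted (assumed batch effect).
b1 <- matrix(rnorm(150), ncol = 5)
b2 <- matrix(rnorm(150, mean = 4), ncol = 5)
x <- rbind(b1, b2)
batch <- rep(c(1L, 2L), each = 30)

# Gaussian affinities from pairwise Euclidean distances
# (assumed kernel; bandwidth set to the median pairwise distance).
d <- as.matrix(dist(x))
sigma <- median(d[upper.tri(d)])
aff <- exp(-(d / sigma)^2)

# Row-normalise so each row sums to 1, giving a diffusion-like operator.
diff_op <- aff / rowSums(aff)

# Compare mean transition probability within vs between batches (self-pairs excluded).
same_batch <- outer(batch, batch, "==")
off_diag <- row(diff_op) != col(diff_op)
within_mean <- mean(diff_op[same_batch & off_diag])
between_mean <- mean(diff_op[!same_batch])

# With cells ordered by batch, a batch effect shows up as two bright diagonal blocks.
if (interactive()) {
  heatmap(diff_op, Rowv = NA, Colv = NA, scale = "none")
}
```

If within_mean is much larger than between_mean, the operator is dominated by within-batch connections, which is the pattern described above.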

The solution here is to apply some method of batch correction prior to running PHATE. If you'd like more detailed help, please feel free to post some images on our Slack help group (krishnaswamylab.org/get-help).

MichaelPeibo commented 5 years ago

Hi, @dburkhardt

Thanks for the help. I know of some packages that address batch effects, such as ScaleData() in Seurat or multiBatchNorm() in scran. I will try passing the data matrix to phate() after scaling.

I just wonder which kind of method has worked best in your experience with phate()?

dburkhardt commented 5 years ago

First, a caveat: Batch correction forces connections between cells that don't exist in Euclidean space in a dataset. Each algorithm makes certain assumptions about the nature of the batch effect. If those assumptions are broken, then the algorithm will happily merge the datasets in a non-valid way without telling you. This is beyond the scope of PHATE, and your mileage may vary.

That all being said, we find that MNN works well for most cases. The paper describing MNN is here: https://www.ncbi.nlm.nih.gov/pubmed/29608177. I would first try to batch normalize using MNN, then visualize the corrected data using PHATE.

Please let me know if this was helpful, and share how this works!

MichaelPeibo commented 5 years ago

Hi @dburkhardt Thanks for introducing the MNN method. I tried it with the fastMNN function, but I am still confused about how to use the MNN-corrected output, since the MNN-corrected values look like a low-dimensional embedding.

So, what do you mean by "visualizing the corrected data using PHATE"?

What should I use as input to the phate function to get PHATE embeddings?

Thanks! Happy new year!

MichaelPeibo commented 5 years ago

Hi @dburkhardt Any suggestions on my latest question?

Many Thanks!

dburkhardt commented 5 years ago

Hi Michael,

Sorry for the slow response. I don't actually have experience using fastMNN. Looking at this page:

https://rdrr.io/github/MarioniLab/scran/man/fastMNN.html

library(scran) # the page above documents fastMNN under scran
B1 <- matrix(rnorm(10000), ncol=50) # Batch 1
B2 <- matrix(rnorm(10000), ncol=50) # Batch 2
out <- fastMNN(B1, B2) # corrected values
names(out)

# An equivalent approach with PC input.
cB1 <- cosineNorm(B1)
cB2 <- cosineNorm(B2)
pcs <- multiBatchPCA(cB1, cB2)
out.2 <- fastMNN(pcs[[1]], pcs[[2]], pc.input=TRUE)
all.equal(head(out,-1), out.2) # should be TRUE (no rotation)

# Obtaining corrected expression values for genes 1 and 10.
cor.exp <- tcrossprod(out$rotation[c(1,10),], out$corrected)
dim(cor.exp)

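It seems to me that cor.exp holds the corrected expression values, in this example only for genes 1 and 10; using all rows of out$rotation instead should give the corrected values for all genes.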
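A minimal base-R sketch of that reconstruction, using random stand-in matrices rather than real fastMNN output. The shapes (rotation as genes x dimensions, corrected as cells x dimensions) are an assumption inferred from the example above, and the names rotation, corrected and phate_input are hypothetical.

```r
set.seed(42)
n_genes <- 20
n_cells <- 100
n_dims <- 5

# Stand-ins for out$rotation (genes x dims) and out$corrected (cells x dims).
# These are random numbers, not real MNN output.
rotation <- matrix(rnorm(n_genes * n_dims), nrow = n_genes)
corrected <- matrix(rnorm(n_cells * n_dims), nrow = n_cells)

# As in the example: corrected expression for genes 1 and 10 only (2 x cells).
cor_exp_subset <- tcrossprod(rotation[c(1, 10), ], corrected)

# Same operation without the row subset: all genes (genes x cells).
cor_exp_all <- tcrossprod(rotation, corrected)

# phate() takes cells as rows, so transpose to cells x genes.
phate_input <- t(cor_exp_all)
dim(phate_input)

# Not run here, since it needs phateR and its Python backend:
# embedding <- phateR::phate(phate_input)
```

Under the same shape assumption, the cells x dimensions matrix (out$corrected) is already laid out with cells as rows.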

We've also implemented an experimental MNN feature in PHATE on the dev branch.

You can install the dev version using devtools

devtools::install_github("KrishnaswamyLab/phateR", ref="dev")

And use the method with the following syntax

phateR::phate(df, sample_idx=as.integer(c(rep(1, nrow(df)-50), rep(2, 50))), kernel_symm='theta', theta=as.double(1), beta=as.double(1))

Here, sample_idx is an array with one value per cell, giving the batch that cell belongs to. The kernel_symm='theta' is where MNN is specified. The beta parameter scales the degree of each point within vs between batches. beta=1 means that the degree of a point to 'other' batches should be no more than 1 times the degree within batches. Large values do more batch correction, small values do less.
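For illustration, sample_idx can be built directly from the batch sizes instead of hard-coding 50. This is a base-R sketch with simulated batches; batch1, batch2 and their sizes are hypothetical, and the phate() call is left commented out because it depends on the experimental dev branch.

```r
set.seed(7)

# Two simulated batches with the same 10 features (hypothetical sizes).
batch1 <- matrix(rnorm(150 * 10), ncol = 10)
batch2 <- matrix(rnorm(50 * 10), ncol = 10)

# Stack cells from both batches; rows are cells.
df <- rbind(batch1, batch2)

# One integer label per row of df: 1 for batch 1 cells, 2 for batch 2 cells.
sample_idx <- rep(c(1L, 2L), times = c(nrow(batch1), nrow(batch2)))

# Sanity check: one label per cell, and the expected number per batch.
stopifnot(length(sample_idx) == nrow(df))
table(sample_idx)

# Call from the comment above (experimental dev branch, not run here):
# phateR::phate(df, sample_idx = sample_idx, kernel_symm = 'theta',
#               theta = as.double(1), beta = as.double(1))
```

Deriving the labels from nrow() of each batch keeps sample_idx in sync with df if the batch sizes change.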

This code is still experimental, so YMMV. We'd love to hear if you try this and how it works with your data.

MichaelPeibo commented 5 years ago

Hi @dburkhardt Nice demonstration! Actually, I am confused about data from different batches/time points; I also mentioned this here.

In brief, a more specific case: time-course dataset 1 does not have geneX, while time-course dataset 2 only has geneX days later. MNN correction requires the same genes in every batch, so will it filter out some meaningful genes?

I also see that this tutorial does not seem to do batch correction, only filtering and merging of batches.

KrishnaswamyLab / phateR

phateR for two more datasets #26