Slingshot - Githubissues

Hello @kstreet13 and @sandrinedudoit

This issue is for discussing the wrapper for your trajectory inference method, slingshot, which we wrapped for our benchmarking study (10.1101/276907). In our dynmethods framework, we collected some meta information about your method, and created a docker wrapper so that all methods can be easily run and compared. The code for this wrapper is located in a docker container. The way this container is structured is described in this vignette.

We are creating this issue to ensure your method is being evaluated in the way it was designed for. The checklist below contains some important questions for you to have a look at.

[x] Parameters, defined in definition.yml (more info)
- Are all important parameters described in this file?
- For each parameter, is the proposed default value reasonable?
- For each parameter, is the proposed parameter space reasonable (e.g. lower and upper boundaries)?
- Is the description of the parameters correct and up-to-date?
[x] Input, defined in definition.yml and loaded in run.R (more info)
- Is the correct type of expression requested (raw counts or normalised expression)?
- Is all prior information (required or optional) requested?
- Would some other type of prior information help the method?
[x] Output, defined in definition.yml and saved in run.R (more info)
- Is the output correctly processed towards the common trajectory model? Would some other postprocessing method make more sense?
- Is all relevant output saved (dimensionality reduction, clustering/grouping, pseudotime, ...)
[x] Wrapper script, see run.R (more info)
- This is a script that is executed upon starting the docker container. It will receive several input files as defined by definition.yml, and is expected to produce certain output files, also as defined by definition.yml.
- Is the script a correct representation of the general workflow a user is expected to follow when they want to apply your method to their data?
[x] Quality control, see the qc worksheet
- We also evaluated the implementation of a method based on a large check list of good software software development practices.
- Are the answers we wrote down for your method correct and up to date? Do you disagree with certain answers? (Feel free to leave a comment in the worksheet)
- You can improve the QC score of your method by implementing the required changes and letting us know. Do not gloss over this, as it is the easiest way to improve the overall ranking of your TI method in our study!

The most convenient way for you to test and adapt the wrapper is to install dyno, download and modify these files, and run your method on a dataset of interest or one of our synthetic toy datasets. This is further described in this vignette. Once finished, we prefer that you fork the dynmethods repository, make the necessary changes, and send us a pull request. Alternatively, you can also send us the files and we will make the necessary changes.

If you have any further questions or remarks, feel free to reply to this issue.

Kind regards, @zouter and @rcannood

Hey @zouter and @rcannood,

Thanks for putting this all together! Overall, the Slingshot wrapper looks good, here are my responses to your questions:

Parameters

We recently added the reassign parameter, which replaces drop.multi (default = TRUE).
It looks like the default clustering method is k-means with 5 clusters, but we think this may not be enough clusters to identify more complicated lineage structures. We generally select a clustering method based on the type of data being analyzed and expected number of clusters/lineages, but for the purposes of your simulation, would it be possible to choose the number of clusters in a more data-adaptive way (ie. silhouette width, AIC/BIC, gap statistic, etc)? We also generally prefer other clustering methods, such as mclust or RSEC.
Similarly, 3-dimensional PCA is not appropriate for all datasets/types. In practice, we would choose the method we find most appropriate, but any way of making the choice of dimensionality reduction more data-dependent would be appreciated (ie. TSCAN's regression-based method for choosing the number of PCs to retain).
The values for thresh look a little strange. Our default is 0.001, but you have it listed as -3.0. Should this be 10^-3.0?
Our default value for maxit is 15, not 10.

Input

I think this is correct, but just to be sure: the input data should be normalized expression levels (our default was quantile normalization, as in the FQnorm function).

Output

Looking at the run.R script, it looks like you're getting the pseudotimes from the curves, though we recommend using the slingPseudotime function. Similarly, weights of assignment for cells to lineages can be extracted with the slingCurveWeights function. This shouldn't change the actual values, however.

Wrapper script

Again, we generally hope that users will try multiple dimensionality reduction and clustering methods before settling on one that works well for their data. But otherwise, this is a good representation of the intended workflow.

Quality control

Thank you for the helpful suggestions! We have made several improvements (see comments on Google Sheet).
Regarding the postprocessing (line 43), there should be no need of an additional projection step after running Slingshot.

Best, @kstreet13

Thanks for getting back to us :) Could you have a look at sections Clustering and dimensionality reduction and Output ?

Parameters

I have made several changes to the parameters in response to your comments (616ef9b5e899ebefd383187a4c156c3b96de5bae).

We recently added the reassign parameter, which replaces drop.multi (default = TRUE).

reassign has been added to the parameter list.

The values for thresh look a little strange. Our default is 0.001, but you have it listed as -3.0. Should this be 10^-3.0?

Sorry, this was a mistake on our part, we did some recent refactoring of dynmethods during the implementation of docker support :)

Our default value for maxit is 15, not 10.

Fixed!

Clustering and dimensionality reduction

It looks like the default clustering method is k-means with 5 clusters, but we think this may not be enough clusters to identify more complicated lineage structures.

We agree :)

We generally select a clustering method based on the type of data being analyzed and expected number of clusters/lineages, but for the purposes of your simulation, would it be possible to choose the number of clusters in a more data-adaptive way (ie. silhouette width, AIC/BIC, gap statistic, etc)? We also generally prefer other clustering methods, such as mclust or RSEC.

Similarly, 3-dimensional PCA is not appropriate for all datasets/types. In practice, we would choose the method we find most appropriate, but any way of making the choice of dimensionality reduction more data-dependent would be appreciated (ie. TSCAN's regression-based method for choosing the number of PCs to retain).

Again, we generally hope that users will try multiple dimensionality reduction and clustering methods before settling on one that works well for their data.

I understand that performing 3 dimensional PCA and kmeans clustering might now always provide the best results. However, I don't think you should expect an end-user to have to choose which dimred or clustering method "looks the best" (though you can certainly present him with some parameters if he wishes to tune the results). Our benchmarks also don't support allowing a human being to decide which dimensionality reduction and clustering method is fit for a given dataset.

If you would like the dimensionality reduction and the clustering to depend on the given data, I would like for you to propose some code on how to do this. Ideally, this code should be integrated into the Slingshot package, as otherwise there will be a difference in behaviour between the bioconductor/slingshot and the dynverse/slingshot.

Input

I think this is correct, but just to be sure: the input data should be normalized expression levels (our default was quantile normalization, as in the FQnorm function).

Yes, we used the normalisation as shown in the bioconductor vignette, namely:

norm <- FQnorm(counts)
expr <- log1p(norm)
pca <- prcomp(expr)

Output

Looking at the run.R script, it looks like you're getting the pseudotimes from the curves, though we recommend using the slingPseudotime function. Similarly, weights of assignment for cells to lineages can be extracted with the slingCurveWeights function. This shouldn't change the actual values, however.

I've attached an explanation of how we convert the output provided by Slingshot to the milestone_network and progressions objects that dynwrap requires.

Back when we originally implemented the Slingshot wrapper, we couldn't find a way to combine the pseudotime information provided by slingshot with the structure of the trajectory (e.g. A→B, B→C, B→D). We opted to use the trajectory provided by Slingshot, and to project the cells to the nearest point on the segment (as explained in section "Progressions with projection" in the attached html file).

However, the output produced by slingPseudotime and slingCurveWeights reminds me of the output produced by FateID, SCOUP, GPfates and STEMNET, where only a pseudotime and end state probabilities are given. Check out section "Progressions with fate assignment" of the attached html file to see how this wrapper would work. Do you think this methodology is closer to what you had in mind? This method might get a lower topology score, as slingLineages is not used anymore to obtain the milestone_percentages.

If you like, we can call the projection-based approach "Slingshot Projection", and the fate assignment based approach just "Slingshot". We can remove the projection-based wrapper altogether.

Quality control

Thank you for the helpful suggestions! We have made several improvements (see comments on Google Sheet).

I have made the following changes to your QC scores. This should increase the QC value of slingshot by quite a bit :)

Is code coverage assessed? 0 → 1
What is the percentage of code coverage? 0 → 0.78
Method is published in a peer-review journal? 0 → 1
Paper quantifies the accuracy of the method ....? 0 → 1

Regarding the postprocessing (line 43), there should be no need of an additional projection step after running Slingshot.

I left this step unchanged for now, depending on what your preferences are with respect to the discussion in the previous section.

Sorry for the delay! After giving it some thought, here are my suggestions.

Dimensionality Reduction and Clustering

Given how well Slingshot performed with your original defaults, I didn't want to change things too much, but I think these methods should be pretty widely applicable and a little more data-adaptive.

# ----- dimensionality_reduction
library(princurve)
# this code is adapted from the expermclust() function in TSCAN
# the only difference is in how PCA is performed
# (they specify scale. = TRUE and we leave it as FALSE)
x <- 1:20
optpoint1 <- which.min(sapply(2:10, function(i) {
    x2 <- pmax(0, x - i)
    sum(lm(pca$sdev[1:20] ~ x + x2)$residuals^2 * rep(1:2,each = 10))
}))
# this is a simple method for finding the "elbow" of a curve, from
# https://stackoverflow.com/questions/2018178/finding-the-best-trade-off-point-on-a-curve
x <- cbind(1:20, pca$sdev[1:20])
line <- x[c(1,nrow(x)),]
proj <- project_to_curve(x, line)
optpoint2 <- which.max(proj$dist_ind)-1
# we will take more than 3 PCs only if both methods recommend it
optpoint <- max(c(min(c(optpoint1,optpoint2)), 3))
rd <- pca$x[, 1:(optpoint)]
# -----

# ----- clustering
library(cluster)
clusterings <- lapply(3:10, function(K){
    pam(rd, K) # we generally prefer PAM as a more robust alternative to k-means
})
# take one more than the optimal number of clusters based on average silhouette width
# (max of 10; the extra cluster improves flexibility when learning the topology, 
# silhouette width tends to pick too few clusters, otherwise)
wh.cl <- which.max(sapply(clusterings, function(x){ x$silinfo$avg.width })) + 1
cl <- clusterings[[min(c(wh.cl, 8))]]$clustering
# -----

Output

I think that both of the methods you describe for handling the output make sense and either one would be a fair interpretation. Ideally, it would be good to have a conversion strategy that uses both the inferred topology and the simultaneous principal curves, but this seems to be a tough fit with the common trajectory model.

My best idea is to define milestones at the endpoints and branching points of all lineages. Each cell can be assigned to a lineage based on its maximum curve weight and then to a particular branch based on where its pseudotime lies relative to the means of each of these milestone clusters. I'm attaching an R script with a function I've written to perform this conversion (from a SlingshotDataSet), but it's entirely possible that I've misunderstood something about the common trajectory model, so just let me know if this doesn't work for any reason.

Hello Kelly,

I've added in the dimensionality reduction, clustering and calculation of progressions suggested by you: run.R (currently still in the devel branch), and it seems to be working just fine!

@zouter and I are planning to do another iteration of the benchmarks (more methods, more datasets) soon. We'll keep you posted :)

Robrecht

Oh, and I changed QC score at row 44 from '0.5 Moderate' to 1

Great, thank you very much!

I'm closing this issue. If ever you have any further remarks, please feel free to make a new issue :)

dynverse / dynmethods

Slingshot #53

Parameters

Clustering and dimensionality reduction

Input

Output

Quality control

Dimensionality Reduction and Clustering

Output