Running slingshot on low dimensional data

naddsch commented 5 years ago

Hi there, I am just trying to use your package on flow cytometry data. I read in the User Guide that "Currently, alternative input data such as ATAC-Seq or cytometry data are not yet supported, although it is possible to simply include this data as expression and counts."

I am trying an example dataset of ~30,000 cells and 5 dimensions with a pre-set start_id, because I know how the cells develop along these 5 dimensions. I wrapped this data into the counts and expression of a dataset.
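For readers following along, wrapping a cytometry matrix might look roughly like the sketch below. This is an illustration on simulated data, not the code from the post: the cell IDs, marker names and choice of start cell are made up, and it assumes dynwrap's wrap_expression() and add_prior_information() are the functions used, which the post does not state. The dynwrap part only runs if that package is installed.

```r
# Simulated stand-in for a cytometry matrix: cells in rows, markers in columns.
# (The post describes ~30,000 cells x 5 dimensions; 300 cells keeps this quick.)
set.seed(1)
n_cells <- 300
n_markers <- 5
expression <- matrix(
  rexp(n_cells * n_markers),
  nrow = n_cells,
  dimnames = list(
    paste0("cell_", seq_len(n_cells)),      # hypothetical cell IDs
    paste0("marker_", seq_len(n_markers))   # hypothetical marker names
  )
)

# Cytometry data has no real counts, so the same matrix is reused for both
# slots, following the User Guide's suggestion quoted above.
counts <- expression

# Hypothetical known starting cell.
start_id <- rownames(expression)[1]

# Assumption: dynwrap is the wrapping package; skipped if it is not installed.
if (requireNamespace("dynwrap", quietly = TRUE)) {
  dataset <- dynwrap::wrap_expression(counts = counts, expression = expression)
  dataset <- dynwrap::add_prior_information(dataset, start_id = start_id)
}
```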

Unfortunately, when running

model_slingshot <- infer_trajectory(dataset, ti_slingshot(), verbose = TRUE)

I get the following error:

Executing 'slingshot' on '20190626_154646__data_wrapper__e6QaaLwGP3'
With parameters: list(shrink = 1L, reweight = TRUE, reassign = TRUE, thresh = 0.001,     maxit = 10L, stretch = 2L, smoother = "smooth.spline", shrink.method = "cosine"),
inputs: expression, and
priors : 
Input saved to C:\Users\...\AppData\Local\Temp\RtmpKGtDxe\file29f863416e7c/ti
Running method using babelwhale
Running "C:\PROGRA~1\DOCKER~1\docker.exe" run -e "TMPDIR=/tmp2" --workdir /ti/workspace -v "/c/Users/.../AppData/Local/Temp/RtmpKGtDxe/file29f863416e7c/ti:/ti" -v \
  "/c/Users/.../AppData/Local/Temp/RtmpKGtDxe/file29f843c423f4/tmp:/tmp2" "dynverse/ti_slingshot:v0.9.9.01" --dataset /ti/input.h5 --output /ti/output.h5
Loading required package: princurve
Error in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth = TRUE,  : 
  max(nu, nv) must be strictly less than min(nrow(A), ncol(A))
Calls: <Anonymous> -> do.call -> <Anonymous>
Execution halted
Error: Error during trajectory inference 
Loading required package: princurve
Error in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth = TRUE,  : 
  max(nu, nv) must be strictly less than min(nrow(A), ncol(A))
Calls: <Anonymous> -> do.call -> <Anonymous>
Execution halted

I think that the following line in ti_slingshot causes the problem, since n = 20 is fixed here and my dataset has fewer than 20 dimensions:

pca <- irlba::prcomp_irlba(expression, n = 20)

Or am I missing something?

Thanks in advance, Laura

zouter commented 5 years ago

Hi Laura!

Nice investigative work, that line was indeed the problem. I have now changed this inside the wrapper and added an ndim parameter in case you want to change the number of components.
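The constraint behind the error is that a truncated SVD can only return strictly fewer components than the smaller dimension of the input matrix. Below is a minimal sketch of the kind of guard described here. It is not the actual wrapper code: the helper name reduce_dims is hypothetical, and base R's stats::prcomp stands in for irlba::prcomp_irlba so that the example is self-contained.

```r
# Hypothetical helper: reduce to at most `ndim` components, skipping the
# reduction entirely when the data already has that few columns.
reduce_dims <- function(expression, ndim = 20L) {
  if (ncol(expression) <= ndim) {
    # Nothing to reduce: use the data as-is.
    return(expression)
  }
  # Assumption: stats::prcomp replaces irlba::prcomp_irlba here to avoid an
  # extra dependency; `rank.` caps the number of returned components.
  pca <- stats::prcomp(expression, rank. = ndim)
  pca$x
}

set.seed(1)
low_dim  <- matrix(rnorm(200 * 5),  ncol = 5)   # 5 features, like the cytometry data
high_dim <- matrix(rnorm(200 * 50), ncol = 50)  # 50 features

dim(reduce_dims(low_dim))   # unchanged: the reduction is skipped
dim(reduce_dims(high_dim))  # reduced to `ndim` columns
```

With a guard like this, a 5-dimensional dataset never reaches the PCA call, so the fixed component count can no longer exceed the number of columns.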

This is now building on travis, I'll post an update once that's finished.

Slingshot will probably take a while to run on this dataset (estimated 2 hours). Have you also considered other methods, such as PAGA?

Best, Wouter

rcannood commented 5 years ago

(slingshot has recently received some speed improvements, see kstreet13/slingshot#31, so the new build might be a lot more scalable.)

naddsch commented 5 years ago

Hi Wouter, thanks for the quick help and reply! I think in your commit "add ndim parameter, fix for dynverse/dyno#55" you might have forgotten to replace n=20 by n=ndim in the call to

pca <- irlba::prcomp_irlba(expression, n = 20)

Actually, I also tried to use PAGA and PAGA tree, but there I get another error. I'm still trying to figure it out and will open another issue for that method :)

Best, Laura

zouter commented 5 years ago

Hi Laura

Thanks :blush:

For PAGA, your error is probably related to the feature filtering that is done at the beginning. It's still an enigma to me why it errors internally. Feel free to open a new issue!

zouter commented 5 years ago

This should be fixed in ti_slingshot 1.0.2 and onwards. You can run this now using:

infer_trajectory(dataset, "dynverse/ti_slingshot:v1.0.2")

It will be included in dynmethods soon-ish, once Travis stops complaining :crossed_fingers:

naddsch commented 5 years ago

Thank you so much! It's skipping dimensionality reduction now. :blush:

Unfortunately, I ran into another error afterwards:

Error: cannot allocate vector of size 3.1 Gb
Execution halted

I already increased memory.limit() to 16 GB, as this is what's installed on my machine, but the error keeps popping up. I'm running 64-bit R on 64-bit Windows 7. Do you know how to get rid of this error?

And there is not much memory used when running infer_trajectory:

> memory.size()
[1] 687.63

Best, Laura

zouter commented 5 years ago

I did a quick investigation and I think this is caused by the pam clustering, which doesn't scale well as the number of samples increases. I added a parameter cluster_method with which you can change the clustering method to clara, which is closely related to pam but much more scalable.
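To make the scaling argument concrete: pam works on all pairwise dissimilarities, so its memory use grows quadratically with the number of cells, whereas clara runs pam on small subsamples. The sketch below is illustrative only. The helper name dist_gib is hypothetical, the data is simulated, and it is an assumption that the failed 3.1 Gb allocation was exactly this dissimilarity structure.

```r
library(cluster)  # provides pam() and clara(); ships with standard R installs

# Hypothetical helper: approximate memory (in GiB) needed to hold the
# n * (n - 1) / 2 pairwise dissimilarities as 8-byte doubles.
dist_gib <- function(n) n * (n - 1) / 2 * 8 / 1024^3

dist_gib(1000)    # small datasets are cheap
dist_gib(29000)   # roughly 3 GiB, the same order as the failed allocation

# clara() clusters subsamples and then assigns every observation to its
# nearest medoid, so the full dissimilarity matrix is never built.
set.seed(1)
x <- matrix(rnorm(5000 * 5), ncol = 5)  # simulated: 5,000 cells x 5 features
fit <- clara(x, k = 4, samples = 10)
table(fit$clustering)
```

Because the subsample size stays fixed, memory stays roughly flat as the number of cells grows, at the cost of medoids that are only approximately optimal.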

You might have to lower the ndim argument as well, because otherwise you might get some convergence errors (the principal curves algorithm seems to be very sensitive to this).

Travis is building the container at the moment (https://travis-ci.org/dynverse/ti_slingshot/builds/551362660). If all goes well, you should be able to run

infer_trajectory(dataset, "dynverse/ti_slingshot:v1.0.3", ndim = 3, cluster_method = "clara", verbose = TRUE)

in a couple of minutes.

Along with the improvements that @rcannood made, this should make slingshot more scalable, although it still takes a few minutes to run on my example with 30k cells and 10 features.

naddsch commented 5 years ago

Perfect, this is working like a charm! Thanks a lot, I will now try to figure out how to proceed in the analysis pipeline :smiley:

Have a nice weekend!
