Tutorial for high dimensional cytometry data

luglilab commented 2 years ago

Hi,

I'd like to use pyVIA with flow cytometry data, could you add a tutorial for this kind of data?

Do you have any suggestions about which parameters to set?

Thank you in advance.

Simone

ShobiStassen commented 2 years ago

Hi Simone,

Thanks for your message. Have you seen this tutorial for a mass cytometry dataset? I think it would help you: https://pyvia.readthedocs.io/en/latest/mESC_timeseries.html. This tutorial uses the time-series information in addition to the surface marker expression, but you can just ignore the time-series input labels.

Can I ask what the dimensions of your data are before PCA etc. (n_cells x n_markers)? Depending on the dimensionality, you may or may not opt for PCA (e.g. the top 30 PCs) before running Via. If you have a fairly concise set of meaningful proteins/surface markers, you might be better off avoiding PCA. Typically a knn of around 20-30 is good for most datasets, unless you have a very low cell count.

If you have a look at the tutorials for other types of data, you can probably use them as a starting point for parameters and then tune depending on the outcome. I have been meaning to make a Parameter Tuning Tutorial too, it's on my ToDo :) The parameters which have the most impact are:

Number of K Nearest Neighbors, and number of PCs (if you do PCA)
jac_std_global (somewhere between 0.15 and 2, with lower values giving more, smaller clusters)
cluster_graph_pruning_std (also between 0.15 and 2, with smaller numbers meaning fewer edges retained in the cluster graph)
too_big_factor (between 0.1 and 0.3), where smaller numbers break up large clusters to offer more granularity

To make Streamplots (no RNA velocity needed for this), see https://pyvia.readthedocs.io/en/latest/mESC_timeseries.html and https://pyvia.readthedocs.io/en/latest/ViaJupyter_Pancreas_RNAvelocity.html
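
As a rough sketch of how the ranges above could be kept in one place while tuning, the snippet below collects the parameters into a dictionary and rejects values outside the ranges quoted in this comment. The helper name `make_via_kwargs` and the default starting values are illustrative assumptions, and the commented-out `via.VIA(...)` call at the end is an assumed usage that should be checked against the pyVIA tutorials for your installed version.

```python
# Ranges quoted in this thread for the most influential parameters.
SUGGESTED_RANGES = {
    "jac_std_global": (0.15, 2.0),
    "cluster_graph_pruning_std": (0.15, 2.0),
    "too_big_factor": (0.1, 0.3),
}


def make_via_kwargs(knn=30, jac_std_global=0.15,
                    cluster_graph_pruning_std=0.15, too_big_factor=0.3):
    """Hypothetical helper: collect starting parameters and reject
    values outside the ranges quoted in the thread.

    The defaults are illustrative picks inside those ranges, not
    recommendations from the pyVIA authors.
    """
    params = {
        "jac_std_global": jac_std_global,
        "cluster_graph_pruning_std": cluster_graph_pruning_std,
        "too_big_factor": too_big_factor,
    }
    for name, (low, high) in SUGGESTED_RANGES.items():
        if not low <= params[name] <= high:
            raise ValueError(f"{name}={params[name]} is outside [{low}, {high}]")
    if knn < 1:
        raise ValueError("knn must be a positive integer")
    params["knn"] = knn
    return params


# A finer-grained starting point: a smaller too_big_factor breaks up large clusters.
kwargs = make_via_kwargs(knn=20, too_big_factor=0.1)
print(kwargs)

# Assumed usage, not checked against a specific pyVIA release:
# import pyVIA.core as via
# v0 = via.VIA(data, true_label, **kwargs)
# v0.run_VIA()
```

Keeping the range check separate from the VIA call means a parameter sweep fails early on an out-of-range value instead of partway through a long run on a large cytometry matrix.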

ShobiStassen commented 2 years ago

@luglilab Just wanted to ask if you were able to use the Readthedocs tutorial? Cheers, Shobi

sinnamone commented 2 years ago

Dear @ShobiStassen,

I'm taking a look at the tutorial linked in your message above, ignoring the time series.

Before PCA, the dimensions of the matrix are usually [10,000 to 1 million rows] x [fewer than 30 columns].
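
For a matrix of this shape, the PCA advice earlier in the thread can be sketched as a simple check. The helper `plan_preprocessing` is hypothetical, and the rule it applies (only consider PCA when there are more markers than the roughly 30 components that would be kept) is an assumption drawn from that advice, not part of pyVIA.

```python
def plan_preprocessing(n_cells, n_markers, n_pcs=30):
    """Hypothetical helper: decide whether to run PCA before VIA.

    Assumption: PCA is only worth considering when there are more
    markers than the number of components we would keep.
    """
    use_pca = n_markers > n_pcs
    return {
        "n_cells": n_cells,
        "n_markers": n_markers,
        "use_pca": use_pca,
        # Without PCA, the marker columns are passed through unchanged.
        "n_components": n_pcs if use_pca else n_markers,
    }


# Typical shape from this comment: up to 1 million cells, fewer than 30 markers.
plan = plan_preprocessing(n_cells=1_000_000, n_markers=25)
print(plan["use_pca"], plan["n_components"])
```

Under this rule, a panel with fewer than 30 markers always skips PCA and VIA would receive the marker matrix directly.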

As you suggested, I switched from PARC to pyVIA. If you could take a look, the "runvia" method in https://github.com/luglilab/Cytophenograph/blob/master/PhenoFunctions_v5.py is where I put the execution and the parameters. KNN and Resolution should be set by the user, while the others are fixed.

I'm now running tests on different high-dimensional cytometry datasets (small, medium, big) to understand whether tuning the parameters could improve the results.

ShobiStassen / VIA

Tutorial for high dimensional cytometry data #17