Is there a way we can add in our own reference as training data

Sudheshna30 commented 4 days ago

Description of feature

Adding our own reference would be a great way to run this pipeline.

marcovarrone commented 1 day ago

Hi @Sudheshna30, what do you mean exactly by using our own reference?

Do you mean for generating the embedding or for clustering samples? For the first one you can simply train your own scVI or trVAE model using the official tutorials of the packages.

To fit the clustering model on one dataset and then cluster a different dataset, you can use tl.Cluster.fit on the first dataset and then tl.Cluster.predict on the other one.

I hope I understood the question, let me know if you meant something else :)

Sudheshna30 commented 15 hours ago

Thank you for responding Marco! I really appreciate it!

I'm interested in the second option: fitting the clustering model on a reference dataset and applying that knowledge to our actual dataset. We tried CellCharter on our pancreatic CosMx dataset and didn't see good clustering results, so I'm looking into whether we can train the model on a reference dataset and use that to cluster the original dataset. Can you help me with an example of how to apply tl.Cluster.fit and tl.Cluster.predict https://cellcharter.readthedocs.io/en/latest/generated/cellcharter.tl.Cluster.html#cellcharter.tl.Cluster.predict ?

Best

marcovarrone commented 4 hours ago

Hi @Sudheshna30. If I can ask, what was not good in your results for the pancreatic CosMx? You are actually the second person who told me that CellCharter didn't work so well on pancreatic tissue, so I am curious about whether there is something specific in the tissue structure that requires different parameters for CellCharter. If you want to show me some images to better understand the problem you can send me an email at marco.varrone@unil.ch.

Regarding fit and predict, you can look at the CosMx tutorial. There I used them on the same dataset, but nothing prevents you from processing the two datasets in the same way and using fit on the reference dataset and predict on your dataset. In the tutorial I used ClusterAutoK rather than Cluster to estimate the best number of clusters (but it requires more runtime, so if you are just exploring I would suggest using Cluster).
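As a rough illustration of the Cluster vs. ClusterAutoK choice, here is a minimal sketch. The helper name `make_clusterer` and the values `n_clusters=10`, `k_range=(2, 10)` and `max_runs=10` are assumptions for illustration only, not recommended settings; check the constructor arguments against the CellCharter documentation for your installed version.

```python
def make_clusterer(explore=True, n_clusters=10, k_range=(2, 10), max_runs=10):
    """Return a CellCharter clustering model (hypothetical helper).

    explore=True  -> cc.tl.Cluster with a fixed number of clusters (faster).
    explore=False -> cc.tl.ClusterAutoK, which fits repeatedly over a range
                     of cluster numbers to estimate the best one (slower).
    """
    # Imported lazily so that defining the helper does not require the package.
    import cellcharter as cc

    if explore:
        return cc.tl.Cluster(n_clusters=n_clusters, random_state=12345)
    return cc.tl.ClusterAutoK(n_clusters=k_range, max_runs=max_runs)
```

Either object is then used the same way: `fit(adata_ref, use_rep=...)` on the reference and `predict(adata_query, use_rep=...)` on the other dataset.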

So basically what you would do is:

1. Compute the spatial neighbors for both datasets
2. Train a scVI model on the reference dataset and extract the features for both datasets
3. Run cc.tl.Cluster.fit on the reference dataset
4. Run cc.tl.Cluster.predict on your dataset
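The steps above could look roughly like the sketch below. It is not taken from the tutorial verbatim, and several things are assumptions: the function names `common_genes` and `cluster_query_from_reference`, an `obs["sample"]` column (categorical) identifying each sample, raw counts stored in `layers["counts"]`, coordinates in `obsm["spatial"]`, and the values `n_clusters=10` and `n_layers=3`. The neighborhood aggregation step (`cc.gr.aggregate_neighbors`) is not in the list above; it is included here on the assumption that the pipeline matches the one in the CosMx tutorial.

```python
def common_genes(ref_genes, query_genes):
    """Genes present in both datasets, kept in reference order."""
    query_set = set(query_genes)
    return [g for g in ref_genes if g in query_set]


def cluster_query_from_reference(adata_ref, adata_query, n_clusters=10, sample_key="sample"):
    """Fit on the reference, predict on the query (hypothetical helper)."""
    # Heavy imports are kept local so common_genes works without them.
    import scvi
    import squidpy as sq
    import cellcharter as cc

    # Both datasets must share the same genes for the scVI model to apply.
    genes = common_genes(adata_ref.var_names, adata_query.var_names)
    ref = adata_ref[:, genes].copy()
    query = adata_query[:, genes].copy()

    # 1. Spatial neighbors for both datasets, built separately per sample.
    for adata in (ref, query):
        sq.gr.spatial_neighbors(adata, library_key=sample_key, coord_type="generic", delaunay=True)

    # 2. Train scVI on the reference only, then embed both datasets.
    #    No batch_key is used, so this assumes no strong batch effects.
    scvi.model.SCVI.setup_anndata(ref, layer="counts")
    model = scvi.model.SCVI(ref)
    model.train()
    ref.obsm["X_scVI"] = model.get_latent_representation(ref)
    query.obsm["X_scVI"] = model.get_latent_representation(query)

    # Assumed extra step (as in the tutorial): aggregate neighborhood features.
    for adata in (ref, query):
        cc.gr.aggregate_neighbors(
            adata, n_layers=3, use_rep="X_scVI", out_key="X_cellcharter", sample_key=sample_key
        )

    # 3. Fit the clustering model on the reference dataset.
    clusterer = cc.tl.Cluster(n_clusters=n_clusters, random_state=12345)
    clusterer.fit(ref, use_rep="X_cellcharter")

    # 4. Predict cluster labels for the query dataset.
    query.obs["cluster_cellcharter"] = clusterer.predict(query, use_rep="X_cellcharter")
    return query
```

The key design point is that every preprocessing step is applied identically to both datasets, and only `fit` is restricted to the reference.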

However, this assumes that there are no strong batch effects between the reference dataset and your dataset; otherwise the features from scVI trained on the reference dataset will not work well for your dataset. If there are batch effects, you may want to concatenate the two datasets, set adata.obs['dataset'] to the dataset each cell belongs to, and then train a scVI model on both datasets together using batch_key='dataset'. Then run cc.tl.Cluster.fit on both datasets together and cc.tl.Cluster.predict on your dataset.
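A minimal sketch of this batch-aware variant follows. The function name `cluster_with_joint_embedding`, the dataset labels `"reference"` and `"query"`, the `obs["sample"]` column, the `layers["counts"]` layer, and the values `n_clusters=10` and `n_layers=3` are all assumptions. It also assumes that sample names are unique across the two datasets and, as in the CosMx tutorial, that features are aggregated with `cc.gr.aggregate_neighbors` before clustering.

```python
def cluster_with_joint_embedding(
    adata_ref, adata_query, n_clusters=10, sample_key="sample", batch_key="dataset"
):
    """Joint scVI embedding with batch correction, then fit and predict (hypothetical helper)."""
    # Heavy imports are kept local to the function.
    import anndata as ad
    import scvi
    import squidpy as sq
    import cellcharter as cc

    # Concatenate; label= writes "reference"/"query" into obs[batch_key],
    # and join="inner" keeps only the genes shared by both datasets.
    adata = ad.concat(
        [adata_ref, adata_query],
        label=batch_key,
        keys=["reference", "query"],
        join="inner",
        index_unique="-",
    )
    adata.obs[sample_key] = adata.obs[sample_key].astype("category")

    # Spatial neighbors are built per sample, so cells from the two
    # datasets are never connected to each other.
    sq.gr.spatial_neighbors(adata, library_key=sample_key, coord_type="generic", delaunay=True)

    # Train scVI on both datasets together, correcting for the dataset of origin.
    scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key=batch_key)
    model = scvi.model.SCVI(adata)
    model.train()
    adata.obsm["X_scVI"] = model.get_latent_representation()

    # Assumed extra step (as in the tutorial): aggregate neighborhood features.
    cc.gr.aggregate_neighbors(
        adata, n_layers=3, use_rep="X_scVI", out_key="X_cellcharter", sample_key=sample_key
    )

    # Fit on both datasets together, then predict on the query cells only.
    clusterer = cc.tl.Cluster(n_clusters=n_clusters, random_state=12345)
    clusterer.fit(adata, use_rep="X_cellcharter")
    query = adata[adata.obs[batch_key] == "query"].copy()
    query.obs["cluster_cellcharter"] = clusterer.predict(query, use_rep="X_cellcharter")
    return query
```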

It may be a bit of work and will not necessarily help much unless the reference dataset is quite similar to your dataset. So, as I mentioned at the beginning, I suggest you share with me why you think the results are not good, so that we can figure out together how to improve them instead of using a reference dataset.

CSOgroup / cellcharter

Is there a way we can add in our own reference as training data #46
