Open FGQ-FGQ opened 2 weeks ago
scEmbed is excellent work that provides a dimensionality-reduction encoding for scATAC-seq data. When I tried to use it to map my data, I found that it took an extremely long time to run model.encode(adata). Could the author provide the .gtok file of the showcase dataset, to help people in the community who are interested in this work try the model? Thanks a lot!
Hi @FGQ-FGQ ! Thanks for opening up an issue. Glad you are finding it potentially useful. A few questions:
Could the author provide the .gtok file of the showcase dataset
Which showcase dataset are you interested in? Is it the Luecken2021 dataset? I might need to create a new function for the model to generate embeddings from .gtok files.
Alternatively, I could provide a .npy file containing a matrix of dimension (n_cells, 100), which you should be able to import like this:
import numpy as np
import scanpy as sc

# Load the showcase AnnData and the precomputed embedding matrix
showcase_adata = sc.read_h5ad("path/to/showcase.h5ad")
embeddings = np.load("path/to/showcase_embeddings.npy")

# Attach the (n_cells, 100) embeddings to the AnnData for downstream use
showcase_adata.obsm["scembed"] = embeddings
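From there you can feed that representation straight into the usual scanpy workflow (standard scanpy API, nothing scEmbed-specific):

import scanpy as sc

# Build the neighbor graph on the scEmbed representation instead of PCA
sc.pp.neighbors(showcase_adata, use_rep="scembed")

# Cluster and project to 2D for visualization
sc.tl.leiden(showcase_adata)
sc.tl.umap(showcase_adata)
sc.pl.umap(showcase_adata, color="leiden")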
Finally... Indeed, tokenization of datasets can take a bit of time, so I think it is a good idea to be able to "pre-tokenize" the data and enable sharing and embedding generation that way. If you run this:
model.tokenizer.verbose = True
It should give you some tokenization progress.
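To sketch what that pre-tokenize-and-cache workflow could look like end to end (heads up: the tokenize call below and the use of np.save in place of the .gtok format are my assumptions for illustration, not the actual geniml API):

import numpy as np
import scanpy as sc
from geniml.scembed import ScEmbed

model = ScEmbed("databio/r2v-luecken2021-hg38-v2")
adata = sc.read_h5ad("path/to/your.h5ad")

# Assumed API: tokenize once up front; the real method name may differ.
tokens = model.tokenizer.tokenize(adata)

# Cache the per-cell token IDs (np.save is a stand-in for .gtok here).
np.save("path/to/tokens.npy", np.asarray(tokens, dtype=object), allow_pickle=True)

# Later: reload the cache instead of re-tokenizing.
tokens = np.load("path/to/tokens.npy", allow_pickle=True)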
Let me know if you have any other questions!
Thank you for your reply! I have read the manuscript published by you and your colleagues, but I'm not sure if I fully understand it. Please kindly correct me if I am wrong: After pretraining on a reference dataset, any new dataset can generate new cell embeddings by finding overlapping regions. Does this mean that I only need the tokens from your pretrained dataset?
Additionally, one small question: Is it possible to use BERT to model sc-ATAC-seq data?
After pretraining on a reference dataset, any new dataset can generate new cell embeddings by finding overlapping regions.
Yes, correct. This assumes that the new dataset is from the same organism and is aligned to the same reference genome. The idea is that you can take a nice reference set, train a model, and use it to generate embeddings of new data. This is useful for cell-type annotation, clustering, and reference mapping! We have a few models we've trained on Hugging Face that might be useful. A notable one is the luecken2021 model, which was trained on a well-annotated accessibility profile of bone-marrow cells. I believe you can use it as such:
from umap import UMAP
from geniml.scembed import ScEmbed

# Load the pre-trained model from Hugging Face
model = ScEmbed("databio/r2v-luecken2021-hg38-v2")

# Generate per-cell embeddings for your new dataset
embeddings = model.encode("path/to/new.h5ad")

# Reduce the embeddings to 2D for visualization
umap = UMAP(n_components=2)
umap_embeddings = umap.fit_transform(embeddings)
# plot using your favorite plotting library
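And if you'd rather plot via scanpy, you can stash the coordinates back on the AnnData (here I'm assuming you load path/to/new.h5ad yourself; the obsm key names are arbitrary):

import scanpy as sc

new_adata = sc.read_h5ad("path/to/new.h5ad")

# Store the model embeddings and their 2D projection on the AnnData
new_adata.obsm["scembed"] = embeddings
new_adata.obsm["X_scembed_umap"] = umap_embeddings

# scanpy looks up obsm["X_<basis>"] when plotting a named basis
sc.pl.embedding(new_adata, basis="scembed_umap")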
Does this mean that I only need the tokens from your pretrained dataset?
You only need the tokens if you wish to skip tokenization for the reference set. But I think you may be looking for a pre-trained model, not the pre-tokenized data. Let me know if that's the case!
Additionally, one small question: Is it possible to use BERT to model sc-ATAC-seq data?
Great question. Yes, we have already made much progress on that model and hope to publish it soon, so stay tuned 😀
Keep the questions coming! It's useful to get feedback from the community; I'll try my best to help out!
How do you define a "nice" reference set? With a sufficiently large dataset, it could indeed become a highly representative and broadly distributed set. However, it's important to note that scATAC-seq data is highly sparse, and each cell's accessible regions rarely fully overlap, even in closely related areas. Tokenizing the entire genome is a significant challenge; perhaps it requires binning or other smarter approaches, as in the sketch below.
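To make concrete what I mean by binning, here is a toy sketch (not any particular tool's implementation): assign every peak to the fixed-width genomic bins it overlaps, so the vocabulary is finite and shared across cells.

# Toy sketch of fixed-width binning as a tokenization scheme: every peak
# maps to the 5 kb bins it overlaps, so near-identical peaks from
# different cells share tokens even without exact coordinate agreement.
BIN_SIZE = 5_000

def peak_to_bins(chrom: str, start: int, end: int) -> list[str]:
    first = start // BIN_SIZE
    last = (end - 1) // BIN_SIZE
    return [f"{chrom}:{i * BIN_SIZE}-{(i + 1) * BIN_SIZE}" for i in range(first, last + 1)]

print(peak_to_bins("chr1", 9_800, 12_400))
# ['chr1:5000-10000', 'chr1:10000-15000']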
I'm very much looking forward to seeing your work, as modeling the intrinsic correlations of chromatin accessibility using deep learning on large datasets could be incredibly impactful. I’ll keep an eye on your latest updates. Best of luck with everything!
However, it’s important to note that scATAC-seq data is highly sparse, and each cell’s accessible regions rarely fully overlap, even in closely related areas.
Yes, this is true! The tokenization procedure takes this into account and is quite flexible: two open chromatin regions from two cells only need to partially overlap to be considered "the same" for the purposes of modeling.
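As a toy illustration of that idea (not the actual geniml code): a peak gets the token of any universe region it overlaps at all, so exact boundary agreement is never required.

# Toy illustration of overlap-based tokenization (not geniml's real
# implementation): a peak is assigned the token of the first universe
# region it partially overlaps.
universe = [("chr1", 100, 500), ("chr1", 800, 1200)]  # reference regions = vocabulary

def assign_token(chrom, start, end):
    for token_id, (u_chrom, u_start, u_end) in enumerate(universe):
        if chrom == u_chrom and start < u_end and end > u_start:
            return token_id  # any partial overlap is enough
    return None  # peak falls outside the universe

# Two shifted peaks from different cells still map to the same token:
print(assign_token("chr1", 90, 350), assign_token("chr1", 300, 600))  # -> 0 0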
Thanks so much for the feedback; let me know if you have any other questions!