biomap-research / scFoundation

Apache License 2.0

scFoundation gene embeddings for GEARS #35

Open rvinas opened 3 months ago

rvinas commented 3 months ago

Hello, thank you for your work and the code. I am trying to understand how the scFoundation embeddings were used within the GEARS framework. In the paper, you mention:

(...) In our method, we obtained gene context embeddings for each cell from the scFoundation decoder and set these embeddings as the nodes in the graph (Methods), resulting in a cell-specific gene co-expression graph for predicting perturbations.

How was the cell-specific gene co-expression graph constructed exactly? I was examining your code and I believe this happens here. Could you clarify what the variable pre_in represents? Am I correct in thinking that the GEARS data loader provides the expression of perturbed single-cells in data.x? My understanding from your paper is that the scFoundation embeddings are extracted using control cells only.

Your help would be greatly appreciated!

WhirlFirst commented 3 months ago

Hi, for details on how the gene co-expression graph is constructed, please also read the original GEARS paper (https://www.nature.com/articles/s41587-023-01905-6). pre_in represents the unperturbed cells. We used the same dataloader as GEARS. What we did was replace the randomly initialized gene embeddings of the original GEARS model with the contextual gene embeddings from our model. The edges in the gene co-expression graph remain unchanged.
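
The substitution described in this reply can be sketched roughly as follows. This is a toy illustration in plain Python, not the repository's code: `contextual_embed` is a hypothetical stand-in for the scFoundation decoder, and the gene list, edge list, and dictionary keys are assumptions made for the example. The point is only that the edges are shared across cells, while the node features go from one fixed vector per gene to vectors that depend on the control cell's expression.

```python
import random

GENES = ["G0", "G1", "G2", "G3"]   # hypothetical gene names
EMB_DIM = 3

# Co-expression edges as pairs of gene indices; identical for every cell.
EDGES = [(0, 1), (1, 2), (2, 3)]


def random_init_embeddings(n_genes, dim, seed=0):
    """One randomly initialized vector per gene, shared by all cells."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_genes)]


def contextual_embed(expression, dim):
    """Hypothetical stand-in for the scFoundation decoder.

    Returns one vector per gene that depends on the whole expression
    profile of the cell (here, trivially, via the cell's total counts).
    """
    total = sum(expression) or 1.0
    return [[(x / total) * (k + 1) for k in range(dim)] for x in expression]


def build_cell_graph(node_features, edges):
    """Bundle node features with the (unchanged) edge list."""
    return {"x": node_features, "edge_index": edges}


static = random_init_embeddings(len(GENES), EMB_DIM)
control_a = [1.0, 0.0, 3.0, 2.0]   # toy control-cell expression
control_b = [0.0, 4.0, 1.0, 1.0]

# Static variant: both cells get the same node features.
graph_static_a = build_cell_graph(static, EDGES)
graph_static_b = build_cell_graph(static, EDGES)

# Contextual variant: node features differ per cell, edges do not.
graph_ctx_a = build_cell_graph(contextual_embed(control_a, EMB_DIM), EDGES)
graph_ctx_b = build_cell_graph(contextual_embed(control_b, EMB_DIM), EDGES)
```

In this sketch the two static graphs are identical, whereas the two contextual graphs share `edge_index` but carry different `x`, which is what makes the graph "cell-specific" in the sense quoted from the paper.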

rvinas commented 3 months ago

Thank you for the clarification! I now understand why pre_in represents the unperturbed cells. In the create_cell_graph_dataset function, control cells are sampled at random and their expression is then stored in data.x. Do you have any intuition on why the contextual gene embeddings from scFoundation are helpful for that task, considering that control cells are sampled at random? I wonder why the contextual aspect is important, given that the sampled control cell is unrelated to the perturbed cell.
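
To make the pairing concrete, here is a minimal sketch of the behaviour described in this comment, assuming toy data. It is not the actual `create_cell_graph_dataset` implementation: the function name `pair_with_random_controls`, the `pert` and `y` keys, and the perturbation labels are hypothetical. Only the idea that a randomly drawn control cell's expression ends up in `x` comes from the thread.

```python
import random


def pair_with_random_controls(perturbed_cells, control_cells, seed=0):
    """Pair each perturbed cell with a control cell drawn uniformly at random.

    perturbed_cells: list of (perturbation_label, expression_vector)
    control_cells:   list of expression vectors from unperturbed cells
    """
    rng = random.Random(seed)
    samples = []
    for pert_label, pert_expr in perturbed_cells:
        ctrl_expr = rng.choice(control_cells)  # unrelated to this perturbed cell
        samples.append({
            "x": ctrl_expr,      # model input: control expression
            "pert": pert_label,  # which perturbation to predict
            "y": pert_expr,      # target: observed perturbed expression
        })
    return samples


control_cells = [[1.0, 0.0, 3.0], [0.5, 2.0, 1.0], [2.0, 2.0, 0.0]]
perturbed_cells = [
    ("geneA", [1.2, 0.1, 2.5]),
    ("geneB", [0.4, 2.2, 0.0]),
]

samples = pair_with_random_controls(perturbed_cells, control_cells)
```

Because the control cell is drawn independently of the perturbed cell, any contextual embedding computed from `x` reflects the sampled control, not the specific cell whose response is the target.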

WhirlFirst commented 3 months ago

Happy to hear that you figured out the code. As for the contextual embeddings, I think they offer a more flexible input for the model. This variety in the input may make it easier for the model to learn the input distribution and predict the results well. Also, the contextual embeddings contain more information about gene expression levels than the randomly initialized ones, which is another gain for prediction.

rvinas commented 3 months ago

I see, thank you for your insights. Did you try conditioning the GEARS model on your learnt, non-contextual gene embeddings (i.e. the gene name embeddings)? In other words, can the performance gain be explained by the quality of the gene embeddings rather than by the contextual aspect? I am still unsure why it is helpful to condition the model on random control cells.